Macs in Chemistry

Insanely great science

 

Several ways of scripting Name to Structure

Too often I come across datasets that contain chemical names or identifiers but no actual molecular structures. Recently Dan at Dotmatics suggested I look at OPSIN. There are several tools for converting names to structures; I've highlighted a few options here and described scripts that allow them to be used from within Vortex.

OPSIN

OPSIN is a Java (1.6+) library for IUPAC name-to-structure conversion, offering high recall and precision on organic chemical nomenclature. Supported outputs are SMILES, CML (Chemical Markup Language) and InChI (IUPAC International Chemical Identifier). The latest version can be downloaded here: OPSIN-2.2.0-jar-with-dependencies.jar. To access it from within Vortex you need to put the jar file in the folder

/Users/USERNAME/vortex/libs/

where USERNAME is your username.

OPSIN can be called from the command line using

java -jar OPSIN-2.2.0-jar-with-dependencies.jar -osmi input.txt output.txt

where input.txt contains a series of chemical names, one per line. Alternatively, an individual chemical name can be converted from Java code:

NameToStructure nts = NameToStructure.getInstance();
String smiles = nts.parseToSmiles("acetonitrile");

We can use the latter in the Vortex script shown below. OPSIN was designed to support IUPAC names, but an increasing number of trivial (but widely used) chemical names and synonyms are also supported. The file containing the chemical names looks like this, with each chemical name on its own line in plain text.

Name
iodobenzene
2,15-dimethyl-14-(1,5-dimethylhexyl)tetracyclo[8.7.0.02,7.011,15]heptadec-7-en-5-ol
acetone
quinuclidine
1-Azabicyclo[2.2.2]octane
2-Methyl-1,3,5-trinitrobenzene
5-{2-Ethoxy-5-[(4-methylpiperazin-1-yl)sulfonyl]phenyl}-1-methyl-3-propyl-1H,6H,7H-pyrazolo[4,3-d]pyrimidin-7-one
Ethyl Magnesium Bromide
Lithium Bromide
Anisole
Phenylalanine
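
For batch conversion outside Vortex, a list of names like the one above can be fed to the command-line form of OPSIN shown earlier. The following is a minimal sketch, assuming the jar sits in the current directory and java is on the PATH. The helper names write_names, opsin_command and run_opsin are made up for illustration. Note that the header row ("Name") is left out, since the command line treats every line as a chemical name.

```python
import os
import subprocess
import tempfile


def write_names(names, path):
    """Write one chemical name per line, with no header row."""
    with open(path, "w") as handle:
        for name in names:
            handle.write(name.strip() + "\n")


def opsin_command(jar_path, input_path, output_path):
    """Build the argument list for the OPSIN call shown above (-osmi = SMILES output)."""
    return ["java", "-jar", jar_path, "-osmi", input_path, output_path]


def run_opsin(command):
    """Run the conversion; this needs Java and the OPSIN jar to be present."""
    subprocess.check_call(command)


names = ["iodobenzene", "quinuclidine", "2-Methyl-1,3,5-trinitrobenzene"]
workdir = tempfile.mkdtemp()
input_path = os.path.join(workdir, "input.txt")
output_path = os.path.join(workdir, "output.txt")

write_names(names, input_path)
command = opsin_command("OPSIN-2.2.0-jar-with-dependencies.jar", input_path, output_path)
print(" ".join(command))
```

The sketch only prints the command. Calling run_opsin(command) would perform the conversion and write the results to output.txt.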

When importing the file into Vortex it is important NOT to use comma as the delimiter or it will break up the chemical names that contain a comma (I used tab).

[Screenshot: Vortex import dialog]
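
The effect of the delimiter choice is easy to reproduce outside Vortex. This is a small sketch using Python's csv module on a hypothetical two-column file (a name plus a made-up ID column), showing how a comma delimiter splits a name that itself contains commas.

```python
import csv

# Two versions of a hypothetical two-column file: name, then a made-up ID.
tab_text = "Name\tID\n2-Methyl-1,3,5-trinitrobenzene\t1\n"
comma_text = "Name,ID\n2-Methyl-1,3,5-trinitrobenzene,1\n"

tab_rows = list(csv.reader(tab_text.splitlines(), delimiter="\t"))
comma_rows = list(csv.reader(comma_text.splitlines(), delimiter=","))

print(tab_rows[1])    # the name stays in one field
print(comma_rows[1])  # the name is split at each comma inside it
```

With the tab delimiter the data row has two fields. With the comma delimiter it has four, and none of them is a valid chemical name.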

Once the file containing the chemical names has been imported it should look like this.

[Screenshot: Vortex workspace after importing the names]

After running the script it should look like this, where Vortex has automatically rendered the SMILES strings as structures.

[Screenshot: Vortex workspace with the SMILES column rendered as structures]

The first part of the script imports OPSIN and then creates a dialog box that lets the user identify the column containing the chemical names. The script then loops through the rows in the workspace, generates a SMILES string from the chemical name in each row and puts it into a new column called SMILES.

The Vortex Script

# Python imports
import sys

# Make the OPSIN jar visible to Jython; edit USERNAME to include your username
sys.path.append("/Users/USERNAME/vortex/libs/OPSIN-2.2.0-jar-with-dependencies.jar")

# OPSIN import (the Java package name is lower case and case sensitive)
from uk.ac.cam.ch.wwmm.opsin import NameToStructure
nts = NameToStructure.getInstance()

input_label = swing.JLabel("Name column (for input)")
input_cb = workspace.getColumnComboBox()
panel = swing.JPanel()

layout.fill(panel, input_label, 0, 0)
layout.fill(panel, input_cb,    1, 0)

ret = vortex.showInDialog(panel, "Choose Chemical Name Column")

if ret == vortex.OK:
    input_idx = input_cb.getSelectedIndex()

    if input_idx == 0:
        vortex.alert("you must choose a column")
    else:
        col = vtable.getColumn(input_idx - 1)

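        # Get the SMILES output column once, before the loop
        colsmi = vtable.findColumnWithName('SMILES', 1)

        rows = vtable.getRealRowCount()
        for r in range(0, int(rows)):
            drugName = col.getValueAsString(r)
            # parseToSmiles returns None when OPSIN cannot interpret the name
            mySMILES = nts.parseToSmiles(drugName)
            if mySMILES is not None:
                colsmi.setValueFromString(r, mySMILES)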

vtable.fireTableStructureChanged()

The Vortex script can be downloaded here DrugName2SMILESOPSIN.vpy
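
OPSIN's parseToSmiles returns null (None in Jython) when it cannot interpret a name, so it can be useful to record which rows failed. The sketch below shows that bookkeeping in plain Python. Here fake_parse is a stand-in for nts.parseToSmiles with a hard-coded lookup so the example runs without OPSIN, and convert_names is a hypothetical helper, not part of Vortex or OPSIN.

```python
def convert_names(names, parse):
    """Return (smiles_by_row, failed_rows) for a list of names.

    parse is any callable that returns a SMILES string, or None on failure.
    """
    smiles_by_row = {}
    failed_rows = []
    for row, name in enumerate(names):
        smiles = parse(name)
        if smiles is None:
            failed_rows.append((row, name))
        else:
            smiles_by_row[row] = smiles
    return smiles_by_row, failed_rows


# Stand-in for nts.parseToSmiles so the sketch runs without OPSIN.
KNOWN = {"acetone": "CC(C)=O", "acetonitrile": "CC#N"}


def fake_parse(name):
    return KNOWN.get(name.lower())


smiles_by_row, failed_rows = convert_names(
    ["Acetone", "not a chemical", "acetonitrile"], fake_parse
)
print(smiles_by_row)
print(failed_rows)
```

In a Vortex script the same pattern would pass nts.parseToSmiles as the parse argument and write each entry of smiles_by_row into the SMILES column.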

Chemical Identifier Resolver

The Chemical Identifier Resolver (CIR) by the CADD Group at the NCI/NIH is a web service that performs various chemical name to structure conversions. It works as a resolver for different chemical structure identifiers, converting a given structure identifier into another representation or structure identifier. It can help you find the chemical structure if you have an identifier such as an InChIKey or CAS Number. Full documentation is available on the CIR site. You can either use the resolver web form or call the service with the following simple URL pattern:

http://cactus.nci.nih.gov/chemical/structure/"structure identifier"/"representation"

Example: Chemical name to SMILES:

http://cactus.nci.nih.gov/chemical/structure/aspirin/smiles

The input identifier can be a chemical name, SMILES, CAS Number, InChI, etc., and the returned representation can be SMILES, sdf, png, etc.

Chemical names are resolved by a database lookup into a full structure representation. The service currently has approximately 68 million chemical names linked to approximately 16 million unique structure records. The set of available names includes trivial names, synonyms, systematic names, registry numbers, etc.

Much of the script is similar to the one using OPSIN; the difference is that this time we construct the URL for the web service:

mystr = "http://cactus.nci.nih.gov/chemical/structure/" + encoded_name + "/smiles"

We encode the drugName to ensure that special characters will not break the URL.

encoded_name = urllib.quote(drugName)

The SMILES returned is then added to the workspace.
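
The URL construction and encoding described above can be tried on their own, without contacting the service. This is a minimal sketch; cir_url is a hypothetical helper name, and the import is written so that it works both in Python 3 and in the Python 2 / Jython environment used by the Vortex scripts.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2 / Jython, as in the Vortex script

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure/"


def cir_url(identifier, representation="smiles"):
    """Build a CIR request URL, percent-encoding the identifier."""
    return CIR_BASE + quote(identifier) + "/" + representation


print(cir_url("aspirin"))
print(cir_url("1-Azabicyclo[2.2.2]octane"))
print(cir_url("Ethyl Magnesium Bromide"))
```

Square brackets and spaces in the names are replaced by percent escapes (%5B, %5D and %20), so the resulting URL contains no characters that would break the request.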

The Vortex Script

# Python imports
import urllib2
import urllib

# Vortex imports
import com.dotmatics.vortex.util.Util as Util
import com.dotmatics.vortex.mol2img.jni.genImage as genImage
import com.dotmatics.vortex.mol2img.Mol2Img as mol2Img
import jarray
import binascii
import string
import os

input_label = swing.JLabel("Name column (for input)")
input_cb = workspace.getColumnComboBox()
panel = swing.JPanel()

layout.fill(panel, input_label, 0, 0)
layout.fill(panel, input_cb,    1, 0)

ret = vortex.showInDialog(panel, "Choose Drug Name Column")

if ret == vortex.OK:
    input_idx = input_cb.getSelectedIndex()

    if input_idx == 0:
        vortex.alert("you must choose a column")
    else:
        col = vtable.getColumn(input_idx - 1)

        # "http://cactus.nci.nih.gov/chemical/structure/" & the_encode_text & "/smiles"

        # Get the SMILES output column once, before the loop
        colsmi = vtable.findColumnWithName('SMILES', 1)

        rows = vtable.getRealRowCount()
        for r in range(0, int(rows)):
            drugName = col.getValueAsString(r)
            encoded_name = urllib.quote(drugName)
            mystr = "http://cactus.nci.nih.gov/chemical/structure/" + encoded_name + "/smiles"
            try:
                myreturn = urllib2.urlopen(mystr).read()
            except urllib2.HTTPError:
                # Some names are not found; leave this row empty
                continue

            # Must stay inside the loop so that every resolved row is written
            colsmi.setValueFromString(r, myreturn)



vtable.fireTableStructureChanged()

Using CIR is significantly slower, but it is better able to assign structures to trade names etc.

The Vortex script can be downloaded here DrugName2SMILEScir.vpy

ChemSpider

ChemSpider is a free chemical structure database providing fast access to over 58 million structures, properties, and associated information. There are also a series of web services that provide access to the data.

The ChemSpider webservices are a powerful suite of tools that provide access to many of the commonly used features of ChemSpider through Application Programming Interfaces (APIs). The webservices make it possible to enrich your Apps, your website, your in-house data systems and data workflow tools.

To access this web service you will need to register and obtain a security token. Registration also gives you access to a wide range of web services covering structure and spectra searching, and generic conversion between chemical file formats. If you are a Python user you should also look at ChemSpiPy. If you use pip then installation is straightforward.

pip install chemspipy

For many tasks that you might want to perform on ChemSpider (searches etc.), there is no need to have a ChemSpider user account. However, if you want to save result sets, curate records, add data or use certain web services, then you will need to have a ChemSpider account linked to an RSC ID.

You will need to edit the downloaded script to enter your security token.

The Vortex Script

# Python imports
import httplib
import urllib2
import urllib
from xml.etree import ElementTree as etree

# Vortex imports
import com.dotmatics.vortex.util.Util as Util
import com.dotmatics.vortex.mol2img.jni.genImage as genImage
import com.dotmatics.vortex.mol2img.Mol2Img as mol2Img
import jarray
import binascii
import string
import os
import sys

input_label = swing.JLabel("Name column (for input)")
input_cb = workspace.getColumnComboBox()
panel = swing.JPanel()

layout.fill(panel, input_label, 0, 0)
layout.fill(panel, input_cb,    1, 0)

ret = vortex.showInDialog(panel, "Choose Drug Name Column")

# You need to replace with your security token
token = 'Your Security Token'

if ret == vortex.OK:
    input_idx = input_cb.getSelectedIndex()

    if input_idx == 0:
        vortex.alert("you must choose a column")
    else:
        col = vtable.getColumn(input_idx - 1)

        colsmi = vtable.findColumnWithName('SMILES', 1)
        colcsid = vtable.findColumnWithName('CSID', 1)

        rows = vtable.getRealRowCount()
        fails = []
        for r in range(0, int(rows)):
            drugName = col.getValueAsString(r)



            encoded_name = urllib.quote(drugName)
            mystr = "http://www.chemspider.com/Search.asmx/SimpleSearch?query=%s&token=%s" % (encoded_name, token)
            try:
                myreturn = urllib2.urlopen(mystr).read()
                tree = etree.fromstring(myreturn)
                csid_el = tree.find('{http://www.chemspider.com/}int')
                if csid_el is None:
                    continue
                colcsid.setValueFromString(r, csid_el.text)
                info_url = "http://www.chemspider.com/Search.asmx/GetCompoundInfo?CSID=%s&token=%s" % (csid_el.text, token)
                info_response = urllib2.urlopen(info_url)
                tree = etree.parse(info_response)
                smiles = tree.getroot().find('{http://www.chemspider.com/}SMILES').text
                #colsmi = vtable.findColumnWithName('SMILES', 1)
                colsmi.setValueFromString(r, smiles)
            except (urllib2.URLError, httplib.HTTPException):
                # urllib2.HTTPError is a subclass of URLError, so it is caught here too
                #fails.append(drugName)
                continue

vtable.fireTableStructureChanged()

The Vortex script can be downloaded here DrugName2SMILESchemspider.vpy
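
The script picks the first ChemSpider ID out of the XML returned by SimpleSearch using a namespace-qualified tag. The sketch below shows that lookup on hand-written responses. The XML is an illustrative assumption modeled on the tag the script searches for, not a captured reply from the service, and the IDs in it are made up. The helper name first_csid is also hypothetical.

```python
from xml.etree import ElementTree as etree

NS = "{http://www.chemspider.com/}"

# Hand-written stand-ins for a SimpleSearch reply (assumed shape, made-up IDs).
sample_hit = (
    '<ArrayOfInt xmlns="http://www.chemspider.com/">'
    "<int>1234</int><int>5678</int>"
    "</ArrayOfInt>"
)
sample_miss = '<ArrayOfInt xmlns="http://www.chemspider.com/" />'


def first_csid(xml_text):
    """Return the text of the first <int> element, or None if there is none."""
    root = etree.fromstring(xml_text)
    element = root.find(NS + "int")
    return None if element is None else element.text


print(first_csid(sample_hit))
print(first_csid(sample_miss))
```

The check uses "is None" rather than a plain truth test because an ElementTree element without children evaluates as false even when it was found.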

Comparison of scripts

As you can see from the table below, OPSIN was by far the fastest. It ran through the workspace of nearly 14,000 structures in 5 seconds; however, there were just over 1,500 names for which structures could not be assigned. ChemSpider resolved all but 161 of the names but was considerably slower. The Chemical Identifier Resolver left just under 1,000 structures unresolved and was the slowest of the three methods. It should be noted, however, that the performance of the web services will depend on network traffic and the load on the web server.

              OPSIN     CIR         ChemSpider
Time          5 sec     4h 30min    1h 50min
Unresolved    1571      939         161
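
To put the figures on a common footing, the timings can be turned into seconds per name and the unresolved counts into percentages. This is a rough sketch only: the exact number of names is not given, so TOTAL_NAMES is an assumption based on "nearly 14,000".

```python
TOTAL_NAMES = 14000  # assumption: the text says "nearly 14,000"

results = {
    "OPSIN": {"seconds": 5, "unresolved": 1571},
    "CIR": {"seconds": 4 * 3600 + 30 * 60, "unresolved": 939},
    "ChemSpider": {"seconds": 1 * 3600 + 50 * 60, "unresolved": 161},
}

summary = {}
for method, data in results.items():
    per_name = data["seconds"] / float(TOTAL_NAMES)
    resolved_pct = 100.0 * (TOTAL_NAMES - data["unresolved"]) / TOTAL_NAMES
    summary[method] = (per_name, resolved_pct)
    print("%-10s %8.4f s/name  %5.1f%% resolved" % (method, per_name, resolved_pct))
```

With that assumed total, roughly 89% of names are resolved by OPSIN, 93% by CIR and 99% by ChemSpider, while the web services average about 1.2 seconds per name for CIR and about 0.5 seconds for ChemSpider.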


I found that, depending on the time of day, the web servers sometimes stopped responding on long runs. If you are going to be looking up more than a few thousand names, I'd recommend that you split the job into chunks to avoid overloading the web server.

Change

for r in range(0, int(rows)):

to

for r in range(0, 4000):

then

for r in range(4000, 8000):

etc.
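
Rather than editing the range by hand for each run, the chunk boundaries can be generated. This is a small sketch; chunk_ranges is a hypothetical helper, and the total of 14,000 rows and chunk size of 4,000 are only example values.

```python
def chunk_ranges(total_rows, chunk_size):
    """Yield (start, stop) pairs that cover range(total_rows) in order."""
    for start in range(0, total_rows, chunk_size):
        yield start, min(start + chunk_size, total_rows)


for start, stop in chunk_ranges(14000, 4000):
    print("for r in range(%d, %d):" % (start, stop))
```

The last chunk is shortened automatically so that it stops at the final row.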


It is important to note that OPSIN converts the chemical name to a structure, whilst CIR and ChemSpider are lookup services. So whilst OPSIN will be able to convert the chemical names of any novel molecules to structures, the lookup services will only be able to provide structures that exist in their databases. However, the databases also contain synonyms, trivial names and trade names that could be used to identify a molecule; OPSIN was not designed to use these as input.
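
Because a name parser and a lookup service fail on different kinds of names, one option is to try them in sequence. The sketch below illustrates that idea only; it is not something the scripts above do. The two resolver functions are hard-coded stand-ins (not OPSIN, CIR or ChemSpider) so that the example runs offline, and resolve is a hypothetical helper.

```python
def resolve(name, resolvers):
    """Try each (label, function) pair in order and return the first hit."""
    for label, function in resolvers:
        smiles = function(name)
        if smiles:
            return label, smiles
    return None, None


# Stand-ins so the sketch runs offline: a "parser" that only knows a
# systematic name and a "lookup" that only knows a trade name.
def fake_parser(name):
    return {"2-acetoxybenzoic acid": "CC(=O)Oc1ccccc1C(=O)O"}.get(name.lower())


def fake_lookup(name):
    return {"aspirin": "CC(=O)Oc1ccccc1C(=O)O"}.get(name.lower())


resolvers = [("parser", fake_parser), ("lookup", fake_lookup)]
for name in ["2-acetoxybenzoic acid", "Aspirin", "unknownium"]:
    print(name, resolve(name, resolvers))
```

Putting the fast local parser first means the slower web lookup is only needed for the names the parser cannot handle.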

Page Updated 12 January 2016