Macs in Chemistry

Insanely great science

 

Getting UniChem data from ChEMBL

UniChem is a web resource provided by the EBI. It is a 'Unified Chemical Identifier' system, designed to assist in the rapid cross-referencing of chemical structures, and their identifiers, between multiple databases. UniChem currently contains data from 27 different data sources and provides links to 108,941,995 structures.

Chambers, J., Davies, M., Gaulton, A., Hersey, A., Velankar, S., Petryszak, R., Hastings, J., Bellis, L., McGlinchey, S. and Overington, J.P. UniChem: A Unified Chemical Structure Cross-Referencing and Identifier Tracking System. Journal of Cheminformatics 2013, 5:3 (January 2013). DOI: http://dx.doi.org/10.1186/1758-2946-5-3

Whilst I've written about a script to search using InChIKeys, it is also possible to search using compound identifiers.

ChEMBL also provides a RESTful web service that can be used to retrieve data from the UniChem database programmatically.

All RESTful queries are constructed using the following base URL:

https://www.ebi.ac.uk/unichem/rest/

Specific query URLs are then constructed by adding a method name to this base URL, followed by the input data.

The input data can be of three types:

src_compound_id (the molecule identifier)
src_id (the number for the datasource, ChEMBL is 1)
InChIKey

Since the different datasources will have different molecule identifiers for the same molecule, it is important to have both the ID and the corresponding datasource.

Since we have the ChEMBL ID, our URL will have the form

https://www.ebi.ac.uk/unichem/rest/src_compound_id/CHEMBL1089/1
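The URL construction described above can be sketched in a few lines of Python. This is only an illustration: the helper name `unichem_url` is hypothetical, and percent-encoding the input values is an added precaution rather than something the service is known to require. The import is written so that it works in both Python 3 and the Python 2 style Jython used by Vortex.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2 / Jython

BASE_URL = "https://www.ebi.ac.uk/unichem/rest/"


def unichem_url(method, *inputs):
    """Join the base URL, a method name and its input values with '/'.

    Hypothetical helper: each input value is percent-encoded so that an
    identifier containing '/' or spaces cannot break the URL path.
    """
    parts = [method] + [quote(str(value), safe="") for value in inputs]
    return BASE_URL + "/".join(parts)


# ChEMBL is datasource 1, so a ChEMBL ID lookup ends in "/1"
print(unichem_url("src_compound_id", "CHEMBL1089", 1))
```

For the ChEMBL ID used in this article, the helper produces the same URL as the one shown above.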

By default the data is returned in JSON format as a list of entries, each holding two key-value pairs: the datasource (src_id) and the compound ID in that datasource (src_compound_id).

#Data format
[{"src_id":"1","src_compound_id":"CHEMBL1089"},
{"src_id":"2","src_compound_id":"DB00780"},
{"src_id":"4","src_compound_id":"7266"},
{"src_id":"6","src_compound_id":"C07430"},
{"src_id":"7","src_compound_id":"8060"},
{"src_id":"8","src_compound_id":"SAM002589985"},
{"src_id":"10","src_compound_id":"1987170"},
{"src_id":"11","src_compound_id":"F484C6DCFFC08118224D7D07C06DD841"},
{"src_id":"14","src_compound_id":"O408N561GF"},
{"src_id":"15","src_compound_id":"SCHEMBL34335"},
{"src_id":"17","src_compound_id":"PA450903"},
{"src_id":"18","src_compound_id":"HMDB14918"},
{"src_id":"21","src_compound_id":"15297289"},
{"src_id":"22","src_compound_id":"3675"},
{"src_id":"23","src_compound_id":"MCULE-2911295500"},
{"src_id":"25","src_compound_id":"LSM-5928"},
{"src_id":"26","src_compound_id":"51-71-8"},
{"src_id":"29","src_compound_id":"J4.125D"},
{"src_id":"31","src_compound_id":"50105417"}]
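To make the shape of this response concrete, here is a minimal sketch that parses a shortened copy of the data above with Python's standard `json` module (the Vortex script uses Jyson instead). The function name `ids_by_source` is hypothetical, and collecting the IDs into lists is an assumption, made in case a datasource ever reports more than one identifier for the same structure. Note that `src_id` arrives as a string, not a number.

```python
import json

# A shortened copy of the response shown above for CHEMBL1089
response_text = """[{"src_id":"1","src_compound_id":"CHEMBL1089"},
{"src_id":"2","src_compound_id":"DB00780"},
{"src_id":"22","src_compound_id":"3675"}]"""


def ids_by_source(text):
    """Map each src_id (a string) to the list of compound IDs reported for it."""
    result = {}
    for entry in json.loads(text):
        # setdefault creates the list the first time a datasource is seen
        result.setdefault(entry["src_id"], []).append(entry["src_compound_id"])
    return result


xrefs = ids_by_source(response_text)
print(xrefs["2"])   # the DrugBank identifier(s)
print(xrefs["22"])  # the PubChem identifier(s)
```

Because the keys are strings, any lookup table keyed by datasource has to use '2' rather than 2, which is why the script below quotes its dictionary keys.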

The first part of the script asks the user to select the column containing the ChEMBL ID, then creates one column for each datasource. It then loops through the workspace, calling the web service for each ID, parsing the returned JSON and populating the workspace as shown below.

[Image: unichemworkspace]

It should be straightforward to modify the script to search any of the datasources with the appropriate list of molecule identifiers.
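As a rough sketch of what that modification involves, the lookup loop can be written independently of Vortex with the datasource number as a parameter. Everything named here (`cross_reference`, `fake_fetch`, the made-up identifier DB99999) is hypothetical, and the canned response stands in for a real HTTP call so the example runs offline.

```python
import json

URL_TEMPLATE = "https://www.ebi.ac.uk/unichem/rest/src_compound_id/%s/%s"


def cross_reference(compound_ids, src_id, fetch):
    """Return {compound_id: {src_id: src_compound_id}} for each input ID.

    fetch is any callable that takes a URL and returns the response text.
    """
    table = {}
    for compound_id in compound_ids:
        url = URL_TEMPLATE % (compound_id, src_id)
        try:
            entries = json.loads(fetch(url))
        except (IOError, ValueError):
            # Network failure or unparsable reply: leave this row empty
            table[compound_id] = {}
            continue
        table[compound_id] = dict(
            (e["src_id"], e["src_compound_id"])
            for e in entries
            if e["src_id"] != str(src_id)  # skip the source we searched from
        )
    return table


def fake_fetch(url):
    """Stand-in for a real HTTP request, returning canned data."""
    if url.endswith("/DB00780/2"):
        return ('[{"src_id":"1","src_compound_id":"CHEMBL1089"},'
                '{"src_id":"2","src_compound_id":"DB00780"}]')
    raise IOError("no canned response for " + url)


# Search from DrugBank (datasource 2) instead of ChEMBL (datasource 1)
result = cross_reference(["DB00780", "DB99999"], 2, fake_fetch)
print(result)
```

Inside Vortex the `fetch` argument would wrap `urllib2.urlopen(url).read()`. When searching from a datasource other than ChEMBL, the script's column dictionary would also need an entry for '1' so that the ChEMBL ID has somewhere to go.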

The Vortex Script

#Use ChEMBLid to search using Unichem to get all data
#http://www.macinchem.org
#All rights reserved.

# Python imports
import urllib2
import urllib
from com.xhaus.jyson import JysonCodec as json

# Vortex imports
import com.dotmatics.vortex.util.Util as Util
import com.dotmatics.vortex.mol2img.jni.genImage as genImage
import com.dotmatics.vortex.mol2img.Mol2Img as mol2Img
import jarray
import binascii
import string
import os

input_label = swing.JLabel("ChEMBLid column (for input)")
input_cb = workspace.getColumnComboBox()
panel = swing.JPanel()

layout.fill(panel, input_label, 0, 0)
layout.fill(panel, input_cb,    1, 0)

ret = vortex.showInDialog(panel, "Choose ChEMBLid column")

if ret == vortex.OK:
    input_idx = input_cb.getSelectedIndex()

    if input_idx == 0:
        vortex.alert("you must choose a column")
    else:
        chosen_col = vtable.getColumn(input_idx - 1)
        #col names from here https://www.ebi.ac.uk/unichem/ucquery/listSources

        cols = {
            '2': vtable.findColumnWithName('Drugbank', 1), #2
            '3': vtable.findColumnWithName('PDB', 1), #3
            '4': vtable.findColumnWithName('Guide to Pharm', 1), #4
            '5': vtable.findColumnWithName('Drugs of the Future', 1), #5
            '6': vtable.findColumnWithName('Kegg Ligand', 1), #6
            '7': vtable.findColumnWithName('ChEBI', 1), #7
            '8': vtable.findColumnWithName('NIH Clinical', 1), #8
            '9': vtable.findColumnWithName('ZINC', 1), #9
            '10': vtable.findColumnWithName('eMolecules', 1), #10
            '11': vtable.findColumnWithName('IBM IP', 1), #11
            '12': vtable.findColumnWithName('Gene Expression', 1), #12
            '14': vtable.findColumnWithName('FDA Substance', 1), #14
            '15': vtable.findColumnWithName('SureChEMBL Patents', 1), #15
            '17': vtable.findColumnWithName('PharmGKB', 1), #17
            '18': vtable.findColumnWithName('Human Metab', 1), #18
            '20': vtable.findColumnWithName('Selleck', 1), #20
            '21': vtable.findColumnWithName('Thomson Pharma', 1), #21
            '22': vtable.findColumnWithName('Pubchem', 1), #22
            '23': vtable.findColumnWithName('Mcule', 1), #23
            '24': vtable.findColumnWithName('NMR shift DB', 1), #24
            '25': vtable.findColumnWithName('Networks', 1), #25
            '26': vtable.findColumnWithName('Toxicology Resource', 1), #26
            '27': vtable.findColumnWithName('Recon', 1), #27 (was a second 'Human Metab', which shared a column with source 18)
            '28': vtable.findColumnWithName('MolPort', 1), #28
            '29': vtable.findColumnWithName('Japanese Chemicals', 1), #29
            '31': vtable.findColumnWithName('BindingDB', 1), #31
        }

        rows = vtable.getRealRowCount()
        for r in range(0, int(rows)):
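            chembl_id = chosen_col.getValueAsString(r)
            if not chembl_id:
                # Nothing to look up for an empty cell
                continue
            # "https://www.ebi.ac.uk/unichem/rest/src_compound_id/CHEMBL1089/1"
            api_url = 'https://www.ebi.ac.uk/unichem/rest/src_compound_id/%s/1' % chembl_id
            try:
                molecule_record = urllib2.urlopen(api_url).read()
            except urllib2.URLError:
                # URLError also covers HTTPError, so unknown IDs and network failures both skip the row
                continue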
            j = json.loads(molecule_record)
            for entry in j:
                src_id = entry['src_id']
                if src_id in cols:
                    cols[src_id].setValueFromString(r, entry['src_compound_id'])




vtable.fireTableStructureChanged()

The script can be downloaded from here https://macinchem.org/reviews/vortexscripts/ChEMBLid2allID.vpy.zip

Page Updated 15 February 2016