Getting UniChem data from ChEMBL
UniChem is a web resource provided by the EBI. It is a 'Unified Chemical Identifier' system, designed to assist in the rapid cross-referencing of chemical structures, and their identifiers, between multiple databases. UniChem currently contains data from 27 different data sources and provides links to 108,941,995 structures.
Chambers, J., Davies, M., Gaulton, A., Hersey, A., Velankar, S., Petryszak, R., Hastings, J., Bellis, L., McGlinchey, S. and Overington, J.P. UniChem: A Unified Chemical Structure Cross-Referencing and Identifier Tracking System. Journal of Cheminformatics 2013, 5:3 (January 2013). DOI: http://dx.doi.org/10.1186/1758-2946-5-3
Whilst I've written about a script to search using InChIKeys, it is also possible to search using compound identifiers.
ChEMBL also provides a RESTful web service for retrieving data from the UniChem database programmatically.
All RESTful queries are constructed using the following base URL:
https://www.ebi.ac.uk/unichem/rest/
Specific query URLs are then constructed by adding a method name to this base URL, followed by input data.
Input data may be of three types:
src_compound_id (the molecule identifier)
src_id (the number for the datasource, ChEMBL is 1)
InChIKey
Since different datasources will have different molecule identifiers for the same molecule, it is important to supply both the ID and the corresponding datasource.
Since we have the ChEMBL ID, and ChEMBL is datasource 1, our URL will have the form:
https://www.ebi.ac.uk/unichem/rest/src_compound_id/CHEMBL1089/1
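The URL pattern above can be captured in a small helper. This is a minimal sketch in standard Python 3 (not the Jython used inside Vortex); the function name build_src_compound_url is hypothetical, and percent-encoding the identifier is a precaution of mine rather than something the UniChem documentation is quoted as requiring.

```python
from urllib.parse import quote

# Base URL quoted in the text above.
BASE_URL = "https://www.ebi.ac.uk/unichem/rest/"


def build_src_compound_url(src_compound_id, src_id):
    """Return the src_compound_id query URL for one identifier and datasource.

    Hypothetical helper: method name, then the molecule identifier,
    then the datasource number, as described in the text.
    """
    # Percent-encode the identifier so unusual characters cannot break the path.
    encoded_id = quote(str(src_compound_id), safe="")
    return "%ssrc_compound_id/%s/%d" % (BASE_URL, encoded_id, int(src_id))


print(build_src_compound_url("CHEMBL1089", 1))
```

For a ChEMBL ID the second argument is 1; for another datasource it would be that datasource's number.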
By default the data is returned in JSON format as a list of entries, each holding two key-value pairs: the datasource number (src_id) and the compound ID in that datasource (src_compound_id).
#Data format
[{"src_id":"1","src_compound_id":"CHEMBL1089"},
 {"src_id":"2","src_compound_id":"DB00780"},
 {"src_id":"4","src_compound_id":"7266"},
 {"src_id":"6","src_compound_id":"C07430"},
 {"src_id":"7","src_compound_id":"8060"},
 {"src_id":"8","src_compound_id":"SAM002589985"},
 {"src_id":"10","src_compound_id":"1987170"},
 {"src_id":"11","src_compound_id":"F484C6DCFFC08118224D7D07C06DD841"},
 {"src_id":"14","src_compound_id":"O408N561GF"},
 {"src_id":"15","src_compound_id":"SCHEMBL34335"},
 {"src_id":"17","src_compound_id":"PA450903"},
 {"src_id":"18","src_compound_id":"HMDB14918"},
 {"src_id":"21","src_compound_id":"15297289"},
 {"src_id":"22","src_compound_id":"3675"},
 {"src_id":"23","src_compound_id":"MCULE-2911295500"},
 {"src_id":"25","src_compound_id":"LSM-5928"},
 {"src_id":"26","src_compound_id":"51-71-8"},
 {"src_id":"29","src_compound_id":"J4.125D"},
 {"src_id":"31","src_compound_id":"50105417"}]
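To illustrate how a response of this shape can be turned into a lookup keyed by datasource, here is a minimal Python 3 sketch. The sample string holds only the first three entries of the response for CHEMBL1089, and the helper name ids_by_source is hypothetical. Values are collected into lists on the assumption that a datasource could return more than one identifier for a structure.

```python
import json

# First three entries of the CHEMBL1089 response, copied from the text.
sample = (
    '[{"src_id":"1","src_compound_id":"CHEMBL1089"},'
    '{"src_id":"2","src_compound_id":"DB00780"},'
    '{"src_id":"4","src_compound_id":"7266"}]'
)


def ids_by_source(response_text):
    """Map each src_id (a string in the JSON) to a list of compound IDs."""
    by_source = {}
    for entry in json.loads(response_text):
        # setdefault creates the list the first time a src_id is seen.
        by_source.setdefault(entry["src_id"], []).append(entry["src_compound_id"])
    return by_source


mapping = ids_by_source(sample)
print(mapping["2"])
```

Note that src_id arrives as a string ("2", not 2), which is why the Vortex script keys its column dictionary with strings.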
The first part of the script asks the user to select the column that contains the ChEMBL ID, and then creates one output column per datasource. We then loop through the workspace, calling the web service for each ID, parsing the returned JSON and populating the workspace with the results.
It should be straightforward to modify the script to search any of the datasources with the appropriate list of molecule identifiers.
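As a rough illustration of that modification outside Vortex, the sketch below takes the datasource number as a parameter instead of hard-coding 1. It is plain Python 3 rather than the Jython script itself, the names fetch_text and cross_references are hypothetical, and it assumes the endpoint behaves as described above. The fetch argument exists only so the function can be exercised without network access.

```python
import json
from urllib.error import HTTPError
from urllib.request import urlopen

BASE_URL = "https://www.ebi.ac.uk/unichem/rest/"


def fetch_text(url):
    """Download one URL and return the body as text, or None on an HTTP error."""
    try:
        with urlopen(url) as response:
            return response.read().decode("utf-8")
    except HTTPError:
        return None


def cross_references(compound_ids, src_id, fetch=fetch_text):
    """Return {compound_id: {src_id: src_compound_id}} for each input ID.

    src_id is the datasource the input identifiers come from
    (1 for ChEMBL, 2 for DrugBank in the response shown earlier).
    """
    results = {}
    for compound_id in compound_ids:
        url = "%ssrc_compound_id/%s/%s" % (BASE_URL, compound_id, src_id)
        body = fetch(url)
        if body is None:
            # Mirror the script: skip identifiers the service rejects.
            results[compound_id] = {}
            continue
        results[compound_id] = {
            entry["src_id"]: entry["src_compound_id"] for entry in json.loads(body)
        }
    return results
```

Calling cross_references(["DB00780"], 2) would then search with DrugBank identifiers rather than ChEMBL IDs.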
The Vortex Script
#Use ChEMBLid to search using Unichem to get all data
#http://www.macinchem.org
#All rights reserved.
# Python imports
import urllib2
import urllib
from com.xhaus.jyson import JysonCodec as json
# Vortex imports
import com.dotmatics.vortex.util.Util as Util
import com.dotmatics.vortex.mol2img.jni.genImage as genImage
import com.dotmatics.vortex.mol2img.Mol2Img as mol2Img
import jarray
import binascii
import string
import os
# Ask the user which column holds the ChEMBL IDs
input_label = swing.JLabel("ChEMBLid column (for input)")
input_cb = workspace.getColumnComboBox()
panel = swing.JPanel()
layout.fill(panel, input_label, 0, 0)
layout.fill(panel, input_cb, 1, 0)
ret = vortex.showInDialog(panel, "Choose ChEMBLid column")
if ret == vortex.OK:
    input_idx = input_cb.getSelectedIndex()
    if input_idx == 0:
        vortex.alert("you must choose a column")
    else:
        chosen_col = vtable.getColumn(input_idx - 1)
        # Column names from https://www.ebi.ac.uk/unichem/ucquery/listSources
        # Keys are the UniChem src_id values, kept as strings to match the JSON
        cols = {
            '2': vtable.findColumnWithName('Drugbank', 1), #2
            '3': vtable.findColumnWithName('PDB', 1), #3
            '4': vtable.findColumnWithName('Guide to Pharm', 1), #4
            '5': vtable.findColumnWithName('Drugs of the Future', 1), #5
            '6': vtable.findColumnWithName('Kegg Ligand', 1), #6
            '7': vtable.findColumnWithName('ChEBI', 1), #7
            '8': vtable.findColumnWithName('NIH Clinical', 1), #8
            '9': vtable.findColumnWithName('ZINC', 1), #9
            '10': vtable.findColumnWithName('eMolecules', 1), #10
            '11': vtable.findColumnWithName('IBM IP', 1), #11
            '12': vtable.findColumnWithName('Gene Expression', 1), #12
            '14': vtable.findColumnWithName('FDA Substance', 1), #14
            '15': vtable.findColumnWithName('SureChEMBL Patents', 1), #15
            '17': vtable.findColumnWithName('PharmGKB', 1), #17
            '18': vtable.findColumnWithName('Human Metab', 1), #18
            '20': vtable.findColumnWithName('Selleck', 1), #20
            '21': vtable.findColumnWithName('Thomson Pharma', 1), #21
            '22': vtable.findColumnWithName('Pubchem', 1), #22
            '23': vtable.findColumnWithName('Mcule', 1), #23
            '24': vtable.findColumnWithName('NMR shift DB', 1), #24
            '25': vtable.findColumnWithName('Networks', 1), #25
            '26': vtable.findColumnWithName('Toxicology Resource', 1), #26
            # Own name for source 27 so it does not share the column used by source 18
            '27': vtable.findColumnWithName('Recon', 1), #27
            '28': vtable.findColumnWithName('MolPort', 1), #28
            '29': vtable.findColumnWithName('Japanese Chemicals', 1), #29
            '31': vtable.findColumnWithName('BindingDB', 1), #31
        }
        rows = vtable.getRealRowCount()
        for r in range(0, int(rows)):
            chembl_id = chosen_col.getValueAsString(r)
            # "https://www.ebi.ac.uk/unichem/rest/src_compound_id/CHEMBL1089/1"
            api_url = 'https://www.ebi.ac.uk/unichem/rest/src_compound_id/%s/1' % chembl_id
            try:
                molecule_record = urllib2.urlopen(api_url).read()
            except urllib2.HTTPError:
                # Skip rows whose ID the service does not recognise
                continue
            j = json.loads(molecule_record)
            # Write each returned identifier into the column for its datasource
            for entry in j:
                src_id = entry['src_id']
                if src_id in cols:
                    cols[src_id].setValueFromString(r, entry['src_compound_id'])
        vtable.fireTableStructureChanged()
The script can be downloaded from https://macinchem.org/reviews/vortexscripts/ChEMBLid2allID.vpy.zip
Page Updated 15 February 2016