Macs in Chemistry

Insanely great science


ChEMBL Models iPython Notebook

With the release of ChEMBL 21 has come a set of updated target predicted models.

The good news is that, besides the increase in terms of training data (compounds and targets), the new models were built using the latest stable versions of RDKit (2015.09.2) and scikit-learn (0.17). The latter was upgraded from the much older 0.14 version, which was causing incompatibility issues while trying to use the models.

They have also put together a quick Jupyter Notebook demo on how to get predictions from the models here The supplied notebook allows the user to submit a molecule and flag the targets for which the highest predicted activity resides.

I've been using the models and I thought I'd share an iPython Notebook I have created. This is based on the ChEMBL notebook with code tidbits taken from the absolutely invaluable Stack Overflow. I'm often in the situation where I actually want to know the predicted activity at specific targets, and specifically want to confirm lack of predicted activity at potential off-targets. I could have a notebook for each target but actually the speed of calculation means that I can calculate all the models and then just cherry pick those of interest.

The first part of the iPython notebook is very similar to the provided notebook, importing classes and running the predictive models.

Import classes

from rdkit.Chem import AllChem as Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
from rdkit.Chem import Draw
from rdkit import rdBase
from rdkit import DataStructs

import pandas as pd
from pandas import concat

from collections import OrderedDict
import requests
import numpy as np
import matplotlib.pyplot as plt
from sklearn.externals import joblib

%matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = 10,10

morgan_nb = joblib.load('./models_21/10uM/mNB_10uM_all.pkl')
classes = list(morgan_nb.classes_)


smiles = 'CN(C)CCC1COC2=C(O1)C=CC=C2' #Test Compound

mol = Chem.MolFromSmiles(smiles)


Run predictive models

fp = Chem.GetMorganFingerprintAsBitVect(mol,2,nBits=2048, bitInfo=info)
res = np.zeros(len(fp),np.int32)

probas = list(morgan_nb.predict_proba(res.reshape(1,-1))[0])
classes = list(morgan_nb.targets)
predictions = pd.DataFrame(zip(classes, probas), columns=['id','proba'])


id proba


CHEMBL1075051 2.106367e-16
1 CHEMBL1075104 1.258508e-09
2 CHEMBL1075108 4.318425e-11
3 CHEMBL1075111 3.499282e-11
4 CHEMBL1075115 2.756820e-11
5 CHEMBL1075138 1.923020e-03
6 CHEMBL1075140 2.121424e-12
7 CHEMBL1075145 5.730461e-14
8 CHEMBL1075152 3.569617e-12

As the plot below shows whilst most targets are clear of activity there are a few with a high probability of activity.


Get target information

Getting the target information can take a little while for all the models so the first time we need to download the information and then save as pickle, on subsequent runs we can simply read the local file and save time.

def fetch_WS(trgt):
    re = requests.get('{0}.json'.format(trgt))
    # return (trgt, re.json()['pref_name'], re.json()['organism'])
    return (trgt, re.json()['pref_name'], re.json()['organism'], [c['accession'] for c in re.json()['target_components']])

# If pickle doesn't exist, download target data from web service and save as pickle
if not os.path.isfile('./models_21/10uM/alltargetinfo.pkl'):
    plist = []
    for e in top_preds['id'][:10]:
    target_info = pd.DataFrame(plist, columns =['id','name', 'organism', 'UNIPROT'])

# Load target data from pickle
target_info = pd.read_pickle('./models_21/10uM/alltargetinfo.pkl')


id name organism UNIPROT


CHEMBL315 Alpha-1b adrenergic receptor Rattus norvegicus [P15823]
1 CHEMBL1942 Alpha-2b adrenergic receptor Homo sapiens [P18089]
2 CHEMBL3459 Serotonin 1b (5-HT1b) receptor Rattus norvegicus [P28564]
3 CHEMBL1916 Alpha-2c adrenergic receptor Homo sapiens [P18825]
4 CHEMBL1867 Alpha-2a adrenergic receptor Homo sapiens [P08913]
5 CHEMBL273 Serotonin 1a (5-HT1a) receptor Rattus norvegicus [P19327]
6 CHEMBL3933 Prostanoid DP receptor Mus musculus [P70263]
7 CHEMBL214 Serotonin 1a (5-HT1a) receptor Homo sapiens [P08908]

We can then merge the activity data with the target information.

result = pd.merge(top_preds, target_info)

id proba name organism UNIPROT


CHEMBL315 9.999994e-01 Alpha-1b adrenergic receptor Rattus norvegicus [P15823]
1 CHEMBL1942 9.999844e-01 Alpha-2b adrenergic receptor Homo sapiens [P18089]
2 CHEMBL3459 9.999494e-01 Serotonin 1b (5-HT1b) receptor Rattus norvegicus [P28564]
3 CHEMBL1916 9.999384e-01 Alpha-2c adrenergic receptor Homo sapiens [P18825]
4 CHEMBL1867 9.998181e-01 Alpha-2a adrenergic receptor Homo sapiens [P08913]
5 CHEMBL273 9.993896e-01 Serotonin 1a (5-HT1a) receptor Rattus norvegicus [P19327]

Select required predictive models

We then select the data for the predicted models that we want to display. You can obviously edit this to display data from those models that are of most interest to you.

p_5HT2A = result.loc[result['id']=='CHEMBL322', 'proba'].values[0] #Serotonin 2a (5-HT2a) receptor
p_alpha1b = result.loc[result['id']=='CHEMBL315', 'proba'].values[0] #Alpha-1b adrenergic receptor
p_DopaD2 = result.loc[result['id']=='CHEMBL339', 'proba'].values[0] #Dopamine D2 receptor
p_alpha1d = result.loc[result['id']=='CHEMBL223', 'proba'].values[0] #Alpha-1d adrenergic receptor
p_alpha2a = result.loc[result['id']=='CHEMBL327', 'proba'].values[0] #Alpha-2a adrenergic receptor
p_SeroTran = result.loc[result['id']=='CHEMBL313', 'proba'].values[0] # Serotonin transporter

Create the radar plot

A radar chart is a graphical method of displaying multivariate data in the form of a two-dimensional chart of multiple quantitative variables each represented on axes starting from the same point.

    def _scale_data(data, ranges):
    """scales data[1:] to ranges[0]"""
    x1, x2 = ranges[0]
    d = data[0]
    sdata = [d]
    for d, (y1, y2) in zip(data[1:], ranges[1:]):
        sdata.append((d-y1) / (y2-y1) * (x2 - x1) + x1)
    return sdata

class ComplexRadar():

    def __init__(self, fig, variables, ranges,
        angles = np.arange(0, 360, 360./len(variables))
        axes = [fig.add_axes([0.1,0.1,0.9,0.9],polar=True,
                label = "axes{}".format(i)) 
                for i in range(len(variables))]
        l, text = axes[0].set_thetagrids(angles, labels=variables)
        [txt.set_rotation(angle-90) for txt, angle 
             in zip(text, angles)]
        for ax in axes[1:]:
        for i, ax in enumerate(axes):
            grid = np.linspace(*ranges[i], num=n_ordinate_levels)
            gridlabel = ["{}".format(round(x,1))  for x in grid]
            gridlabel[0] = "" # clean up origin
            ax.set_rgrids(grid, labels=gridlabel, angle=angles[i])
        # variables for plotting
        self.angle = np.deg2rad(np.r_[angles, angles[0]])
        self.ranges = ranges = axes[0]

    def plot(self, data, *args, **kw):
        sdata = _scale_data(data, self.ranges), np.r_[sdata, sdata[0]], *args, **kw)

    def fill(self, data, *args, **kw):
        sdata = _scale_data(data, self.ranges), np.r_[sdata, sdata[0]], *args, **kw)

variables = ("5HT2A", "alpha1b", "DopaD2", "alpha1d", "alpha2a", "SeroTrans")
data = (p_5HT2A, p_alpha1b, p_DopaD2, p_alpha1d, p_alpha2a, p_SeroTran)

ranges = [(0.01, 1.0), (0.01, 1.0), (0.01, 1.0), (0.01, 1.0), (0.01, 1.0), (0.01, 1.0)]   

# plotting
fig1 = plt.figure(figsize=(6, 6))
radar = ComplexRadar(fig1, variables, ranges)
radar.fill(data, alpha=0.2)


The python notebook can be downloaded from here

Last updated 6 April 2016.