Macs in Chemistry

Insanely great science

 

Predicting AMES activity Jupyter Notebook

I've been experimenting with the use of Jupyter Notebooks (aka iPython Notebooks) as an electronic lab notebook but also a means to share computational models. The aim would be to see how easy it would be to provide a model together with the associated training data together with an explanation of how the model was built and how it can be used for novel molecules.

In this first notebook a random forest model to predict AMES activity is described. For the training data I used a published set, Benchmark Data Set for in Silico Prediction of Ames Mutagenicity, Katja Hansen†, Sebastian Mika‡, Timon Schroeter†, Andreas Sutter§, Antonius ter Laak§, Thomas Steger-Hartmann§, Nikolaus Heinrich§ and Klaus-Robert Müller*†, University of Technology, Berlin, Germany, idalab GmbH, Berlin, Germany, and Bayer Schering Pharma AG, Berlin, Germany. J. Chem. Inf. Model., 2009, 49 (9), pp 2077–2081 DOI. This data set describes the structures and experimentally determined AMES activity for over 6500 molecules.

The Ames test is a widely employed method that uses bacteria to test whether a given chemical can cause mutations in the DNA of the test organism. More formally, it is a biological assay to assess the mutagenic potential of chemical compounds. PNAS. 70 (8): 2281–5. doi

The first step was to check the data set, I used MOE to create a molecular database and imported all structures to check they were valid, I then used the wash command to remove, counter ions (salts), rebalance protonations states, add explicit hydrogens and check tautomers. Since I want to use the model to evaluate potential organic drug-like molecules I also removed things like inorganic chromium salts. I also renamed the column containing the record ID to "IDNUMBER". This left 6011 molecules, which were then exported to an sdf file "AMESdata.sdf".

amesdata

Building the model

With the tided up dataset available we can now write the notebook, the first step is to import the required Python classes. I've tagged them with the versions used. I actually think this is a very useful policy, I've found several instances where the results vary depending on the versions used.

from rdkit import Chem # rdkit 2016.03.5
from rdkit.Chem import PandasTools
import pandas as pd # pandas==0.17.1
import pandas_ml as pdml # pandas-ml==0.4.0
from rdkit.Chem import AllChem, DataStructs
import numpy #numpy==1.12.0
from sklearn.model_selection import train_test_split # scikit-learn==0.18.1
import subprocess
from StringIO import StringIO
import pickle
import os
%matplotlib inline

We next import the dataset into a Pandas data frame using the rdkit PandasTools

sdfFilePath = 'AMESdata.sdf'
AMESmolsdf = PandasTools.LoadSDF(sdfFilePath)
AMESmolsdf.head(n=2)

ameshead

Used a Chemaxon tool evaluate to calculate a variety of physiochemical properties, this is accessible from the command line. The output requires a little manipulation, depending on the format of the IDNUMBER the process can put in thousands separators so we need to remove them. The acid and base pKa calculations will of course leave blank fields for neutral molecules so we change them to a simple boolean as to whether they will be positively of negatively ionised at physiological pKa.

p = subprocess.Popen(['/Applications/ChemAxon/MarvinBeans/bin/evaluate', sdfFilePath, '-g', '-e', "field('IDNUMBER');molString('smiles');logp(); logd('7.4'); apka('1'); bpka('1'); atomCount(); mass(); acceptorcount(); donorcount(); topologicalPolarSurfaceArea(); rotatablebondcount(); atomCount()-atomCount('1');aromaticAtomCount()/(atomCount()-atomCount('1'))"], stdout=subprocess.PIPE)

output = p.communicate()[0]
#remove commas evaluate puts into IDNUMBERs
output=output.replace(',','')
df = pd.read_csv(StringIO(output), sep=';', names= ['IDNUMBER', 'SMILES', 'logP', 'logD', 'apka', 'bpka', 'atomCount', 'mass', 'HBA', 'HBD', 'PSA', 'RBC', 'HAC', 'FractionAromatic'])
df['Base'] = df['bpka'] > 7.5 #true or false
df['Acid'] = df['apka'] < 7.5
df.Base = df.Base.astype(int) #convert to boolean int
df.Acid = df.Acid.astype(int)
df.head()

head1

We can now merge the two data frames using the IDNUMBER

AMESmolsdf = pd.merge(df, AMESmolsdf, left_on='IDNUMBER', right_on='IDNUMBER')

Now we calculate the Morgan fingerprints with radius 2, as molecular descriptors, J. Chem. Inf. Model., 2010, 50 (5), pp 742–754 DOI.

def getFparr( mol ):
    fp = AllChem.GetMorganFingerprintAsBitVect( mol,2 )
    arr = numpy.zeros((1,))
    DataStructs.ConvertToNumpyArray( fp , arr )
    return arr
AMESmolsdf['mfp2']  = AMESmolsdf.ROMol.apply(getFparr)
AMESmolsdf.to_csv("Dataformodel.csv")#store data and calculated properties

Then we convert all the molecular descriptors to a list and create a data frame for the machine learning. We could loop over the attributes using res[x] = getattr(AMESmolsdf, x), however I've used this rather lengthly version so that it is easy to comment out any attributes you want to exclude from the model. If you do this you will need to make the same change in the section of the script dealing with novel query molecules. It is also useful for trouble shooting.

res = pd.DataFrame(AMESmolsdf.mfp2.tolist())
res["LogP"] = AMESmolsdf.logP
res["LogD"] = AMESmolsdf.logD
res["Acid"] = AMESmolsdf.Acid
res["Base"] = AMESmolsdf.Base
res["Mass"] = AMESmolsdf.mass
res["HBA"] = AMESmolsdf.HBA
res["HBD"] = AMESmolsdf.HBD
res["PSA"] = AMESmolsdf.PSA
res["RBC"] = AMESmolsdf.RBC
res["HAC"] = AMESmolsdf.HAC
res["FractionAromatic"] = AMESmolsdf.FractionAromatic
res["AMES_Activity"] = AMESmolsdf.AMES_Activity
res.head(n=2)

res

df4ml = pdml.ModelFrame( res, target='AMES_Activity' )

We then split the dataset into a training and a test set prior to generating the random forest model.

train_df, test_df = df4ml.model_selection.train_test_split(test_size=0.1, random_state=0)

get a random forest classifier with 100 trees,

rfdf = df4ml.ensemble.RandomForestClassifier(n_estimators=100, random_state=0)
train_df.fit(rfdf)

testhead

However once the model has been generated it would be nice to be able to save it so we don't have to regenerate it each time we use it, and also share it, so we wrap this using pickle

if os.path.exists('rfdf.pickle'):
    print('Loading from pickle')
    # pickle file exists, just load it
    with open('rfdf.pickle', 'rb') as f:
        rfdf = pickle.load(f)   
else:
    print('Generating model')
    # No pickle file, generate model
    rfdf = df4ml.ensemble.RandomForestClassifier(n_estimators=100, random_state=0)
    train_df.fit(rfdf)
    with open('rfdf.pickle', 'wb') as f:
        pickle.dump(rfdf, f)

So the first time the model will be generated, subsequently rfdf.pickle will be loaded which is obviously much more efficient.

Testing the model

We can then use the test_df to evaluate the model and display the results using a confusion matrix. The first time through you can edit the descriptors or model parameters to try and improve the model. But remember you will need to edit the relevant components when using the model to predict novel molecules.

predict_rf=test_df.predict(rfdf)
test_df.metrics.confusion_matrix()

confusionmatrix

If you want to do a more detailed analysis in an external program we can export the results. First we rename the column containing the predicted results.

predict_df=pd.DataFrame(predict_rf)
predict_df.rename(columns = {0:'Pred'}, inplace = True)

Then we combine with experimental data and descriptors (test_df), and export the results (testResults.csv) to a file.

Compare_df= pd.concat([predict_df, test_df], axis = 1)
Compare_df.to_csv("testResults.csv")
Compare_df.head(n=3)

test_pred

The testResults.csv file can be opened in other application (I used Vortex) for further investigation.

histograms

Evaluating Novel Compounds

The model can now be used to evaluate novel compounds, in the first instance a single molecule imported as a SMILES string.This time we generate the Morgan fingerprints first. Note we also create a column called ID, we will use this column to merge with the physicochemical property calculation later.

#Testing a novel SMILES string, 
mySMILES ='N1C=CC2=C(C=CC=C12)C1=C2C=C3C=CC=CC3=CC2=CC=C1'
from rdkit import RDConfig
mySMILESdf1 = pd.DataFrame(columns=['Smiles', 'ID']) #need ID to match dataframes later
mySMILESdf1 = mySMILESdf1.append({'Smiles':mySMILES, 'ID':123}, ignore_index=True)   
PandasTools.AddMoleculeColumnToFrame(mySMILESdf1,'Smiles','Molecule')
mySMILESdf1['mfp2']  = mySMILESdf1.Molecule.apply(getFparr)
mySMILESdf1

smilesstring

Now we use "evaluate" to generate the physicochemical property data and merge with molecular descriptor data.

p = subprocess.Popen(['/Applications/ChemAxon/MarvinBeans/bin/evaluate', mySMILES , '-g', '-e', "molString('smiles');logp(); logd('7.4'); apka('1'); bpka('1'); atomCount(); mass(); acceptorcount(); donorcount(); topologicalPolarSurfaceArea(); rotatablebondcount(); atomCount()-atomCount('1');aromaticAtomCount()/(atomCount()-atomCount('1'))"], stdout=subprocess.PIPE)

output = p.communicate()[0]

#remove commas evaluate puts into IDNUMBERs
output=output.replace(',','')
mySMILESdf = pd.read_csv(StringIO(output), sep=';', names= ['SMILES', 'logP', 'logD', 'apka', 'bpka', 'atomCount', 'mass', 'HBA', 'HBD', 'PSA', 'RBC', 'HAC', 'FractionAromatic'])
mySMILESdf['Base'] = mySMILESdf['bpka'] > 7.5 #true or false
mySMILESdf['Acid'] = mySMILESdf['apka'] < 7.5
mySMILESdf.Base = mySMILESdf.Base.astype(int) #convert to boolean int
mySMILESdf.Acid = mySMILESdf.Acid.astype(int)
mySMILESdf['ID']= 123

mySMILESdf = pd.merge(mySMILESdf, mySMILESdf1, left_on='ID', right_on='ID')

As before we create the query data frame from the list of molecular descriptors and the calculated properties (NB if you have changed/deleted any of the descriptors used to generate the model you will also need to change them here).

resQuery = pd.DataFrame(mySMILESdf.mfp2.tolist())
resQuery["LogP"] = mySMILESdf.logP
resQuery["LogD"] = mySMILESdf.logD
resQuery["Acid"] = mySMILESdf.Acid
resQuery["Base"] = mySMILESdfBase
resQuery["Mass"] = mySMILESdf.mass
resQuery["HBA"] = mySMILESdf.HBA
resQuery["HBD"] = mySMILESdf.HBD
resQuery["PSA"] = mySMILESdf.PSA
resQuery["RBC"] = mySMILESdf.RBC
resQuery["HAC"] = mySMILESdf.HAC
resQuery["FractionAromatic"] = mySMILESdf.FractionAromatic
resQuery

resquery

We can use the model to predict activity

rint rfdf.predict(resQuery)
print rfdf.predict_proba(resQuery)
['1']
[[ 0.23  0.77]]

We use a similar process to predict the activities for a file containing structures. In this case I'm using a file containing 4337 structures taken from Derivation and Validation of Toxicophores for Mutagenicity Prediction, J. Med. Chem., 2005, 48 (1), pp 312–320. DOI. The files were first cleaned up using MOE as described above.

#Predicting a file
sdfnewFilePath = 'cas_4337.sdf'
AMESnewsdf = PandasTools.LoadSDF(sdfnewFilePath)
p = subprocess.Popen(['/Applications/ChemAxon/MarvinBeans/bin/evaluate', sdfnewFilePath, '-g', '-e', "field('IDNUMBER');molString('smiles');logp(); logd('7.4'); apka('1'); bpka('1'); atomCount(); mass(); acceptorcount(); donorcount(); topologicalPolarSurfaceArea(); rotatablebondcount(); atomCount()-atomCount('1');aromaticAtomCount()/(atomCount()-atomCount('1'))"], stdout=subprocess.PIPE)

output = p.communicate()[0]
#remove commas evaluate puts into IDNUMBERs
output=output.replace(',','')
newdf = pd.read_csv(StringIO(output), sep=';', names= ['IDNUMBER', 'SMILES', 'logP', 'logD', 'apka', 'bpka', 'atomCount', 'mass', 'HBA', 'HBD', 'PSA', 'RBC', 'HAC', 'FractionAromatic'])

newdf['Base'] = newdf['bpka'] > 7.5 #true or false
newdf['Acid'] = newdf['apka'] < 7.5
newdf.Base = newdf.Base.astype(int) #convert to boolean int
newdf.Acid = newdf.Acid.astype(int)


AMESnewsdf = pd.merge(newdf, AMESnewsdf, left_on='IDNUMBER', right_on='IDNUMBER')

AMESnewsdf.head(n=2)

def getFparr( mol ):
    fp = AllChem.GetMorganFingerprintAsBitVect( mol,2 )
    arr = numpy.zeros((1,))
    DataStructs.ConvertToNumpyArray( fp , arr )
    return arr
AMESnewsdf['mfp2']  = AMESnewsdf.ROMol.apply(getFparr)

resnew = pd.DataFrame(AMESnewsdf.mfp2.tolist())
resnew["LogP"] = AMESnewsdf.logP
resnew["LogD"] = AMESnewsdf.logD
resnew["Acid"] = AMESnewsdf.Acid
resnew["Base"] = AMESnewsdf.Base
resnew["Mass"] = AMESnewsdf.mass
resnew["HBA"] = AMESnewsdf.HBA
resnew["HBD"] = AMESnewsdf.HBD
resnew["PSA"] = AMESnewsdf.PSA
resnew["RBC"] = AMESnewsdf.RBC
resnew["HAC"] = AMESnewsdf.HAC
resnew["FractionAromatic"] = AMESnewsdf.FractionAromatic

newdf4ml = pdml.ModelFrame(resnew)
Newpredict_rf=newdf4ml.predict(rfdf)

We merge the results with the input data, we could export this data but first we can tidy things up a little, rename the column containing the predicted result to "Predicted" and removing the columns containing the image and fingerprints to reduce the file size.

Newpredict_df=pd.DataFrame(Newpredict_rf)
Results_df= pd.concat([Newpredict_df, AMESnewsdf], axis = 1)
Results_df= Results_df.drop('mfp2', axis=1)
Results_df= Results_df.drop('ROMol', axis=1)#remove image column
Results_df= Results_df.rename(columns={0: "Pred"}) #rename column containing predicted result
Results_df.head(n=5)

predicted

Results_df.to_csv("QueryResults.csv")

We can import the results into other applications for further analysis.

Simply Using a Precalculated model

Once the model is built we don't need to rebuild it everytime and so we can eliminate the time-consuming loading to the training set and calculation of properties

from rdkit import Chem # rdkit 2016.03.5
from rdkit.Chem import PandasTools
import pandas as pd # pandas==0.17.1
import pandas_ml as pdml # pandas-ml==0.4.0
from rdkit.Chem import AllChem, DataStructs
import numpy #numpy==1.12.0
from sklearn.model_selection import train_test_split # scikit-learn==0.18.1
import subprocess
from StringIO import StringIO
import pickle
import os
%matplotlib inline

First import the python classes and then load rfdf.pickle

if os.path.exists('rfdf.pickle'):
    print('Loading from pickle')
    # pickle file exists, just load it
    with open('rfdf.pickle', 'rb') as f:
        rfdf = pickle.load(f)

Then we can submit queries as before, for a single compound

#Testing a novel SMILES string, 
mySMILES ='N1C=CC2=C(C=CC=C12)C1=C2C=C3C=CC=CC3=CC2=CC=C1'
from rdkit import RDConfig
mySMILESdf1 = pd.DataFrame(columns=['Smiles', 'ID']) #need ID to match dataframes later
mySMILESdf1 = mySMILESdf1.append({'Smiles':mySMILES, 'ID':123}, ignore_index=True)   
PandasTools.AddMoleculeColumnToFrame(mySMILESdf1,'Smiles','Molecule')
def getFparr( mol ):
    fp = AllChem.GetMorganFingerprintAsBitVect( mol,2 )
    arr = numpy.zeros((1,))
    DataStructs.ConvertToNumpyArray( fp , arr )
    return arr
mySMILESdf1['mfp2']  = mySMILESdf1.Molecule.apply(getFparr)

p = subprocess.Popen(['/Applications/ChemAxon/MarvinBeans/bin/evaluate', mySMILES , '-g', '-e', "molString('smiles');logp(); logd('7.4'); apka('1'); bpka('1'); atomCount(); mass(); acceptorcount(); donorcount(); topologicalPolarSurfaceArea(); rotatablebondcount(); atomCount()-atomCount('1');aromaticAtomCount()/(atomCount()-atomCount('1'))"], stdout=subprocess.PIPE)

output = p.communicate()[0]

#remove commas evaluate puts into IDNUMBERs
output=output.replace(',','')
mySMILESdf = pd.read_csv(StringIO(output), sep=';', names= ['SMILES', 'logP', 'logD', 'apka', 'bpka', 'atomCount', 'mass', 'HBA', 'HBD', 'PSA', 'RBC', 'HAC', 'FractionAromatic'])
mySMILESdf['Base'] = mySMILESdf['bpka'] > 7.5 #true or false
mySMILESdf['Acid'] = mySMILESdf['apka'] < 7.5
mySMILESdf.Base = mySMILESdf.Base.astype(int) #convert to boolean int
mySMILESdf.Acid = mySMILESdf.Acid.astype(int)
mySMILESdf['ID']= 123
mySMILESdf

smilesstring

mySMILESdf = pd.merge(mySMILESdf, mySMILESdf1, left_on='ID', right_on='ID')
resQuery = pd.DataFrame(mySMILESdf.mfp2.tolist())
resQuery["LogP"] = mySMILESdf.logP
resQuery["LogD"] = mySMILESdf.logD
resQuery["Acid"] = mySMILESdf.Acid
resQuery["Base"] = mySMILESdf.Base
resQuery["Mass"] = mySMILESdf.mass
resQuery["HBA"] = mySMILESdf.HBA
resQuery["HBD"] = mySMILESdf.HBD
resQuery["PSA"] = mySMILESdf.PSA
resQuery["RBC"] = mySMILESdf.RBC
resQuery["HAC"] = mySMILESdf.HAC
resQuery["FractionAromatic"] = mySMILESdf.FractionAromatic

print rfdf.predict(resQuery)
print rfdf.predict_proba(resQuery)
['1']
[[ 0.23  0.77]]

or in a file as before.

The Jupyter notebook describing the model building can be downloaded below, together with the cleaned dataset used to generate it. In addition the model can also be downloaded rfdf.pickle.

The python notebook can be downloaded from here AMESmachineLearning.ipyn
The cleaned data used to generate the model can be downloaded here AMESdata.sdf
The actual model can be downloaded from here rfdf.pickle
The notebook for using the stored model AMESmachineLearningModelOnly.ipynb.zip

Last updated 27 January 2017.