Macs in Chemistry

Insanely great science

 

Accessing Jupyter Notebook model from Vortex

I've become a great fan of Jupyter Notebooks as a way of modelling cheminformatics data, and I've published some of the notebooks here.

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.

In the predicting AMES activity notebook I also looked at the use of pickle to store the predictive model and then access it using a Jupyter notebook without the need to rebuild the model. Whilst a notebook is a nice way to access the predictive model it might also be useful to be able to access it from other applications or the command line.

I noticed that one of the options in the Jupyter notebook is to download as Python (.py), I'm not sure if this is used very often since it was actually downloaded with .html extension.

pythondownload

However this did provide a good starting point for a script to take a SMILES string as input and report the result to stdout.

To get the SMILES input we use

mySMILES = sys.argv[1]

This gets the first argument after the script as the SMILES string. At the end we convert the number returned to a text string that is a little more informative.

pred = rfdf.predict(resQuery).astype(str).astype(int)
if pred == 1:
    result = "Positive"
if pred == 0:
    result = "Negative" 

print(result)

and then return the result.

The Python Script

#!/usr/bin/env python
# -*- coding: utf-8 -*-


#Machine learning using AMES test set. AMESdata.sdf file created/washed using MOE
#Benchmark Data Set for in Silico Prediction of Ames Mutagenicity
#Katja Hansen†, Sebastian Mika‡, Timon Schroeter†, Andreas Sutter§, Antonius ter Laak§, Thomas Steger-Hartmann§, Nikolaus Heinrich§ and Klaus-Robert Müller*†
#University of Technology, Berlin, Germany, idalab GmbH, Berlin, Germany, and Bayer Schering Pharma AG, Berlin, Germany
#J. Chem. Inf. Model., 2009, 49 (9), pp 2077–2081
#DOI: http://dx.doi.org/10.1021/ci900161g
#Link to model creation http://www.macinchem.org/reviews/ipython/AMESmodels.php


#Tested using Python 3.6.1
#Requires evaluate from Chemaxon (https://docs.chemaxon.com/display/docs/Chemical+Terms+Evaluator)
from rdkit import Chem #rdkit 2017.03.3
from rdkit.Chem import PandasTools
import pandas as pd #pandas==0.17.1
import pandas_ml as pdml #pandas-ml==0.4.0
from rdkit.Chem import AllChem, DataStructs
import numpy #numpy==1.11.0
from sklearn.model_selection import train_test_split #scikit-learn==0.19
import subprocess
from io import StringIO
import io
import pickle
import os
import sys
# get_ipython().magic('matplotlib inline')



#Assumes pickle file is in same directory as script
#print('Loading from pickle')
path = os.path.join(os.path.dirname(__file__), 'rfdf.pickle')
with open(path, 'rb') as f:
    rfdf = pickle.load(f)

#Need to get SMILES from input

mySMILES = sys.argv[1]

#Testing a novel SMILES string, 
#mySMILES ='N1C=CC2=C(C=CC=C12)C1=C2C=C3C=CC=CC3=CC2=CC=C1'
from rdkit import RDConfig
mySMILESdf1 = pd.DataFrame(columns=['Smiles', 'ID']) #need ID to match dataframes later
mySMILESdf1 = mySMILESdf1.append({'Smiles':mySMILES, 'ID':123}, ignore_index=True)   
PandasTools.AddMoleculeColumnToFrame(mySMILESdf1,'Smiles','Molecule')
def getFparr( mol ):
    fp = AllChem.GetMorganFingerprintAsBitVect( mol,2 )
    arr = numpy.zeros((1,))
    DataStructs.ConvertToNumpyArray( fp , arr )
    return arr
mySMILESdf1['mfp2']  = mySMILESdf1.Molecule.apply(getFparr)
mySMILESdf1



p = subprocess.Popen(['/Applications/ChemAxon/MarvinBeans/bin/evaluate', mySMILES , '-g', '-e', "molString('smiles');logp(); logd('7.4'); apka('1'); bpka('1'); atomCount(); mass(); acceptorcount(); donorcount(); topologicalPolarSurfaceArea(); rotatablebondcount(); atomCount()-atomCount('1');aromaticAtomCount()/(atomCount()-atomCount('1'))"], stdout=subprocess.PIPE)

output = p.communicate()[0]

#remove commas evaluate sometimes puts into IDNUMBERs
output = output.decode("utf-8")
output=output.replace(',','')
mySMILESdf = pd.read_csv(StringIO(output), sep=';', names= ['SMILES', 'logP', 'logD', 'apka', 'bpka', 'atomCount', 'mass', 'HBA', 'HBD', 'PSA', 'RBC', 'HAC', 'FractionAromatic'])
mySMILESdf['Base'] = mySMILESdf['bpka'] > 7.5 #true or false
mySMILESdf['Acid'] = mySMILESdf['apka'] < 7.5
mySMILESdf.Base = mySMILESdf.Base.astype(int) #convert to boolean int
mySMILESdf.Acid = mySMILESdf.Acid.astype(int)
mySMILESdf['ID']= 123
mySMILESdf


mySMILESdf = pd.merge(mySMILESdf, mySMILESdf1, left_on='ID', right_on='ID')


resQuery = pd.DataFrame(mySMILESdf.mfp2.tolist())
resQuery["LogP"] = mySMILESdf.logP
resQuery["LogD"] = mySMILESdf.logD
resQuery["Acid"] = mySMILESdf.Acid
resQuery["Base"] = mySMILESdf.Base
resQuery["Mass"] = mySMILESdf.mass
resQuery["HBA"] = mySMILESdf.HBA
resQuery["HBD"] = mySMILESdf.HBD
resQuery["PSA"] = mySMILESdf.PSA
resQuery["RBC"] = mySMILESdf.RBC
resQuery["HAC"] = mySMILESdf.HAC
resQuery["FractionAromatic"] = mySMILESdf.FractionAromatic

resQuery


#uncomment next two lines to check from command line
#print(rfdf.predict(resQuery))
#print(rfdf.predict_proba(resQuery))

#['1']
#[[ 0.23  0.77]]


pred = rfdf.predict(resQuery).astype(str).astype(int)
if pred == 1:
    result = "Positive"
if pred == 0:
    result = "Negative" 

print(result)

Testing the Python Script

I did my testing using the command line, if you uncomment the print statements in the above script you can follow the process

In a Terminal window type this (you will need to include the full path to the script and it assumes the pickle file is in the same folder as the python script.

python /Users/username/Desktop/AMESTest/AMESmachineLearningModelOnlySMILESinput.py "[H]OS(=O)C1=C([H])C([H])=C(Cl)C([H])=C1[H]"
Loading from pickle
['0']
[[ 0.95  0.05]]
Negative

This seems to be working fine, so you can comment out the print statements other than the last "print(result)".

We can now access this python script from a Vortex script using subprocess. Note we have to use explicit paths to ensure the script uses the correct python, you will need to edit these paths.

The Vortex Script

# Using python AMES model
# 

import sys
import com.dotmatics.vortex.util.Util as Util
import subprocess
import os



smiles = ''
col = vtable.findColumnWithName('Structure', 0)
rows = vtable.getRealRowCount()
for r in range(0, int(rows)):

    if (col == None):
        vortex.alert('Load a workspace with a Structure column please.')
        quit()

    else:
        smiles = vortex.getMolProperty(vtable.getMolFileManager().getMolFileAtRow(r), 'SMILES')

    #vortex.alert('SMILES = '+smiles)


    p = subprocess.Popen(['/usr/local/miniconda/bin/python', '/Users/Chris/Desktop/AMESTest/AMESmachineLearningModelOnlySMILESinput.py', smiles], stdout=subprocess.PIPE)

    output = p.communicate()[0]

    colAMES = vtable.findColumnWithName('AMES Prediction',1)
    colAMES.setValueFromString(r, output)
vtable.fireTableStructureChanged()

#command line access    
#python /Users/Chris/Desktop/AMESTest/AMESmachineLearningModelOnlySMILESinput.py "N1C=CC2=C(C=CC=C12)C1=C2C=C3C=CC=CC3=CC2=CC=C1"

The result can be seen in the image below

ames

You can download the Python and Vortex scripts and pickle file here AMESTest.zip, they were created using Python 3.6 and the pickle file requires that you have access to ChemAxon Evaluate to calculate the logP/D and pKa data etc.

This should be regarded as a "proof of concept" it shows that it is possible to access user built machine learning models to provide custom calculations within Vortex. At the moment all the paths are hard coded to a local instance, it would be better if the models were stored on a central server that anyone could access. This would mean that everyone was using the same versions of the models and it would be much easier to maintain. I've seen Jupyter notebooks for using a wide variety of machine learning and Artificial Intelligence tools so this could bring access to a huge variety of models. If anyone has any they would like to share I'd be happy to include them.

Last Updated 11 June 2018