A Jupyter Notebook to compare similarity between molecules

This notebook demonstrates how to read the structures and data from the master worksheet, convert the SMILES strings to molecule objects, and then compare the molecules by similarity. SMILES (Simplified Molecular Input Line Entry System) is a line notation (a typographical method using printable characters) for entering and representing molecules and reactions. See https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html

Getting the data

Import the required Python modules, then read the example2.tsv file into a pandas dataframe called datafile

In [1]:
from rdkit.Chem import AllChem as Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
from rdkit.Chem import Draw
from rdkit import DataStructs

import numpy
import seaborn as sns
import matplotlib

import pandas as pd
In [2]:
#Allow inline images
%matplotlib inline
In [3]:
#If you want to read a local file then simply edit this filepath
#datafile = pd.read_csv('myfile.tsv', sep = '\t')

#The file format is tab separated text
#Mol_ID	SMILES_parent	Name
#OSA_000001	CN1CCN(CC1)c1ccc(cc1)C#N	4-(4-methylpiperazin-1-yl)benzonitrile
#OSA_000002	CN1CCN(CC1)C(=O)NC1=CC=C(F)C=C1	N-(4-fluorophenyl)-4-methylpiperazine-1-carboxamide

datafile = pd.read_csv('example2.tsv', sep = '\t')
In [4]:
#View first five rows
datafile.head(5)
Out[4]:
Mol_ID SMILES_parent Name
0 OSA_000001 CN1CCN(CC1)c1ccc(cc1)C#N 4-(4-methylpiperazin-1-yl)benzonitrile
1 OSA_000002 CN1CCN(CC1)C(=O)NC1=CC=C(F)C=C1 N-(4-fluorophenyl)-4-methylpiperazine-1-carbox...
2 OSA_000003 CC(CO)(CO)NC(=O)Nc1ccccc1 3-(1,3-dihydroxy-2-methylpropan-2-yl)-1-phenyl...
3 OSA_000004 CC(C)C(=O)Nc1cccc(c1)C#N N-(3-cyanophenyl)-2-methylpropanamide
4 OSA_000005 O=S1(CCN(CC1)Cc2ccc(C)cc2)=O 4-(4-methylbenzyl)thiomorpholine 1,1-dioxide
In [5]:
#Find how many rows
len(datafile.index)
Out[5]:
12
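
The loading step can be tried without a file on disk. The sketch below parses the two sample rows quoted in the file-format comment from an in-memory string. The names `sample_tsv` and `sample` are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# Two rows copied from the file-format comment in the notebook (tab separated).
sample_tsv = (
    "Mol_ID\tSMILES_parent\tName\n"
    "OSA_000001\tCN1CCN(CC1)c1ccc(cc1)C#N\t4-(4-methylpiperazin-1-yl)benzonitrile\n"
    "OSA_000002\tCN1CCN(CC1)C(=O)NC1=CC=C(F)C=C1\t"
    "N-(4-fluorophenyl)-4-methylpiperazine-1-carboxamide\n"
)

# read_csv accepts any file-like object, so StringIO stands in for the .tsv file.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(sample.shape)
print(list(sample.columns))
```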

Convert the SMILES string to an RDKit molecular object

We can see the different datatypes in the dataframe

In [6]:
datafile.dtypes
Out[6]:
Mol_ID           object
SMILES_parent    object
Name             object
dtype: object

At the moment the molecule structures are represented by SMILES strings. We can convert each SMILES string to an RDKit molecular object and then display it.

Adding structures to pandas dataframe

We can now convert the SMILES string to an RDKit molecular object for every row in the dataframe

In [7]:
PandasTools.AddMoleculeColumnToFrame(datafile,'SMILES_parent','Molecule',includeFingerprints=True)
#Confirm that the new column has been added
print([str(x) for x in datafile.columns])
#['Mol_ID', 'SMILES_parent', 'Name', 'Molecule']
In [8]:
datafile.dtypes
Out[8]:
Mol_ID           object
SMILES_parent    object
Name             object
Molecule         object
dtype: object
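
For a single structure, the conversion can be done directly with `Chem.MolFromSmiles`, which returns `None` when a SMILES string cannot be parsed. The sketch below is an illustration rather than part of the original workflow; the helper name `parse_smiles` and the deliberately malformed string are made up for the example.

```python
from rdkit import Chem


def parse_smiles(smiles):
    """Return an RDKit Mol for a SMILES string, or None if parsing fails."""
    # Hypothetical helper: MolFromSmiles itself signals failure by returning None
    # (RDKit also logs a parse error to stderr).
    return Chem.MolFromSmiles(smiles)


good = parse_smiles("CN1CCN(CC1)c1ccc(cc1)C#N")  # OSA_000001 from the example file
bad = parse_smiles("C1CC(")                      # unclosed ring and branch: invalid

print(good.GetNumAtoms())  # heavy atoms only, hydrogens are implicit
print(bad is None)
```

A check such as `datafile['Molecule'].isnull().sum()` after building the molecule column is one way to count rows that failed to parse.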

If we view the dataframe, the molecule object has been added as the last column. The structure would be easier to see next to the ID, so we change the column order.

In [9]:
datafile.head(3)
Out[9]:
Mol_ID SMILES_parent Name Molecule
0 OSA_000001 CN1CCN(CC1)c1ccc(cc1)C#N 4-(4-methylpiperazin-1-yl)benzonitrile Mol
1 OSA_000002 CN1CCN(CC1)C(=O)NC1=CC=C(F)C=C1 N-(4-fluorophenyl)-4-methylpiperazine-1-carboxamide Mol
2 OSA_000003 CC(CO)(CO)NC(=O)Nc1ccccc1 3-(1,3-dihydroxy-2-methylpropan-2-yl)-1-phenylurea Mol
In [10]:
#display the current order
cols = list(datafile.columns.values)
cols
Out[10]:
['Mol_ID', 'SMILES_parent', 'Name', 'Molecule']
In [11]:
#change the column order
datafile = datafile[['Mol_ID',
 'Molecule',
 'SMILES_parent',
 'Name']]
In [12]:
datafile.head(3)
Out[12]:
Mol_ID Molecule SMILES_parent Name
0 OSA_000001 Mol CN1CCN(CC1)c1ccc(cc1)C#N 4-(4-methylpiperazin-1-yl)benzonitrile
1 OSA_000002 Mol CN1CCN(CC1)C(=O)NC1=CC=C(F)C=C1 N-(4-fluorophenyl)-4-methylpiperazine-1-carboxamide
2 OSA_000003 Mol CC(CO)(CO)NC(=O)Nc1ccccc1 3-(1,3-dihydroxy-2-methylpropan-2-yl)-1-phenylurea

If we want to view all the structures we can display them as a grid

In [13]:
PandasTools.FrameToGridImage(datafile,column= 'Molecule', molsPerRow=4,subImgSize=(150,150),legendsCol="Mol_ID")
Out[13]:

Calculation of molecular similarities

Now calculate a Morgan fingerprint (radius 2) for each molecule using RDKit and add it to the end of the dataframe as a new column. These fingerprints are what the similarity calculations below compare.

In [14]:
fplist = [] #one Morgan fingerprint per molecule
for mol in datafile['Molecule']:
    fp = Chem.GetMorganFingerprintAsBitVect(mol, 2) #radius 2
    fplist.append(fp)
In [15]:
datafile['mfp2']=fplist
In [16]:
datafile.head(3)
Out[16]:
Mol_ID Molecule SMILES_parent Name mfp2
0 OSA_000001 Mol CN1CCN(CC1)c1ccc(cc1)C#N 4-(4-methylpiperazin-1-yl)benzonitrile [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]
1 OSA_000002 Mol CN1CCN(CC1)C(=O)NC1=CC=C(F)C=C1 N-(4-fluorophenyl)-4-methylpiperazine-1-carboxamide [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]
2 OSA_000003 Mol CC(CO)(CO)NC(=O)Nc1ccccc1 3-(1,3-dihydroxy-2-methylpropan-2-yl)-1-phenylurea [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]
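
A Morgan fingerprint with radius 2 encodes the circular environment of each atom out to two bonds and hashes those environments into a fixed-length bit vector. The sketch below inspects one such fingerprint for OSA_000001. It passes `nBits=2048` explicitly, which is assumed here to match the default used by the call in the notebook; check that against your RDKit version. Newer RDKit releases may also emit a deprecation warning for this function.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CN1CCN(CC1)c1ccc(cc1)C#N")  # OSA_000001

# Radius 2, folded into 2048 bits (assumed to match the notebook's default).
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

on_bits = list(fp.GetOnBits())  # indices of the bits that are set to 1
print(fp.GetNumBits())          # length of the bit vector
print(fp.GetNumOnBits())        # number of bits that are set
print(on_bits[:5])              # first few on-bit positions
```

Only a small fraction of the bits is set for a molecule of this size, which is why the dataframe preview shows mostly zeros.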

Comparing the similarity of two molecules

In [17]:
fp1=datafile.at[0,'mfp2']
In [18]:
fp2=datafile.at[1,'mfp2']
In [19]:
from rdkit import DataStructs
DataStructs.DiceSimilarity(fp1,fp2)
Out[19]:
0.4
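
The Dice coefficient for two bit vectors is twice the number of bits set in both, divided by the total number of bits set in each: 2|A ∩ B| / (|A| + |B|). The pure-Python sketch below applies that formula to sets of on-bit indices. The index sets are invented for illustration (they are not the real fingerprints) and were chosen so that the result matches the 0.4 seen above; `dice` and `tanimoto` are hypothetical helper names.

```python
def dice(a, b):
    """Dice coefficient of two sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return 2 * len(a & b) / (len(a) + len(b))


def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient, shown for comparison."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


bits_a = set(range(0, 10))   # 10 bits set
bits_b = set(range(6, 16))   # 10 bits set, 4 shared with bits_a (6, 7, 8, 9)

print(dice(bits_a, bits_b))      # 2 * 4 / (10 + 10) = 0.4
print(tanimoto(bits_a, bits_b))  # 4 / 16 = 0.25
```

For the same pair of fingerprints the Dice score is never lower than the Tanimoto score, and the two measures rank pairs in the same order.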
In [20]:
#Compare every molecule with every other molecule, adding one column of scores per molecule
for r in datafile.index:
    fp1 = datafile.at[r,'mfp2']
    colname = datafile.at[r,'Mol_ID']
    simlist = [] #similarity of molecule r to each molecule in the dataframe
    for fp in datafile['mfp2']: #reuse the stored fingerprints
        sim = DataStructs.DiceSimilarity(fp1,fp)
        simlist.append(sim)
    datafile[colname]=simlist
In [21]:
datafile.head(3)
Out[21]:
Mol_ID Molecule SMILES_parent Name mfp2 OSA_000001 OSA_000002 OSA_000003 OSA_000004 OSA_000005 OSA_000006 OSA_000007 OSA_000008 Mymol1 Mymol2 Mymol3 Mymol4
0 OSA_000001 Mol CN1CCN(CC1)c1ccc(cc1)C#N 4-(4-methylpiperazin-1-yl)benzonitrile [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...] 1.000000 0.40000 0.150943 0.357143 0.307692 0.259259 0.145455 0.291667 0.350877 0.400000 0.372881 0.385965
1 OSA_000002 Mol CN1CCN(CC1)C(=O)NC1=CC=C(F)C=C1 N-(4-fluorophenyl)-4-methylpiperazine-1-carboxamide [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...] 0.400000 1.00000 0.379310 0.327869 0.315789 0.271186 0.166667 0.188679 0.741935 0.833333 0.718750 0.677419
2 OSA_000003 Mol CC(CO)(CO)NC(=O)Nc1ccccc1 3-(1,3-dihydroxy-2-methylpropan-2-yl)-1-phenylurea [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...] 0.150943 0.37931 1.000000 0.440678 0.218182 0.210526 0.137931 0.078431 0.366667 0.379310 0.387097 0.366667
In [22]:
#difficult to view dataframe so remove fingerprint column and others
newdatafile = datafile.drop(['mfp2','SMILES_parent',"Name"], axis=1)
In [23]:
newdatafile
Out[23]:
Mol_ID Molecule OSA_000001 OSA_000002 OSA_000003 OSA_000004 OSA_000005 OSA_000006 OSA_000007 OSA_000008 Mymol1 Mymol2 Mymol3 Mymol4
0 OSA_000001 Mol 1.000000 0.400000 0.150943 0.357143 0.307692 0.259259 0.145455 0.291667 0.350877 0.400000 0.372881 0.385965
1 OSA_000002 Mol 0.400000 1.000000 0.379310 0.327869 0.315789 0.271186 0.166667 0.188679 0.741935 0.833333 0.718750 0.677419
2 OSA_000003 Mol 0.150943 0.379310 1.000000 0.440678 0.218182 0.210526 0.137931 0.078431 0.366667 0.379310 0.387097 0.366667
3 OSA_000004 Mol 0.357143 0.327869 0.440678 1.000000 0.172414 0.166667 0.163934 0.111111 0.317460 0.327869 0.461538 0.317460
4 OSA_000005 Mol 0.307692 0.315789 0.218182 0.172414 1.000000 0.642857 0.175439 0.240000 0.305085 0.315789 0.295082 0.305085
5 OSA_000006 Mol 0.259259 0.271186 0.210526 0.166667 0.642857 1.000000 0.101695 0.153846 0.262295 0.271186 0.285714 0.295082
6 OSA_000007 Mol 0.145455 0.166667 0.137931 0.163934 0.175439 0.101695 1.000000 0.188679 0.193548 0.166667 0.187500 0.161290
7 OSA_000008 Mol 0.291667 0.188679 0.078431 0.111111 0.240000 0.153846 0.188679 1.000000 0.181818 0.188679 0.175439 0.181818
8 Mymol1 Mol 0.350877 0.741935 0.366667 0.317460 0.305085 0.262295 0.193548 0.181818 1.000000 0.580645 0.484848 0.500000
9 Mymol2 Mol 0.400000 0.833333 0.379310 0.327869 0.315789 0.271186 0.166667 0.188679 0.580645 1.000000 0.812500 0.741935
10 Mymol3 Mol 0.372881 0.718750 0.387097 0.461538 0.295082 0.285714 0.187500 0.175439 0.484848 0.812500 1.000000 0.727273
11 Mymol4 Mol 0.385965 0.677419 0.366667 0.317460 0.305085 0.295082 0.161290 0.181818 0.500000 0.741935 0.727273 1.000000
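
The table above is symmetric with 1.0 on the diagonal, so each pair of molecules only needs to be scored once. The sketch below builds a matrix of that shape with pandas from toy on-bit sets. The identifiers `mol_a`, `mol_b`, `mol_c` and the bit sets are invented, and the `dice` helper is a plain-Python stand-in for `DataStructs.DiceSimilarity`.

```python
from itertools import combinations

import pandas as pd

# Toy fingerprints as sets of on-bit indices (invented, not real molecules).
toy_fps = {
    "mol_a": {1, 4, 7, 9},
    "mol_b": {1, 4, 8},
    "mol_c": {2, 3},
}


def dice(a, b):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))


ids = list(toy_fps)
# Start from zeros, then put 1.0 on the diagonal (each molecule against itself).
sim = pd.DataFrame(0.0, index=ids, columns=ids)
for i in ids:
    sim.loc[i, i] = 1.0

# Compute each unordered pair once and mirror it across the diagonal.
for i, j in combinations(ids, 2):
    sim.loc[i, j] = sim.loc[j, i] = dice(toy_fps[i], toy_fps[j])

print(sim.round(3))
```

For n molecules this needs n(n-1)/2 comparisons instead of n squared, which matters little for 12 rows but adds up for larger sets.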

Contextual colouring of dataframe

We can also use contextual colouring on the dataframe. In this instance we highlight similarity scores, but the same approach could be used to highlight affinity, IC50, or a calculated property such as LogP.

You can create “heatmaps” with the background_gradient method. These require matplotlib, and here we use Seaborn to get a nice colormap.

In [24]:
import seaborn as sns

cm = sns.light_palette("red", as_cmap=True)
s = newdatafile.style.background_gradient(cmap=cm)
s
Out[24]:
Mol_ID Molecule OSA_000001 OSA_000002 OSA_000003 OSA_000004 OSA_000005 OSA_000006 OSA_000007 OSA_000008 Mymol1 Mymol2 Mymol3 Mymol4
0 OSA_000001 Mol 1 0.4 0.150943 0.357143 0.307692 0.259259 0.145455 0.291667 0.350877 0.4 0.372881 0.385965
1 OSA_000002 Mol 0.4 1 0.37931 0.327869 0.315789 0.271186 0.166667 0.188679 0.741935 0.833333 0.71875 0.677419
2 OSA_000003 Mol 0.150943 0.37931 1 0.440678 0.218182 0.210526 0.137931 0.0784314 0.366667 0.37931 0.387097 0.366667
3 OSA_000004 Mol 0.357143 0.327869 0.440678 1 0.172414 0.166667 0.163934 0.111111 0.31746 0.327869 0.461538 0.31746
4 OSA_000005 Mol 0.307692 0.315789 0.218182 0.172414 1 0.642857 0.175439 0.24 0.305085 0.315789 0.295082 0.305085
5 OSA_000006 Mol 0.259259 0.271186 0.210526 0.166667 0.642857 1 0.101695 0.153846 0.262295 0.271186 0.285714 0.295082
6 OSA_000007 Mol 0.145455 0.166667 0.137931 0.163934 0.175439 0.101695 1 0.188679 0.193548 0.166667 0.1875 0.16129
7 OSA_000008 Mol 0.291667 0.188679 0.0784314 0.111111 0.24 0.153846 0.188679 1 0.181818 0.188679 0.175439 0.181818
8 Mymol1 Mol 0.350877 0.741935 0.366667 0.31746 0.305085 0.262295 0.193548 0.181818 1 0.580645 0.484848 0.5
9 Mymol2 Mol 0.4 0.833333 0.37931 0.327869 0.315789 0.271186 0.166667 0.188679 0.580645 1 0.8125 0.741935
10 Mymol3 Mol 0.372881 0.71875 0.387097 0.461538 0.295082 0.285714 0.1875 0.175439 0.484848 0.8125 1 0.727273
11 Mymol4 Mol 0.385965 0.677419 0.366667 0.31746 0.305085 0.295082 0.16129 0.181818 0.5 0.741935 0.727273 1