Macs in Chemistry

Insanely great science


Physicochemical Property Profile iPython Notebook

I've been making increasing use of iPython notebooks, both as a way to perform calculations but also as a way of cataloging the work that I've been doing. One thing I seem to be doing quite regularly is calculating physicochemical properties for libraries of compounds and then creating a trellis of plots to show each of the calculated properties. In the past I've done this with a series of scripts using several applications. This seemed an ideal task to try out using an iPython notebook.

Since the calculation involves shape calculations we need to start with 3D structures, these were generated using MOE and I also calculated Normalized ratios of principal moments of inertia descriptors DOI and ensured the ID of each molecule was stored in the molecule. This was then exported in sdf file format.

The file format should look like this


The iPython Notebook

First we need to import the necessary modules

from StringIO import StringIO
import pandas as pd
import numpy as np
import pandas as pd
from scipy import stats, integrate
import matplotlib.pyplot as plt
import seaborn as sns
from rdkit import Chem
from rdkit.Chem import AllChem
from numpy import linalg
from rdkit.Chem import PandasTools
import os
from rdkit import RDConfig
import subprocess

We are going to use matplotlib to generate the plots, %matplotlib inline allows the plots to be shown within the notebook directly below the code cell that produced it. The resulting plots will then also be stored in the notebook document.

%matplotlib inline

Then we define the path to the sdf file we want to process

sdfFilePath = '/Users/username/Desktop/FragmentLibrary.sdf'

The first calculation uses uses the ChemAxon Evaluator to calculate a number of properties. These include IDNUMBER', 'SMILES', 'logP', 'logD', 'apka', 'bpka', 'atomCount', 'mass', 'HBA', 'HBD', 'PSA', 'RBC', 'HAC', 'FractionAromatic'. The subprocess module is used to call evaluate and the output captured.

p = subprocess.Popen(['/Applications/ChemAxon/MarvinBeans/bin/evaluate', sdfFilePath, '-g', '-e', "field('IDNUMBER');molString('smiles');logp(); logd('7.4'); apka('1'); bpka('1'); atomCount(); mass(); acceptorcount(); donorcount(); topologicalPolarSurfaceArea(); rotatablebondcount(); atomCount()-atomCount('1');aromaticAtomCount()/(atomCount()-atomCount('1'))"], stdout=subprocess.PIPE)

output = p.communicate()[0]

The results are then read into a Pandas dataframe, note the output uses ";" as the delimiter.

df = pd.read_csv(StringIO(output), sep=';', names= ['IDNUMBER', 'SMILES', 'logP', 'logD', 'apka', 'bpka', 'atomCount', 'mass', 'HBA', 'HBD', 'PSA', 'RBC', 'HAC', 'FractionAromatic'])

We can check it has been read correctly by looking at the first few rows. I've removed the SMILES for clarity.

IDNUMBER SMILES logP logD apka bpka atomCount mass HBA HBD PSA RBC HAC FractionAromatic
0 MOLID1 [...] -1.13 -1.13 NaN -5.95 34 252.27 4 0 74.76 3 18 0.00
1 MOLID2 [...] 2.37 -0.87 NaN 10.28 44 246.40 0 2 21.05 4 18 0.33
2 MOLID3 [...] 0.52 0.52 10.78 -0.08 24 214.24 3 1 68.29 5 14 0.36
3 MOLID4 [...] 1.09 1.09 18.78 -3.65 32 233.27 2 0 46.61 4 17 0.35
4 MOLID5 [...] 1.31 1.31 NaN -3.68 29 219.24 2 0 46.61 3 16 0.38

We now need to annotate the results with whether a molecule is acid, base, zwitterion or neutral. First we add a couple of columns flagging whether the molecule contains an acid or basic group based on the calculated pKa. Then we define each molecule based on the logical calculations.

df['Base'] = df['bpka'] > 8
df['Acid'] = df['apka'] < 6.5

def calculate_ABZN(row):
     if row['Acid'] and row['Base']:
                return 'Zwitterion'
        elif row['Acid'] and not row['Base']:
                return 'Acid'
        elif not row['Acid'] and row['Base']:
                return 'Base'
        elif not row['Acid'] and not row['Base']:
                return 'Neutral'

df['ABZN'] = df.apply(calculate_ABZN, axis=1)

We now load the original sdf file into a Pandas data frame, and ensure the npr descriptors are the correct datatype.

moedf = PandasTools.LoadSDF(sdfFilePath,molColName='Molecule')

#convert npri and npr2 to datatype float
moedf[['npr1', 'npr2']] = moedf[['npr1', 'npr2']].astype(float)

We can then define the plane of best fit (PBF) calculation using rdkit submitted code DOI. Usage requires a local installation of Eigen (, which is used internally to solve the eigenvalue problem.

def GetBestFitPlane(pts,weights=None):
    if weights is None:
        wSum = len(pts)
        origin = np.sum(pts,0)
    origin /= wSum

    sums = np.zeros((3,3),np.double)
    for pt in pts:
        dp = pt-origin
        for i in range(3):
            sums[i,i] += dp[i]*dp[i]
            for j in range(i+1,3):
                sums[i,j] += dp[i]*dp[j]
                sums[j,i] += dp[i]*dp[j]
    sums /= wSum
    vals,vects = linalg.eigh(sums)
    order = np.argsort(vals)
    normal = vects[:,order[0]]
    plane = np.zeros((4,),np.double)
    plane[:3] = normal
    plane[3] = -1 *
    return plane

def PBFRD(mol,confId=-1):
    conf = mol.GetConformer(confId)
    if not conf.Is3D():
        return 0

    pts = np.array([list(conf.GetAtomPosition(x)) for x in range(mol.GetNumAtoms())])
    plane = GetBestFitPlane(pts)
    denom =[:3],plane[:3])
    denom = denom**0.5
    # add up the distance from the plane for each point:
    res = 0.0
    for pt in pts:
        res += np.abs([:3])+plane[3])
    res /= denom
    res /= len(pts)
    return res

Then read in the original sdf file, perform the PBF calculation and read the results into a data frame.

suppl = Chem.SDMolSupplier(sdfFilePath, removeHs=False)
pbfs = []
for mol in suppl:
        pbfs.append((mol.GetProp('IDNUMBER'), PBFRD(mol)))

pbfdf = pd.DataFrame(pbfs, columns=['IDNUMBER', 'PBF'])

We now merge the different data frames using the IDNUMBER as the join key.

resultCA_PBF = pd.merge(df, pbfdf, left_on='IDNUMBER', right_on='IDNUMBER')
resultsdf = pd.merge(resultCA_PBF, moedf, left_on='IDNUMBER', right_on='IDNUMBER')

To plot the results we are going to use Seaborn a Python visualization library based on matplotlib. First we define all the properties we want to plot.

myLogP = resultsdf['logP']
myLogD = resultsdf['logD']
myRBC = resultsdf['RBC']
myHBA = resultsdf['HBA']
myHBD = resultsdf['HBD']
myMass = resultsdf['mass']
myTPSA = resultsdf['PSA']
myHAC = resultsdf['HAC']
myFraromatic = resultsdf['FractionAromatic']
myABZN = resultsdf['ABZN']
myPBF = resultsdf['PBF']
myNP1 = resultsdf['npr1']
myNP2 = resultsdf['npr2']

To plot we take advantage of the ability to define multiple plots laid out in a grid, in this case 3 rows and 4 columns.

For categorical data we can use countplot to show the counts of observations in each categorical bin using bars. For continuous data we can use distplot to Flexibly plot a univariate distribution of observations. To generate a scatterplot we use regplot

Depending on the plot we can alter the number of bins or the order in which they are displayed.

fig, ((ax1, ax2, ax3, ax4),(ax5, ax6, ax7, ax8), (ax9, ax10, ax11, ax12)) = plt.subplots(nrows=3, ncols=4, figsize=(15,10))
sns.distplot(myLogP,kde=False, bins =10, hist_kws={"alpha": 1, "color": "blue"}, ax=ax1);#to unmute colors
sns.distplot(myLogD,kde=False, bins =10, color='blue',ax=ax2);
sns.distplot(myMass,kde=False, hist_kws={"alpha": 1, "color": "red"}, ax=ax3);
sns.distplot(myTPSA,kde=False, ax=ax4);
sns.countplot(myHBA, color='green', ax=ax5);
sns.countplot(myHBD, color='green',ax=ax6);
sns.countplot(myHAC, color='red', ax=ax7);
sns.countplot(myRBC, color='red', ax=ax8);
sns.countplot(myABZN, order = ['Acid', 'Base', 'Neutral', 'Zwitterion'], ax=ax9);
sns.distplot(myFraromatic,kde=False, hist_kws={"alpha": 1, "color": "blue"}, ax=ax10);
sns.distplot(myPBF,kde=False, bins =5, color='blue',ax=ax11);
sns.regplot(myNP1, myNP2, fit_reg=False, ax=ax12)


Finally we save the image

fig.savefig('test2png.png', dpi=100)

The iPython Notebook can be downloaded from here

Last updated 7 March 2016.