Macs in Chemistry

Insanely great science


A workflow for docking/virtual screening

Whilst high-throughput screening (HTS) has been the starting point for many successful drug discovery programs the cost of screening, the lack of accessibility of a large diverse sample collection, or low throughput of the primary assay may preclude HTS as a starting point and identification of a smaller selection of compounds with a higher probability of being a hit may be desired. Directed or Virtual screening is a computational technique used in drug discovery research designed to identify potential hits for evaluation in primary assays. It involves the rapid in silico assessment of large libraries of chemical structures in order to identify those structures that most likely to be active against a drug target. The in silico screen can be based on known ligand similarity or based on docking ligands into the desired binding site.

In this workflow I'll be looking at using docking to identify potential hits. There are a number of docking algorithms available some are listed below

AutoDock Vina is reported to be orders of magnitude faster than AutoDock whilst improving binding mode predictions. Smina is a fork of Autodock Vina that focuses on improving scoring and minimization. More details are disclosed in this publication DOI. I used smina and there are pre-built binaries available for Mac OSX and Linux.

Ligands for docking

Whilst there are a number of sites that offer compilations of structures for docking probably the most comprehensive is ZINC a free database of commercially-available compounds for virtual screening containing 35 million purchasable structures. The ZINC structures are nicely categorised so you can download subsets based on calculated physicochemical properties.

Once downloaded you will need to generate multiple reasonable conformations for each molecule. Whilst systematic enumeration of all conformational space is possible, with flexible ligands in can rapidly lead to an explosion in the number of possible conformations. For a reviews see DOI and DOI.

The generation of conformations for small molecules is a problem of continuing interest in cheminformatics and computational drug discovery. This review will present an overview of methods used to sample conformational space, focusing on those methods designed for organic molecules commonly of interest in drug discovery.

Since the release 2015.09.1, a new conformer generator method is available in the RDKit, termed ETKDG this is a knowledge-based method that combines the distance geometry approach with experimental torsion-angle preferences obtained from small-molecule crystallographic data DOI. A jupyter notebook was created to run the conformation generation, using the parallelisation code contribution from Andrew Dalke and described in the rdkit cookbook, code is shown below and the notebook can be downloaded here. One thing to note is that to aid molecule tracking the Molecule name field needs to be populated, this the row just below the header in the sdf file. There is a definition of the SDF file format here.


from collections import defaultdict
import sys
import numpy as np  #<span style="font:11px Menlo-Regular; color:#000000;">1.12.0</span>
from rdkit import Chem # rdkit 2017.03.1
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
%pylab inline

#determine number of cores available but don't use all cores
numcores =!sysctl -n hw.ncpu
max_workers = int(numcores[0]) -2
Out[30]: 10

#check number of structures
inputfile = [x for x in Chem.SDMolSupplier('/Users/Chris/Desktop/ZincLeadsNow/21_p1.4.sdf',removeHs=False)]
Out[31]: 149313

# Download this from
from concurrent import futures

# Download this from
import progressbar
#seems not to work in notebook?

# This function is called in the subprocess.
# The parameters (molecule and number of conformers) are passed via a Python

ps = AllChem.ETKDG()
#Edit for number of confs desired eg n = 50
def generateconformations(m, n, name):
    m = Chem.AddHs(m)
    ids=AllChem.EmbedMultipleConfs(m, n, ps)
    for id in ids:
        AllChem.UFFOptimizeMolecule(m, confId=id)
    return m, list(ids), name

smi_input_file, sdf_output_file = sys.argv[1:3]

writer = Chem.SDWriter('/Users/Chris/Desktop/ZincLeadsNow/21_p1.4confs.sdf')
suppl = [x for x in Chem.SDMolSupplier('/Users/Chris/Desktop/ZincLeadsNow/21_p1.4.sdf',removeHs=False)]
#suppl = Chem.SmilesMolSupplier(smi_input_file, titleLine=False)

# for mol in suppl:
#     print(mol.GetPropsAsDict(includePrivate=True).get('_Name'))

with futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
    # Submit a set of asynchronous jobs
    jobs = []
    for mol in suppl:
        if mol:
            name = mol.GetProp('_Name')
            job = executor.submit(generateconformations, mol, n, name)

    widgets = ["Generating conformations; ", progressbar.Percentage(), " ",
               progressbar.ETA(), " ", progressbar.Bar()]
    pbar = progressbar.ProgressBar(widgets=widgets, maxval=len(jobs))
    for job in pbar(futures.as_completed(jobs)):
        mol, ids, name = job.result()
        mol.SetProp('_Name', name)
        for id in ids:
            writer.write(mol, confId=id)

The target protein

More than 58 million children are afflicted annually with diarrheal disease associated with the most prevalent infections of the small intestine, including Escherichia coli, Rotavirus, Giardia lamblia, and Cryptosporidium parvum, which ultimately results in the death of 2.5 million children. C. parvum is an obligate parasite in the same phylum of Apicomplexa as Plasmodium and the same order of Eucoccidiorida as Toxoplasma and Eimeria. It is one of the pathogenic agents responsible for cryptosporidiosis, a zoonotic and enteric disease. Children in resource-poor settings are particularly at risk, not only with an increased incidence of Cryptosporidium spp. infection, but also with increased acute and long-lasting morbidity.

The Protein Data Bank is a fantastic resource that at the time of writing contains 41305 Distinct Protein Sequences. For this study I'll be using 2wei the kinase domain of Cryptosporidium parvum calcium dependent protein kinase in complex with 3-MB-PP1 DOI. Whilst scientists will often debate the accuracy and validity of the algorithms used it is equally important to check the quality of the starting protein structure.

We tend to believe that PDB files are a message from God (Derek Lowe, In The Pipeline).

The quality of an X-ray crystal structure can vary and it is useful to understand how the model is derived from the electron density maps. The crystallographer builds a 3D model of the protein attempts to fit it to the electron density then uses the built structure to generate an electron density map, they then use this to derive a difference map to show where the modelled structure does not match the experimental density map. The modelled structure is then adjusted and the process repeated until they have a refined structure.

As the image below highlights the resolution can have a significant impact on the quality of the eventual structure. Untitled

If you are downloading a PDB structure the page for each entry provides a lot of very useful information, giving the resolution and a graphical display of various parameters (red=Poor, Blue=Good).


The Molecular Description provides information about the protein but also indicates which residues are in the crystal structure, many crystal structures may have been modified to aid crystallisation. This information is also contained within the header of the PDB file you can download and can be read in a text editor.

Unfortunately, even when working with a high-resolution x-ray crystallographic structure, researchers can spend considerable time and effort correcting common problems such as missing hydrogen atoms, incomplete side chains and loops, ambiguous protonation states, and flipped residues. Fortunately there are a number of software tools that can be used to tidy up the structures such as the "Protein Preparation tools" in MOE or Schrödinger.

Once the pub file was downloaded it was opened in MOE and the structure corrected, the ligand was then selected and saved in PDB format the protein minus the ligand was also then saved in PDB format.


Some of the potential issues are :-

We could check that all is working by redocking the ligand in the x-ray structure but that seems a slightly trivial exercise, it would be better to use a range of known ligands and use them in a docking study.

There is currently no data in ChEMBL related to this target but there are a number of other CDPK1 targets that have been the subject of screening efforts. CDPK1 is a potential malaria target and the results of a screen are available with confirmation of hits identified. It might be interesting to dock molecules shown to be active at this related protein to see if useful starting points can be identified.


The 581 molecules with IC50 data were downloaded and imported into MOE, the structures were then "washed" to move counterions, correct structures etc. then exported in SDF format and subjected to conformation generation to yield a file containing 2579 conformations.

Running the docking

With Smina installed it is relatively easy to run the docking from the command line. The —cpu option allows you define how many cores to use, I usually keep a couple back to keep my machine responsive to user input. Since I have multiple cores on my MacPro I define how many to use, for reproducibility, we specify a random number seed. The bounding box for docking is specified automatically with the autobox ligand option which creates a box with an 8 ĚŠA buffer around the provided ligand. The results are saved in a compressed sdf file.

#Docking using smina
'/usr/local/bin/smina.osx' --cpu 10 --seed 0 --autobox_ligand '/Users/Chris/Projects/DockingTestCase/ligands.pdb' -r '/Users/Chris/Projects/DockingTestCase/2wei_alligned.pdb' -l '/Users/Chris/Projects/DockingTestCase/ChemblBioactivityConfs.sdf' -o '/Users/Chris/Projects/DockingTestCase/All_Docked.sdf.gz''

Analysis of the results

The output from the docking run is a compressed sdf file that can be read into Vortex as shown below, the output includes an energy calculation that can be used to select molecules for further evaluation. However it is also possible to apply multiple scoring functions and the use of multiple scoring functions is now well established.


Recently, machine-learning scoring functions trained on protein-ligand complexes have shown significant promise an example being (RF-Score-VS) trained on 15 426 active and 893 897 inactive molecules docked to a set of 102 targets DOI.

Our results show RF-Score-VS can substantially improve virtual screening performance: RF-Score-VS top 1% provides 55.6% hit rate, whereas that of Vina only 16.2% (for smaller percent the difference is even more encouraging: RF-Score-VS top 0.1% achieves 88.6% hit rate for 27.5% using Vina). In addition, RF-Score-VS provides much better prediction of measured binding affinity than Vina (Pearson correlation of 0.56 and −0.18, respectively). Lastly, we test RF-Score-VS on an independent test set from the DEKOIS benchmark and observed comparable results.

Binaries for RF-Score-VS are available and requires only minimal input

Thus the command line is

'/usr/local/bin/rf-score-vs_v1/rf-score-vs', '--receptor', pdbFile, sdfFile, '-o', 'csv', '--field', 'name', '--field', 'RFScoreVS_v2'

This can be accessed via a Vortex script, first we get the path to the imported SDF file, then use a dialog box to get the path to the PDB file used for the docking. Then we construct the command for rescoring and submit it. Finally we parse the output returned and populate the workspace.

Vortex script for rescoring docking results

#script to rescore binding results
# Performance of machine-learning scoring functions in structure-based virtual screening
# Maciej Wjcikowski, Pedro J. Ballester & Pawel Siedlecki
# doi:

import sys
import os
import subprocess

# Get the path to the currently open sdf file
sdfFile = vortex.getFileForPropertyCalculation(vtable)

#uncomment for testing

#if file:
#   vortex.alert(str(sdfFile))

# Get path to PDB for docking

# Open a dialog to choose a file
# getFile(title, extensions, 0 = Open, 1 = Save)
# vortex will keep track of the last folder you looked in etc

file = vortex.getFile("Choose the pdb the ligands were docked into", [".pdb"], 0)
#display file path, uncomment for testing
#if file:
#   vortex.alert(str(file))


#column = vtable.findColumnWithName('RFScoreVS_v2', 1, 3)

p = subprocess.Popen(['/usr/local/bin/rf-score-vs_v1/rf-score-vs', '--receptor', pdbFile, sdfFile, '-o', 'csv', '--field', 'name', '--field', 'RFScoreVS_v2'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

output = p.communicate()[0]

# Parse the output, first line names for column

lines = output.split('\r\n')
colName = lines[0].split(',')
for c in colName:
    column = vtable.findColumnWithName(c, 1)

rows = lines[1:len(lines)]
for r in range(0, vtable.getRealRowCount()):
    vals = rows[r].split(',')
    for j in range(0, len(vals)):
        column = vtable.findColumnWithName(colName[j], 0)
        column.setValueFromString(r, vals[j])

This script rescores all the docking poses and populates the table as shown below. Sometimes it can be a little difficult to discern the 3D structures displayed in Vortex so a useful trick is to click on the "Tools" menu and select "Calculate Properties", then in the dialog box select the check box alongside "SMILES code of molecule". An additional column will be added that contains the SMILES string rendered as a 2D molecule.



The next task is to then select the molecules for biological screening. You can work your way through the list line by line looking at the docking or you can simply choose the lowest energy pose for each molecule. The following vortex script selects the lowest energy pose, it requires that there is a unique identifier (name) for each row in the workspace. Once selected you can then export the selected structures to a new workspace. In a similar manner you can select the highest scoring poses.

Vortex script for selecting lowest energy

#Selecting lowest energy poses from docking run.

#Assumes there is a column containing unique identifier for each molecule

# Vortex imports
import javax
import com.dotmatics.vortex.util.Layout as layout
import com.dotmatics.vortex.components as components
import time
import com.dotmatics.vortex.util.Util as Util
import com.dotmatics.vortex.mol2img.jni.genImage as genImage
import com.dotmatics.vortex.mol2img.Mol2Img as mol2Img
import jarray
import binascii
import string
import os

SelColumn = vtable.findColumnWithName("Selected",1)

input_label = swing.JLabel("ID column")
input_cb = workspace.getColumnComboBox()
panel = swing.JPanel()

layout.fill(panel, input_label, 0, 0)
layout.fill(panel, input_cb,    1, 0)

retID = vortex.showInDialog(panel, "Choose ID column")

if retID == vortex.OK:
    input_idx = input_cb.getSelectedIndex()

    if input_idx == 0:
        vortex.alert("you must choose a column")
        colname = vtable.getColumn(input_idx - 1)

input_label = swing.JLabel("Select Energy column")
input_cb = workspace.getColumnComboBox()
panel = swing.JPanel()

layout.fill(panel, input_label, 0, 0)
layout.fill(panel, input_cb,    1, 0)

ret = vortex.showInDialog(panel, "Choose column containing the energy")

if ret == vortex.OK:
    input_idx = input_cb.getSelectedIndex()

    if input_idx == 0:
        vortex.alert("you must choose a column")
        colscore = vtable.getColumn(input_idx - 1)

        bestScores = {}

        rows = vtable.getRealRowCount()
        for r in range(0, int(rows)):
            molName = colname.getValueAsString(r)
            molScore = colscore.getValue(r)

            # Update score in bestScores if not seen yet or if better than previous
            if molName not in bestScores or (molName in bestScores and molScore < bestScores[molName][0]):
                bestScores[molName] = (molScore, r)

        for score, r in bestScores.values():
            SelColumn.setValueFromString(r, "Selected")

        for score, r in bestScores.values():
            vtable.setRowSelected(r, 1) 


Exploring Docking Results

With a workspace containing the selected docking results we can use the embed AstexViewer to look at the docked poses

Add AstexViewer Script

#Add an AstexViewer plot

import com.dotmatics.vortex.plots as plots

oav = plots.OpenAstexViewerPlot()



frame = vws.addComponent(oav, "Structures")

frame.setSize(java.awt.Dimension(600, 600))

# Make all popup menus heavyweight (i.e. real components)
# so that they will appear above the OAV plot.
# Gives access to the Detached menu setting amongst other things
# which is useful for looking at structure on another monitor.

Right click on the plot area to access the plot setting dialog box, click on Configure single protein to navigate to the PDB file used in the docking.


Given that the ligands that were docked were taken ChEMBL we also have the activity at the related CDPK1 target and it is perhaps interesting to see how the docking results reflect the experimental affinities. The bioactivity data was appended to the workspace using the ChEMBL ID as the key. A plot of IC50 versus molecular weight (MW), (below) shows that these is no correlation between activity and molecular weight.


However if we plot the affinity score versus MW these appears to be a bias with higher molecular weight molecules yielding lower energies. Colour coding the molecules with IC50 <100 nM (green) serves to emphasise the bias towards higher molecular weight compounds. This bias is perhaps not surprising since there is little penalty for adding MW to the ligands and they can often potentially pick up additional binding.


If we do a similar analysis based on the poses selected based on RF-Score-VS (below) it does seem there is much less of a bias towards high molecular weight compounds.


The next step is to screen a large library of novel ligands, that is described in the following post A workflow for docking/virtual screening 2.

All the Vortex scripts can be downloaded here

Last Updated 25 August 2017