Macs in Chemistry

Insanely Great Science

RDKit code changes


I just saw this on the RDKit email circulation list and since I know a number of readers use RDKit I thought I'd mention it.

When we do the beta for the 2018.03.1 release we're going to switch the C++ backend to use modern C++ (=C++11). For people who can't switch to use that code, we will continue to provide bug fixes for the 2017.09 release for at least another 6 months.

This should only affect people who need to build the RDKit C++ code themselves. If you use a binary version of the RDKit like the ones available inside of Anaconda Python or KNIME, this change should have no impact upon you.

It looks like we're almost there. Hopefully we will be able to do a beta of the 2018.03 release by the end of the week.


RDKit in SAMSON


I've posted about SAMSON a couple of times and it just keeps getting better and better.

SAMSON is a novel software platform for computational nanoscience. Rapidly build models of nanotubes, proteins, and complex nanosystems. Run interactive simulations to simulate chemical reactions, bend graphene sheets, (un)fold proteins. SAMSON's generic architecture makes it suitable for material science, life science, physics, electronics, chemistry, and even education. SAMSON is developed by the NANO-D group at INRIA, and means "Software for Adaptive Modeling and Simulation Of Nanosystems".

A recent blog post highlights the use of RDKit in SAMSON.

In this post I will present you the RDKit-SMILES Manager module that I integrated in the SAMSON platform. As some of you know, RDKit is an open source toolkit for cheminformatics which is widely used in the bioinformatics research. One of its features is the conversion of molecules from their SMILES code to a 2D and 3D structures. Thanks to the new SAMSON Element, it is now possible to use these features in the SAMSON platform. SMILES code files (.smi) or text files (.txt) containing several SMILES codes can be read using the import button.

The new module lets you import a file containing SMILES strings and generate 2D depictions. Right-clicking on one of these images lets you open it, generate the 3D structure in SAMSON, or save the image as PNG or SVG.


It is also possible to run substructure searching using SMARTS.
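The same three operations (SMILES to 3D structure, and a SMARTS substructure test) can be sketched directly with RDKit's Python API. This is only an illustration of the underlying toolkit calls, not the SAMSON module's code; the function names `smiles_to_3d` and `matches_smarts` are my own, and the import guard simply makes both return None when RDKit is not installed.

```python
try:
    from rdkit import Chem
    from rdkit.Chem import AllChem
except ImportError:  # RDKit is treated as optional in this sketch
    Chem = None


def smiles_to_3d(smiles, seed=42):
    """Return a molecule with explicit hydrogens and one embedded 3D conformer.

    Returns None if RDKit is missing, the SMILES is invalid, or embedding fails.
    """
    if Chem is None:
        return None
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None for unparsable SMILES
        return None
    mol = Chem.AddHs(mol)  # add hydrogens before generating coordinates
    if AllChem.EmbedMolecule(mol, randomSeed=seed) < 0:  # -1 signals failure
        return None
    return mol


def matches_smarts(smiles, smarts):
    """Return True/False for a substructure match, or None if inputs are unusable."""
    if Chem is None:
        return None
    mol = Chem.MolFromSmiles(smiles)
    pattern = Chem.MolFromSmarts(smarts)
    if mol is None or pattern is None:
        return None
    return mol.HasSubstructMatch(pattern)


if __name__ == "__main__":
    # Acetic acid against a carboxylic acid pattern
    print(matches_smarts("CC(=O)O", "C(=O)[OH]"))
```

The fixed random seed is there only to make the embedding reproducible between runs.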


mmpdb: An Open Source Matched Molecular Pair Platform for Large Multi-Property Datasets


An interesting paper on ChemRxiv (DOI):

Matched Molecular Pair Analysis (MMPA) enables the automated and systematic compilation of medicinal chemistry rules from compound/property datasets. Here we present mmpdb, an open source Matched Molecular Pair (MMP) platform to create, compile, store, retrieve, and use MMP rules. mmpdb is suitable for the large datasets typically found in pharmaceutical and agrochemical companies and provides new algorithms for fragment canonicalization and stereochemistry handling. The platform is written in Python and based on the RDKit toolkit. It is freely available from
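To make the idea of an MMP rule concrete, here is a toy sketch of the pairing step in plain Python. It is not mmpdb's API or algorithm: the fragmentation into a constant and a variable part is assumed to have been done already, and the records, compound ids and property values are made up purely for illustration.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean

# Hypothetical, pre-fragmented records:
# (compound id, constant part, variable part, property value)
records = [
    ("cpd1", "[*:1]c1ccccc1", "[*:1]Cl", 2.8),
    ("cpd2", "[*:1]c1ccccc1", "[*:1]F", 2.3),
    ("cpd3", "[*:1]C1CCCCC1", "[*:1]Cl", 3.4),
    ("cpd4", "[*:1]C1CCCCC1", "[*:1]F", 3.0),
]


def find_pairs(records):
    """Pair up compounds that share a constant part but differ in the variable part."""
    by_constant = defaultdict(list)
    for cid, constant, variable, value in records:
        by_constant[constant].append((cid, variable, value))
    pairs = []
    for members in by_constant.values():
        for (id1, var1, v1), (id2, var2, v2) in combinations(members, 2):
            if var1 != var2:
                # transformation, the two compounds, and the property change
                pairs.append((var1 + ">>" + var2, id1, id2, v2 - v1))
    return pairs


def summarise_rules(pairs):
    """Aggregate pairs into rules: transformation -> (pair count, mean property change)."""
    deltas = defaultdict(list)
    for transform, _, _, delta in pairs:
        deltas[transform].append(delta)
    return {t: (len(d), mean(d)) for t, d in deltas.items()}


if __name__ == "__main__":
    for transform, (count, avg) in summarise_rules(find_pairs(records)).items():
        print(transform, count, round(avg, 2))
```

Each rule is just a transformation with the number of supporting pairs and the average property shift, which is the kind of summary the abstract describes as a medicinal chemistry rule.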


Google Summer of Code: Open Chemistry


There are a number of interesting projects being undertaken in this year's Google Summer of Code.

If you know of any students who might be interested, perhaps point them to the Open Chemistry Project.

The Open Chemistry project is a collection of open source, cross platform libraries and applications for the exploration, analysis and generation of chemical data. The organization is an umbrella of leading projects developed by long-time collaborators and innovators in open chemistry such as the Avogadro, Open Babel, and cclib projects. These three alone have been downloaded over 700,000 times and cited in over 2,000 academic papers. Our goal is to improve the state of the art, and facilitate the open exchange of chemical data and ideas while utilizing the best technologies from quantum chemistry codes, molecular dynamics, informatics, analytics, and visualization.

There is a list of the GSoC Ideas 2018 here, but of course students can add their own.


MayaChemTools


MayaChemTools is a fabulous collection of Perl and Python scripts, modules, and classes to support a variety of day-to-day computational discovery needs.

The core set of command line Perl scripts available in the current release of MayaChemTools has no external dependencies and provides functionality for the following tasks:

  • Manipulation and analysis of data in SD, CSV/TSV, sequence/alignments, and PDB files
  • Listing information about data in SD, CSV/TSV, Sequence/Alignments, PDB, and fingerprints files
  • Calculation of a key set of physicochemical properties, such as molecular weight, hydrogen bond donors and acceptors, logP, and topological polar surface area
  • Generation of 2D fingerprints corresponding to atom neighborhoods, atom types, E-state indices, extended connectivity, MACCS keys, path lengths, topological atom pairs, topological atom triplets, topological atom torsions, topological pharmacophore atom pairs, and topological pharmacophore atom triplets
  • Generation of 2D fingerprints with atom types corresponding to atomic invariants, DREIDING, E-state, functional class, MMFF94, SLogP, SYBYL, TPSA and UFF
  • Similarity searching and calculation of similarity matrices using available 2D fingerprints
  • Listing properties of elements in the periodic table, amino acids, and nucleic acids
  • Exporting data from relational database tables into text files

The command line Python scripts based on RDKit provide functionality for the following tasks:

  • Calculation of molecular descriptors
  • Comparison of 3D molecules based on RMSD and shape
  • Conversion between different molecular file formats
  • Enumeration of compound libraries and stereoisomers
  • Filtering molecules using SMARTS, PAINS, and names of functional groups
  • Generation of graph and atomic molecular frameworks
  • Generation of images for molecules
  • Performing structure minimization and conformation generation based on distance geometry and forcefields
  • Picking and clustering molecules based on 2D fingerprints and various clustering methodologies
  • Removal of duplicate molecules

These invaluable scripts can be used in other applications; I've written a Vortex script that uses them.


“Found in Translation”: Predicting Outcomes of Complex Organic Chemistry Reactions


An interesting paper uses 1,808,938 reactions from the patent literature as a training set to build a model to predict reactions.

There is an intuitive analogy of an organic chemist's understanding of a compound and a language speaker's understanding of a word. Consequently, it is possible to introduce the basic concepts and analyze potential impacts of linguistic analysis to the world of organic chemistry. In this work, we cast the reaction prediction task as a translation problem by introducing a template-free sequence-to-sequence model, trained end-to-end and fully data-driven. We propose a novel way of tokenization, which is arbitrarily extensible with reaction information. With this approach, we demonstrate results superior to the state-of-the-art solution by a significant margin on the top-1 accuracy. Specifically, our approach achieves an accuracy of 80.1% without relying on auxiliary knowledge such as reaction templates. Also, 66.4% accuracy is reached on a larger and noisier dataset.
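Treating reactions as sentences means the reaction SMILES first has to be split into tokens. The sketch below is a simplified atom-wise tokenizer of my own, shown only to illustrate the idea; the regular expression is an assumption and is not the tokenization scheme described in the paper.

```python
import re

# Simplified atom-wise pattern (an assumption for illustration):
# bracket atoms, two-letter halogens, organic-subset atoms, aromatic atoms,
# two-digit ring closures, single digits, then bonds/branches/separators.
TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#+\-\\/:.~@*>])"
)


def tokenize(smiles):
    """Split a (reaction) SMILES string into tokens, failing loudly on leftovers."""
    tokens = TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("untokenizable characters in %r" % smiles)
    return tokens


if __name__ == "__main__":
    # Acetyl chloride + ethanol >> ethyl acetate
    print(" ".join(tokenize("CC(=O)Cl.OCC>>CC(=O)OCC")))
```

The space-separated token string is the sort of input a sequence-to-sequence model would consume, with reactants as the source "sentence" and products as the target.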

There is also a brief video describing the work.


RDKit conformer generation script


At Pharmacelera we have written a Python script to generate conformations with RDKit and made it available here.

Conformer generation is one of the first and most important steps in most ligand based experiments, particularly when the ligand’s 3D structure is unknown. For example, the quality of the conformers could affect the results of virtual screening experiments.


RDKit warning


I just saw this message on the RDKit mailing list and I thought I'd flag it.

I've noticed a problem with anaconda python on the Mac. This may also be a problem on linux, but I haven't tested that yet.

Due to some changes in the way the anaconda team is doing python builds, the most recent conda python builds seem to no longer work with the RDKit. The symptom is an error message like "Fatal Python error: PyThreadState_Get: no current thread" when you try to import the rdkit.

I've observed this for the newest 3.5 (3.5.4-hf91e95415) and 3.6 (3.6.2-hd0bf7f115) builds. A workaround is to downgrade to 3.5.3 (conda install python=3.5.3) or 3.6.1 (conda install python=3.6.1).
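If you want to check a machine quickly, a small sketch like the one below compares the running interpreter against the two releases named in the message and prints the matching downgrade command. The mapping is taken from the message above; note that the problem was tied to specific conda builds, so a match on the version number alone is only a hint, and the function name is my own.

```python
import sys

# Releases mentioned in the mailing-list message, mapped to the suggested fallback.
# Assumption: matching on (major, minor, micro) only; the build hash is not checked.
WORKAROUND = {
    (3, 5, 4): "3.5.3",
    (3, 6, 2): "3.6.1",
}


def suggested_downgrade(version_info=sys.version_info):
    """Return the conda command suggested in the message, or None if not affected."""
    fallback = WORKAROUND.get(tuple(version_info[:3]))
    if fallback is None:
        return None
    return "conda install python=" + fallback


if __name__ == "__main__":
    print(suggested_downgrade() or "This Python release is not one of those mentioned.")
```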


RDKit and Python3


Greg Landrum posted the following to the RDKit users list, and since a couple of the Jupyter Notebooks I've published make extensive use of RDKit, I thought I'd flag it.

As many of you are no doubt aware, the Python community plans to discontinue support for Python 2 in 2020. A growing number of projects in the Scientific Python stack are making the same transition and have made that explicit here:

I will be adding the RDKit to this list. The RDKit will switch to support only Python 3 by 2020. At some point between now and then - likely during the 2018.09 release cycle - we will create a maintenance branch for Python 2 that will continue to get bug fixes but will no longer have new Python features added. This branch will be maintained, and we will keep doing Python 2 builds, until 2020 when official Python 2 support ends.

Additionally, starting during the 2018.03 release cycle we will accept contributions for new features that are not compatible with Python 2 as long as those features are implemented in such a way that they don't break existing Python 2 code (more on this later). This will allow members of the RDKit community who have made the switch to Python 3 to start making use of the new features of the language in their RDKit contributions.

If you have not made the switch yet to Python 3: please read the web page I link to above and take a look at the list of projects that have committed to transition. The switch from Python 2 to Python 3 isn't always easy, but it's not getting any easier with time and you have a few years to complete it. There are a lot of online resources available to help.

Best Regards, -greg

The list of projects that will be making the transition so far includes: IPython, Jupyter notebook, pandas, Matplotlib, SymPy, Astropy, Software Carpentry, SunPy, xonsh, scikit-bio, PyStan, Axelrod, osBrain, PyMeasure, rpy2, PyMC3, FEniCS, An Introduction to Applied Bioinformatics, music21, QIIME, Altair, gala, cual-id, CIS


Conformer generation


The generation of multiple conformations is an important step in a number of operations from input to ab initio calculations to providing input files for docking studies. A recent paper compared seven freely available conformer ensemble generators: Balloon (two different algorithms), the RDKit standard conformer ensemble generator, the Experimental-Torsion basic Knowledge Distance Geometry (ETKDG) algorithm, Confab, Frog2 and Multiconf-DOCK DOI, and also provided a dataset of ligand conformations taken from the PDB.

A recent Twitter discussion involving Greg Landrum and David Koes prompted Greg to publish a blog post describing conformation generation within RDKit. The post compares using distance geometry to select diverse conformations versus an approach that combines distance geometry with experimental torsion-angle preferences obtained from small-molecule crystallographic data (ETKDG). He also looks at the impact of force-field minimisation.
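For anyone who wants to try this, here is a minimal sketch of ETKDG conformer generation with optional force-field clean-up, using RDKit's Python API. It is not the code from Greg's post; the function name, default conformer count and seed are my own choices, and the import guard simply makes the function return None when RDKit is not installed.

```python
try:
    from rdkit import Chem
    from rdkit.Chem import AllChem
except ImportError:  # RDKit missing: generate_conformers() then returns None
    Chem = None


def generate_conformers(smiles, n_confs=10, seed=42, minimise=False):
    """Embed up to n_confs ETKDG conformers; optionally relax them with MMFF.

    Returns (molecule, list of conformer ids), or None if RDKit is missing
    or the SMILES cannot be parsed.
    """
    if Chem is None:
        return None
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = Chem.AddHs(mol)        # hydrogens are needed for sensible 3D geometry
    params = AllChem.ETKDG()     # distance geometry plus torsion-angle preferences
    params.randomSeed = seed     # fixed seed so repeated runs give the same result
    conf_ids = list(AllChem.EmbedMultipleConfs(mol, n_confs, params))
    if minimise:
        AllChem.MMFFOptimizeMoleculeConfs(mol)  # minimise every embedded conformer
    return mol, conf_ids


if __name__ == "__main__":
    result = generate_conformers("CCCC", n_confs=5, minimise=True)
    if result is not None:
        print(len(result[1]), "conformers generated")
```

Whether the minimisation step is worth the extra time is exactly the question the blog post examines, so it is left as a switch here.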

A really interesting read with code provided.


RDKit and Conda install of postgres cartridge on Mac OS


There has been an interesting discussion about installing rdkit-postgresql95 on Mac OS X on the RDKit mailing list and I thought it might be of wider interest.

Here's the resolution of the difficulties I was having installing rdkit-postgresql95 on Mac OS X. The problem turned out to be that the package originally posted used Py3.5, and I'm still using 2.7. I may change to 3.5 at some point, but Greg was kind enough to add a 2.7 version of the package.

So, the following invocations work to set up rdkit with the cartridge in a new env on Mac OS X. I'm on El Capitan, by the way, and for clarity, I've not tested the installation, but only checked that it completed successfully.

conda create -n rdk1 -c rdkit rdkit
. activate rdk1
conda install -c greglandrum rdkit-postgresql95

(The last command also installs postgresql 9.5.4-0.)


iPython Notebook issue


I’ve just been made aware of an issue with one of the calculated properties iPython Notebooks.

In the latest update to pandas, the relevant piece of the API was restructured for 0.18.1 and the "format" module was moved from pandas.core to pandas.formats.

The consequence is that PandasTools now raises an error on attempting to import molecules into a data frame.

from rdkit.Chem import PandasTools
df = PandasTools.LoadSDF("demo.sdf")

AttributeError                          Traceback (most recent call last)
/Users/philopon/mysrc/python/mordred/.direnv/python-3.5.1/lib/python3.5/site-packages/IPython/core/ in __call__(self, obj)
    341             method = _safe_get_formatter_method(obj, self.print_method)
    342             if method is not None:
--> 343                 return method()
    344             return None
    345         else:

/Users/philopon/mysrc/python/mordred/.direnv/python-3.5.1/lib/python3.5/site-packages/pandas/core/ in _repr_html_(self)
    567             return self.to_html(max_rows=max_rows, max_cols=max_cols,
--> 568                                 show_dimensions=show_dimensions, notebook=True)
    569         else:
    570             return None

/usr/local/Cellar/rdkit-python/2016.03.1/lib/python3.5/site-packages/rdkit/Chem/ in patchPandasHTMLrepr(self, **kwargs)
    129   Patched default escaping of HTML control characters to allow molecule image rendering dataframes
    130   '''
--> 131   formatter = pd.core.format.DataFrameFormatter(self,buf=None,columns=None,col_space=None,colSpace=None,header=True,index=True,
    132                                               na_rep='NaN',formatters=None,float_format=None,sparsify=None,index_names=True,
    133                                                justify = None, force_unicode=None,bold_rows=True,classes=None,escape=False)

AttributeError: module 'pandas.core' has no attribute 'format'

At the moment the only solution is to make sure you are using pandas version 0.18.0:

pip uninstall pandas
pip install pandas==0.18.0
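Code that has to survive a module being moved like this can probe for the possible locations instead of hard-coding one. The helper below is a generic sketch of that idea, not the fix adopted in RDKit; the function name is my own, and the two dotted paths are just the locations mentioned above (other pandas versions may use neither).

```python
import importlib


def first_importable(candidates):
    """Return (name, module) for the first dotted path that imports, else (None, None)."""
    for name in candidates:
        try:
            return name, importlib.import_module(name)
        except ImportError:  # also covers ModuleNotFoundError
            continue
    return None, None


if __name__ == "__main__":
    # Newer location first, then the older one named in the traceback.
    name, module = first_importable(["pandas.formats.format", "pandas.core.format"])
    print("format module found at:", name)
```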


SAR visualization with RDKit


One of the issues for machine learning models in helping understand structure-activity relationships (SAR) is providing a nice chemist-friendly visualisation. This excellent blog post provides a description of how to colour code the parts of molecules that are predicted to contribute to an activity.



RDKit updated


RDKit has been updated.

If you used Homebrew to install RDKit as described here, updating is very simple:

brew update
brew upgrade rdkit

You can check which version you have installed using

MacPro> python
Python 2.7.11 (default, Dec 23 2015, 16:11:50) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from rdkit import rdBase
>>> print rdBase.rdkitVersion


iPython Notebook to calc physicochemical properties


I've been making increasing use of iPython notebooks, both as a way to perform calculations and as a way of cataloguing the work that I've been doing. One thing I seem to be doing quite regularly is calculating physicochemical properties for libraries of compounds and then creating a trellis of plots to show each of the calculated properties. In the past I've done this with a series of AppleScripts using several applications. This seemed an ideal task to try out using an iPython notebook.




Chemical similarity search in MongoDB


MongoDB (from "humongous") is an open-source, document-oriented database.

Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster.

As you might expect, chemical searching is not something that is traditionally supported, but there have been a couple of blog articles describing initial efforts, and there is now a detailed step-by-step description available. The post describes the implementation of chemical similarity searching using MongoDB and RDKit fingerprints; it also has some initial comparisons with the more traditional SQL implementation using the RDKit PostgreSQL cartridge.
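The core of a fingerprint similarity search is small enough to sketch in plain Python. Below, fingerprints are represented as sets of on-bit positions and compared with the Tanimoto coefficient, with a standard bit-count bound used to skip candidates that cannot reach the threshold. This is an illustration of the general technique rather than the code from the post; in a database the bound would typically become a range query on a stored bit count. The function names and the example library are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two collections of on-bit positions."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def bit_count_bounds(n_query_bits, threshold):
    """Range of bit counts a fingerprint needs in order to reach the threshold.

    For Tanimoto >= T against a query with n bits, a candidate must have
    between T * n and n / T bits set. threshold must be greater than zero.
    """
    return n_query_bits * threshold, n_query_bits / threshold


def similarity_search(query, library, threshold=0.7):
    """Return (name, score) hits sorted by decreasing similarity."""
    low, high = bit_count_bounds(len(query), threshold)
    eps = 1e-9  # tolerance so floating-point rounding never prunes a valid hit
    hits = []
    for name, bits in library.items():
        if not (low - eps <= len(bits) <= high + eps):
            continue  # cannot reach the threshold, skip the full comparison
        score = tanimoto(query, bits)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical fingerprints given as sets of on-bit positions
    library = {"a": {1, 2, 3, 4}, "b": {1, 2, 3}, "c": {7, 8, 9, 10}, "d": {1}}
    print(similarity_search({1, 2, 3, 4}, library))
```

The pruning step never changes the result; it only avoids computing scores that are guaranteed to fall below the threshold.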


FMCS 1.0 - Find Maximum Common Substructure

Andrew Dalke has just released fmcs-1.0. It finds a maximum common substructure of two or more structures. Some of the features are:

  • handles 1,000s of structures
  • several different atom and bond comparison schemes
  • modifiers to require ring bonds only match ring bonds, or that incomplete rings are not allowed in the MCS
  • user-defined atom class typing through isotope labels (SMILES) or through an SD tag field
  • uses an exact solution to find a maximum common substructure
  • reports the current best solution if the timeout is reached

The software is distributed under the 2-clause BSD license and available for no charge from

You must have the Python bindings to RDKit in order to run fmcs.

Usage details are in the README, shown also in the project page at: