A recent paper, "The Catch-22 of Predicting hERG Blockade Using Publicly Accessible Bioactivity Data" (DOI), described a classification model for hERG activity. I was delighted to see that all the datasets used in the study, including the training and external datasets, and the models generated from them were provided on GitHub (https://github.com/AGPreissner/Publications) as individual data files (CSV) and Python Jupyter notebooks, respectively.
The models were downloaded and the Random Forest Jupyter notebooks (using RDKit) were modified to save the generated predictive model using pickle. Another Jupyter notebook was then created to load the saved model, so that it does not need to be rebuilt each time. This notebook was exported as a Python script to allow command-line access, and Vortex scripts were created that let the user run the model within Vortex, import the results, and view the most significant features.
All models and scripts are available for download.
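The save-once, load-many pattern described above can be sketched roughly as follows. This is only an illustration, not the code from the notebooks: the helper names (`save_model`, `load_model`, `load_or_build`) and the file name are hypothetical, and a plain dict stands in for the trained model. A fitted scikit-learn random forest could be passed to the same helpers.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialise a trained model to disk with pickle."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Load a previously pickled model (only unpickle files you trust)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def load_or_build(path, build):
    """Return the cached model if the file exists, otherwise build and cache it."""
    path = Path(path)
    if path.exists():
        return load_model(path)
    model = build()
    save_model(model, path)
    return model


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file name; the dict is a stand-in for a real estimator.
        cache = Path(tmp) / "herg_rf.pkl"
        model = load_or_build(cache, lambda: {"n_trees": 100})
        print(model)
```

The second and later calls to `load_or_build` skip the expensive training step, which is the point of pickling the model in the first place.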
Greg Landrum's ICCS 2018 presentation on slideshare
The details of some of the projects taking part in the Google Summer of Code are now online here https://summerofcode.withgoogle.com/organizations/6513013473935360/ under the Open Chemistry header.
Really interesting work includes 3-D coordinate generation, standardising fingerprint APIs, a framework for molecular validation and standardization, and molecular dynamics in Avogadro.
Good luck to all that are taking part!!
I just saw this on the RDKit email circulation list and since I know a number of readers use RDKit I thought I'd mention it.
When we do the beta for the 2018.03.1 release we're going to switch the C++ backend to use modern C++ (=C++11). For people who can't switch to use that code, we will continue to provide bug fixes for the 2017.09 release for at least another 6 months.
This should only affect people who need to build the RDKit C++ code themselves. If you use a binary version of the RDKit like the ones available inside of Anaconda Python or KNIME, this change should have no impact upon you.
It looks like we're almost there. Hopefully we will be able to do a beta of the 2018.03 release by the end of the week.
I've posted about SAMSON a couple of times and it just keeps getting better and better.
SAMSON is a novel software platform for computational nanoscience. Rapidly build models of nanotubes, proteins, and complex nanosystems. Run interactive simulations to simulate chemical reactions, bend graphene sheets, (un)fold proteins. SAMSON's generic architecture makes it suitable for material science, life science, physics, electronics, chemistry, and even education. SAMSON is developed by the NANO-D group at INRIA, and means "Software for Adaptive Modeling and Simulation Of Nanosystems".
A recent blog post highlights the use of RDKit in SAMSON.
In this post I will present the RDKit-SMILES Manager module that I integrated in the SAMSON platform. As some of you know, RDKit is an open source toolkit for cheminformatics which is widely used in bioinformatics research. One of its features is the conversion of molecules from their SMILES code to 2D and 3D structures. Thanks to the new SAMSON Element, it is now possible to use these features in the SAMSON platform. SMILES code files (.smi) or text files (.txt) containing several SMILES codes can be read using the import button.
The new module allows you to import a file containing SMILES strings, generate 2D depictions, and by right-clicking on these images, you can open, generate the 3D structure in SAMSON or save the image as png or svg.
It is also possible to run substructure searching using SMARTS.
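For readers who want to try the same operations directly in RDKit, outside SAMSON, here is a minimal sketch. It is not the SAMSON module's code; the function names are hypothetical, and the input is assumed to hold one SMILES per line, optionally followed by a name.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def read_smiles(lines):
    """Parse SMILES strings (one per line), skipping blanks and unparsable entries."""
    mols = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        # Anything after the first whitespace is treated as a name and ignored.
        mol = Chem.MolFromSmiles(text.split()[0])
        if mol is not None:
            mols.append(mol)
    return mols


def substructure_hits(mols, smarts):
    """Return canonical SMILES of the molecules that match a SMARTS query."""
    query = Chem.MolFromSmarts(smarts)
    if query is None:
        raise ValueError("invalid SMARTS: " + smarts)
    return [Chem.MolToSmiles(mol) for mol in mols if mol.HasSubstructMatch(query)]


def to_3d(mol, seed=42):
    """Return a copy with explicit hydrogens and one embedded 3D conformer."""
    mol_h = Chem.AddHs(mol)
    # EmbedMolecule returns the new conformer id, or -1 if embedding failed.
    if AllChem.EmbedMolecule(mol_h, randomSeed=seed) == -1:
        raise RuntimeError("3D embedding failed")
    return mol_h


if __name__ == "__main__":
    mols = read_smiles(["CCO ethanol", "c1ccccc1 benzene", "CC(=O)O acetic_acid"])
    print(substructure_hits(mols, "[OX2H]"))  # molecules containing a hydroxyl group
    print(Chem.MolToMolBlock(to_3d(mols[0])))
```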
An interesting paper on chemrxiv DOI
Matched Molecular Pair Analysis (MMPA) enables the automated and systematic compilation of medicinal chemistry rules from compound/property datasets. Here we present mmpdb, an open source Matched Molecular Pair (MMP) platform to create, compile, store, retrieve, and use MMP rules. mmpdb is suitable for the large datasets typically found in pharmaceutical and agrochemical companies and provides new algorithms for fragment canonicalization and stereochemistry handling. The platform is written in Python and based on the RDKit toolkit. It is freely available from https://github.com/rdkit/mmpdb
There are a number of interesting projects being undertaken in this year's Google Summer of Code.
If you know of any students that might be interested then perhaps point them to the Open Chemistry Project.
The Open Chemistry project is a collection of open source, cross platform libraries and applications for the exploration, analysis and generation of chemical data. The organization is an umbrella of leading projects developed by long-time collaborators and innovators in open chemistry such as the Avogadro, Open Babel, and cclib projects. These three alone have been downloaded over 700,000 times and cited in over 2,000 academic papers. Our goal is to improve the state of the art, and facilitate the open exchange of chemical data and ideas while utilizing the best technologies from quantum chemistry codes, molecular dynamics, informatics, analytics, and visualization.
There is a list of the GSoC Ideas 2018 here but of course students can add their own.
MayaChemTools is a fabulous collection of Perl and Python scripts, modules, and classes to support a variety of day-to-day computational discovery needs.
The core set of command line Perl scripts available in the current release of MayaChemTools has no external dependencies and provides functionality for the following tasks:
- Manipulation and analysis of data in SD, CSV/TSV, sequence/alignments, and PDB files
- Listing information about data in SD, CSV/TSV, Sequence/Alignments, PDB, and fingerprints files
- Calculation of a key set of physicochemical properties, such as molecular weight, hydrogen bond donors and acceptors, logP, and topological polar surface area
- Generation of 2D fingerprints corresponding to atom neighborhoods, atom types, E-state indices, extended connectivity, MACCS keys, path lengths, topological atom pairs, topological atom triplets, topological atom torsions, topological pharmacophore atom pairs, and topological pharmacophore atom triplets
- Generation of 2D fingerprints with atom types corresponding to atomic invariants, DREIDING, E-state, functional class, MMFF94, SLogP, SYBYL, TPSA and UFF
- Similarity searching and calculation of similarity matrices using available 2D fingerprints
- Listing properties of elements in the periodic table, amino acids, and nucleic acids
- Exporting data from relational database tables into text files
The command line Python scripts based on RDKit provide functionality for the following tasks:
- Calculation of molecular descriptors
- Comparison of 3D molecules based on RMSD and shape
- Conversion between different molecular file formats
- Enumeration of compound libraries and stereoisomers
- Filtering molecules using SMARTS, PAINS, and names of functional groups
- Generation of graph and atomic molecular frameworks
- Generation of images for molecules
- Performing structure minimization and conformation generation based on distance geometry and forcefields
- Picking and clustering molecules based on 2D fingerprints and various clustering methodologies
- Removal of duplicate molecules
These invaluable scripts can be used in other applications; I've written a Vortex Script that uses them.
An interesting paper uses 1,808,938 reactions from the patent literature as a training set to build a model to predict reactions.
There is an intuitive analogy of an organic chemist's understanding of a compound and a language speaker's understanding of a word. Consequently, it is possible to introduce the basic concepts and analyze potential impacts of linguistic analysis to the world of organic chemistry. In this work, we cast the reaction prediction task as a translation problem by introducing a template-free sequence-to-sequence model, trained end-to-end and fully data-driven. We propose a novel way of tokenization, which is arbitrarily extensible with reaction information. With this approach, we demonstrate results superior to the state-of-the-art solution by a significant margin on the top-1 accuracy. Specifically, our approach achieves an accuracy of 80.1% without relying on auxiliary knowledge such as reaction templates. Also, 66.4% accuracy is reached on a larger and noisier dataset.
There is also a brief video describing the work.
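To make the "reaction prediction as translation" idea a little more concrete: before a sequence-to-sequence model can be trained, each SMILES string has to be split into tokens. The sketch below is a simplified regular-expression tokenizer written for illustration; it is an assumption on my part and not the tokenization scheme proposed in the paper.

```python
import re

# Simplified token pattern (an assumption for this sketch): bracket atoms,
# two-letter halogens, two-digit ring closures, then single characters.
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#\-+\\/.:~@*>$]")


def tokenize(smiles):
    """Split a SMILES or reaction SMILES string into a list of tokens."""
    tokens = TOKEN.findall(smiles)
    # findall silently skips unknown characters, so check nothing was lost.
    if "".join(tokens) != smiles:
        raise ValueError("unrecognised characters in: " + smiles)
    return tokens


if __name__ == "__main__":
    # Reactants >> product, written as a reaction SMILES.
    print(" ".join(tokenize("CCO.CC(=O)Cl>>CC(=O)OCC")))
```

Keeping `Cl`, `Br` and bracket atoms such as `[Na+]` as single tokens means the model's vocabulary lines up with chemically meaningful units rather than raw characters.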
Pharmacelera have written a Python script to generate conformations with RDKit and made it available here.
Conformer generation is one of the first and most important steps in most ligand based experiments, particularly when the ligand’s 3D structure is unknown. For example, the quality of the conformers could affect the results of virtual screening experiments.
I just saw this message on the rdkit mailing list and I thought I'd flag it.
I've noticed a problem with anaconda python on the Mac. This may also be a problem on linux, but I haven't tested that yet.
Due to some changes in the way the anaconda team is doing python builds, the most recent conda python builds seem to no longer work with the RDKit. The symptom is an error message like "Fatal Python error: PyThreadState_Get: no current thread" when you try to import the rdkit.
I've observed this for the newest 3.5 (3.5.4-hf91e95415) and 3.6 (3.6.2-hd0bf7f115) builds. A workaround is to downgrade to 3.5.3 (conda install python=3.5.3) or 3.6.1 (conda install python=3.6.1).
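A quick way to see whether an environment is running one of the versions mentioned in that message is sketched below. Note the simplification: the report is about specific conda builds, whereas this hypothetical helper only looks at the version number, so treat a match as a hint rather than a diagnosis.

```python
import sys

# Python versions named in the mailing-list message above.
AFFECTED = {(3, 5, 4), (3, 6, 2)}
# Downgrade target suggested in the same message for each minor series.
WORKAROUND = {(3, 5): "3.5.3", (3, 6): "3.6.1"}


def conda_downgrade_hint(version=None):
    """Return the suggested conda command if `version` is affected, else None.

    `version` is a (major, minor, micro) tuple; it defaults to the running
    interpreter. Only the version number is checked, not the build string.
    """
    major, minor, micro = tuple(version or sys.version_info)[:3]
    if (major, minor, micro) not in AFFECTED:
        return None
    return "conda install python=" + WORKAROUND[(major, minor)]


if __name__ == "__main__":
    hint = conda_downgrade_hint()
    print(hint or "This Python version is not one of those reported as affected.")
```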
Greg Landrum posted the following to the RDKit users and since a couple of the Jupyter Notebooks I've published make extensive use of RDKit I thought I'd flag it.
As many of you are no doubt aware, the Python community plans to discontinue support for Python 2 in 2020. A growing number of projects in the Scientific Python stack are making the same transition and have made that explicit here: http://www.python3statement.org/
I will be adding the RDKit to this list. The RDKit will switch to support only Python 3 by 2020. At some point between now and then - likely during the 2018.09 release cycle - we will create a maintenance branch for Python 2 that will continue to get bug fixes but will no longer have new Python features added. This branch will be maintained, and we will keep doing Python 2 builds, until 2020 when official Python 2 support ends.
Additionally, starting during the 2018.03 release cycle we will accept contributions for new features that are not compatible with Python 2 as long as those features are implemented in such a way that they don't break existing Python 2 code (more on this later). This will allow members of the RDKit community who have made the switch to Python 3 to start making use of the new features of the language in their RDKit contributions.
If you have not made the switch yet to Python 3: please read the web page I link to above and take a look at the list of projects that have committed to transition. The switch from Python 2 to Python 3 isn't always easy, but it's not getting any easier with time and you have a few years to complete it. There are a lot of online resources available to help.
Best Regards, -greg
The list of projects that will be making the transition so far includes: IPython, Jupyter notebook, pandas, Matplotlib, SymPy, Astropy, Software Carpentry, SunPy, xonsh, scikit-bio, PyStan, Axelrod, osBrain, PyMeasure, rpy2, PyMC3, FEniCS, An Introduction to Applied Bioinformatics, music21, QIIME, Altair, gala, cual-id, and CIS.
The generation of multiple conformations is an important step in a number of operations from input to ab initio calculations to providing input files for docking studies. A recent paper compared seven freely available conformer ensemble generators: Balloon (two different algorithms), the RDKit standard conformer ensemble generator, the Experimental-Torsion basic Knowledge Distance Geometry (ETKDG) algorithm, Confab, Frog2 and Multiconf-DOCK DOI, and also provided a dataset of ligand conformations taken from the PDB.
A recent twitter discussion involving Greg Landrum and David Koes prompted Greg to publish a blog post describing conformation generation within RDKit. The post compares using distance geometry to select diverse conformations versus an approach that combines the distance geometry approach with experimental torsion-angle preferences obtained from small-molecule crystallographic data (ETKDG). He also looks at the impact of force-field minimisation.
A really interesting read with code provided.
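As a rough illustration of the comparison discussed in that post, the sketch below generates conformers either with plain distance geometry or with ETKDG, and optionally minimises them with MMFF. It is not Greg's code: the function name, the default of 10 conformers, and the fixed random seed are assumptions made for the example.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def generate_conformers(smiles, n_confs=10, use_etkdg=True, minimise=False, seed=42):
    """Embed conformers for a SMILES string.

    use_etkdg=True adds experimental torsion preferences and basic chemical
    knowledge to distance geometry; False uses plain distance geometry.
    Returns (molecule, conformer ids, MMFF energies or None).
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("could not parse SMILES: " + smiles)
    mol = Chem.AddHs(mol)  # explicit hydrogens give more sensible 3D geometries
    conf_ids = list(
        AllChem.EmbedMultipleConfs(
            mol,
            numConfs=n_confs,
            randomSeed=seed,
            useExpTorsionAnglePrefs=use_etkdg,
            useBasicKnowledge=use_etkdg,
        )
    )
    energies = None
    if minimise:
        # One (not_converged_flag, energy) pair is returned per conformer.
        results = AllChem.MMFFOptimizeMoleculeConfs(mol, maxIters=500)
        energies = [energy for _flag, energy in results]
    return mol, conf_ids, energies


if __name__ == "__main__":
    mol, conf_ids, energies = generate_conformers("CCCCO", n_confs=5, minimise=True)
    for conf_id, energy in zip(conf_ids, energies):
        print(conf_id, round(energy, 2))
```

Running it once with `use_etkdg=False` and once with `use_etkdg=True`, with and without `minimise`, gives the four ensembles one would need to repeat the kind of comparison the post describes.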
There has been an interesting discussion about installing rdkit-postgresql95 on Mac OS X on the rdkit mailing list and I thought it might be of wider interest.
Here's the resolution of the difficulties I was having installing rdkit-postgresql95 on Mac OS X. The problem turned out to be that the package originally posted used Py3.5, and I'm still using 2.7. I may change to 3.5 at some point, but Greg was kind enough to add a 2.7 version of the package.
So, the following invocations work to set up rdkit with the cartridge in a new env on Mac OS X. I'm on El Capitan, by the way, and for clarity, I've not tested the installation, but only checked that it completed successfully.
    conda create -n rdk1 -c rdkit rdkit
    . activate rdk1
    conda install -c greglandrum rdkit-postgresql95
(The last command also installs postgresql 9.5.4-0.)
I’ve just been made aware of an issue with one of the Calculated Properties iPython Notebooks.
In the latest update to Pandas the relevant piece of the API was restructured: as of 0.18.1 the "format" module has moved from pandas.core to pandas.formats.
The consequence is that PandasTools now raises an error on attempting to import molecules into a data frame.
    from rdkit.Chem import PandasTools
    df = PandasTools.LoadSDF("demo.sdf")

    AttributeError                            Traceback (most recent call last)
    /Users/philopon/mysrc/python/mordred/.direnv/python-3.5.1/lib/python3.5/site-packages/IPython/core/formatters.py in __call__(self, obj)
        341             method = _safe_get_formatter_method(obj, self.print_method)
        342             if method is not None:
    --> 343                 return method()
        344             return None
        345         else:

    /Users/philopon/mysrc/python/mordred/.direnv/python-3.5.1/lib/python3.5/site-packages/pandas/core/frame.py in _repr_html_(self)
        566
        567             return self.to_html(max_rows=max_rows, max_cols=max_cols,
    --> 568                                 show_dimensions=show_dimensions, notebook=True)
        569         else:
        570             return None

    /usr/local/Cellar/rdkit-python/2016.03.1/lib/python3.5/site-packages/rdkit/Chem/PandasTools.py in patchPandasHTMLrepr(self, **kwargs)
        129   Patched default escaping of HTML control characters to allow molecule image rendering dataframes
        130   '''
    --> 131   formatter = pd.core.format.DataFrameFormatter(self,buf=None,columns=None,col_space=None,colSpace=None,header=True,index=True,
        132                                                na_rep='NaN',formatters=None,float_format=None,sparsify=None,index_names=True,
        133                                                justify = None, force_unicode=None,bold_rows=True,classes=None,escape=False)

    AttributeError: module 'pandas.core' has no attribute 'format'
At the moment the only solution is to make sure you are using Pandas version 0.18.0:
    pip uninstall pandas
    pip install pandas==0.18.0
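If you want a script to cope with either layout rather than pinning the version, one option is to pick the module path from the installed pandas version. The sketch below only covers the move described above (pandas.core to pandas.formats at 0.18.1); the helper names are hypothetical, and other pandas releases may organise these modules differently.

```python
import re


def parse_version(text):
    """Turn a version string such as '0.18.1' into a tuple of integers."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def format_module_path(pandas_version):
    """Return where the 'format' module lives for the move described above."""
    if parse_version(pandas_version) >= (0, 18, 1):
        return "pandas.formats.format"
    return "pandas.core.format"


if __name__ == "__main__":
    # With pandas installed, pass pandas.__version__ instead of a literal.
    for version in ("0.18.0", "0.18.1"):
        print(version, "->", format_module_path(version))
```

The returned path could then be handed to `importlib.import_module`, although pinning pandas as shown above remains the simpler fix.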
One of the issues for machine learning models in helping understand structure activity relationships (SAR) is providing a nice chemist friendly visualisation. This excellent blog post provides a description of how to colour code the parts of molecules that are predicted to contribute to an activity.
RDKit has been updated.
If you used Homebrew to install RDKit as described here, updating is very simple:
    brew update
    brew upgrade rdkit
You can check which version you have installed using:
    MacPro> python
    Python 2.7.11 (default, Dec 23 2015, 16:11:50)
    [GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from rdkit import rdBase
    >>> print rdBase.rdkitVersion
    2016.03.1
    >>>
I've been making increasing use of iPython notebooks, both as a way to perform calculations and as a way of cataloging the work that I've been doing. One thing I seem to be doing quite regularly is calculating physicochemical properties for libraries of compounds and then creating a trellis of plots to show each of the calculated properties. In the past I've done this with a series of AppleScripts using several applications. This seemed an ideal task to try out using an iPython notebook.
MongoDB (from "humongous") is an open-source, document-oriented database.
Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster.
As you might expect, chemical searching is not something that is traditionally supported, but there have been a couple of blog articles describing initial efforts, and there is now a detailed step-by-step description available. The post describes the implementation of chemical similarity searching using MongoDB and RDKit fingerprints; it also has some initial comparisons with the more traditional SQL implementation using the RDKit PostgreSQL cartridge.
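To show what a fingerprint similarity search involves, here is a small pure-Python sketch using sets of "on" bit indices. It uses a bit-count screen before computing the Tanimoto coefficient, which is the kind of filter a document store can apply with a simple range query on a stored bit count. This is my own illustration under those assumptions, not the code from the post, and the names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_search(query, library, threshold=0.7):
    """Return (name, score) pairs with score >= threshold, best first.

    `library` maps a compound name to a set of on-bit indices; `threshold`
    must be greater than zero.
    """
    n_query = len(query)
    hits = []
    for name, bits in library.items():
        n_bits = len(bits)
        # Screen on bit counts: the score cannot reach the threshold unless
        # threshold * n_query <= n_bits <= n_query / threshold.
        if n_bits < threshold * n_query or n_bits * threshold > n_query:
            continue
        score = tanimoto(query, bits)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    library = {"a": {1, 2, 3, 4}, "b": {1, 2, 3}, "c": {1}, "d": {5, 6, 7, 8}}
    print(similarity_search({1, 2, 3, 4}, library))
```

The screen works because the Tanimoto score of two fingerprints can never exceed the ratio of the smaller bit count to the larger one, so candidates outside that range can be discarded without comparing any bits.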
Andrew Dalke has just released fmcs-1.0. It finds a maximum common substructure of two or more structures. Some of the features are:
- handles 1,000s of structures
- several different atom and bond comparison schemes
- modifiers to require ring bonds only match ring bonds, or that incomplete rings are not allowed in the MCS
- user-defined atom class typing through isotope labels (SMILES) or through an SD tag field
- uses an exact solution to find a maximum common substructure
- reports the current best solution if the timeout is reached
The software is distributed under the 2-clause BSD license and available for no charge from https://bitbucket.org/dalke/fmcs/downloads/fmcs-1.0.tar.gz
You must have the Python bindings to RDKit in order to run fmcs.
Usage details are in the README, shown also in the project page at: https://bitbucket.org/dalke/fmcs/
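For comparison, RDKit itself now includes a maximum common substructure module, `rdFMCS`, with options similar to several of the features listed above. The sketch below uses that module rather than the fmcs-1.0 package, so it is not Andrew's command-line tool; the wrapper function and its defaults are assumptions made for the example.

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS


def find_mcs(smiles_list, timeout=10, ring_matches_ring_only=False,
             complete_rings_only=False):
    """Find the maximum common substructure of two or more SMILES strings.

    Returns (SMARTS, number of atoms, number of bonds, timed_out). If the
    timeout (in seconds) is reached, the best solution found so far is
    returned and timed_out is True.
    """
    mols = [Chem.MolFromSmiles(smiles) for smiles in smiles_list]
    if any(mol is None for mol in mols):
        raise ValueError("one or more SMILES could not be parsed")
    result = rdFMCS.FindMCS(
        mols,
        timeout=timeout,
        ringMatchesRingOnly=ring_matches_ring_only,
        completeRingsOnly=complete_rings_only,
    )
    return result.smartsString, result.numAtoms, result.numBonds, result.canceled


if __name__ == "__main__":
    smarts, n_atoms, n_bonds, timed_out = find_mcs(["c1ccccc1O", "c1ccccc1N"])
    print(smarts, n_atoms, n_bonds, timed_out)
```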