MayaChemTools is a collection of Perl and Python scripts, modules, and classes to support a variety of day-to-day computational discovery needs.
The core set of command line Perl scripts available in the current release of MayaChemTools has no external dependencies and provides functionality for the following tasks:
- Manipulation and analysis of data in SD, CSV/TSV, sequence/alignments, and PDB files
- Listing information about data in SD, CSV/TSV, sequence/alignments, PDB, and fingerprints files
- Calculation of a key set of physicochemical properties, such as molecular weight, hydrogen bond donors and acceptors, logP, and topological polar surface area
- Generation of 2D fingerprints corresponding to atom neighborhoods, atom types, E-state indices, extended connectivity, MACCS keys, path lengths, topological atom pairs, topological atom triplets, topological atom torsions, topological pharmacophore atom pairs, and topological pharmacophore atom triplets
- Generation of 2D fingerprints with atom types corresponding to atomic invariants, DREIDING, E-state, functional class, MMFF94, SLogP, SYBYL, TPSA and UFF
- Similarity searching and calculation of similarity matrices using available 2D fingerprints
- Listing properties of elements in the periodic table, amino acids, and nucleic acids
- Exporting data from relational database tables into text files
The command line Python scripts based on RDKit provide functionality for the following tasks:
- Calculation of molecular descriptors
- Comparison of 3D molecules based on RMSD and shape
- Conversion between different molecular file formats
- Enumeration of compound libraries and stereoisomers
- Filtering molecules using SMARTS, PAINS, and names of functional groups
- Generation of graph and atomic molecular frameworks
- Generation of images for molecules
- Performing structure minimization and conformation generation based on distance geometry and forcefields
- Picking and clustering molecules based on 2D fingerprints and various clustering methodologies
- Removal of duplicate molecules
These invaluable scripts can be used in other applications; I've written a Vortex script that uses them.
An interesting paper uses 1,808,938 reactions from the patent literature as a training set to build a model to predict reactions.
There is an intuitive analogy of an organic chemist's understanding of a compound and a language speaker's understanding of a word. Consequently, it is possible to introduce the basic concepts and analyze potential impacts of linguistic analysis to the world of organic chemistry. In this work, we cast the reaction prediction task as a translation problem by introducing a template-free sequence-to-sequence model, trained end-to-end and fully data-driven. We propose a novel way of tokenization, which is arbitrarily extensible with reaction information. With this approach, we demonstrate results superior to the state-of-the-art solution by a significant margin on the top-1 accuracy. Specifically, our approach achieves an accuracy of 80.1% without relying on auxiliary knowledge such as reaction templates. Also, 66.4% accuracy is reached on a larger and noisier dataset.
There is also a brief video describing the work.
At Pharmacelera we have written a Python script to generate conformations with RDKit and made it available here.
Conformer generation is one of the first and most important steps in most ligand based experiments, particularly when the ligand’s 3D structure is unknown. For example, the quality of the conformers could affect the results of virtual screening experiments.
I just saw this message on the rdkit mailing list and I thought I'd flag it.
I've noticed a problem with anaconda python on the Mac. This may also be a problem on linux, but I haven't tested that yet.
Due to some changes in the way the anaconda team is doing python builds, the most recent conda python builds seem to no longer work with the RDKit. The symptom is an error message like "Fatal Python error: PyThreadState_Get: no current thread" when you try to import the rdkit.
I've observed this for the newest 3.5 (3.5.4-hf91e95415) and 3.6 (3.6.2-hd0bf7f115) builds. A workaround is to downgrade to 3.5.3 (conda install python=3.5.3) or 3.6.1 (conda install python=3.6.1).
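A quick way to tell whether an interpreter is one of the releases mentioned above is to look at its version number. The sketch below is only an illustration: the helper name `is_affected_python` is made up for this note, and it flags just the two releases named in the message (3.5.4 and 3.6.2) without inspecting the conda build string.

```python
import sys

# The two Python releases named in the mailing list message above.
# Assumption: matching on the version number alone is good enough for a
# quick check; the conda build hash is not inspected.
AFFECTED_VERSIONS = {(3, 5, 4), (3, 6, 2)}


def is_affected_python(version_info=None):
    """Return True if (major, minor, micro) is one of the flagged releases."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:3]) in AFFECTED_VERSIONS


if __name__ == "__main__":
    if is_affected_python():
        print("This Python release was flagged; consider 3.5.3 or 3.6.1.")
    else:
        print("This Python release is not one of those flagged in the message.")
```

Running the check inside the conda environment that holds the RDKit shows whether the downgrade suggested in the message is worth trying.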
Greg Landrum posted the following to the RDKit users list, and since a couple of the Jupyter Notebooks I've published make extensive use of RDKit, I thought I'd flag it.
As many of you are no doubt aware, the Python community plans to discontinue support for Python 2 in 2020. A growing number of projects in the Scientific Python stack are making the same transition and have made that explicit here: http://www.python3statement.org/
I will be adding the RDKit to this list. The RDKit will switch to support only Python 3 by 2020. At some point between now and then - likely during the 2018.09 release cycle - we will create a maintenance branch for Python 2 that will continue to get bug fixes but will no longer have new Python features added. This branch will be maintained, and we will keep doing Python 2 builds, until 2020 when official Python 2 support ends.
Additionally, starting during the 2018.03 release cycle we will accept contributions for new features that are not compatible with Python 2 as long as those features are implemented in such a way that they don't break existing Python 2 code (more on this later). This will allow members of the RDKit community who have made the switch to Python 3 to start making use of the new features of the language in their RDKit contributions.
If you have not made the switch yet to Python 3: please read the web page I link to above and take a look at the list of projects that have committed to transition. The switch from Python 2 to Python 3 isn't always easy, but it's not getting any easier with time and you have a few years to complete it. There are a lot of online resources available to help.
Best Regards, -greg
The list of projects that will be making the transition so far includes: IPython, Jupyter notebook, pandas, Matplotlib, SymPy, Astropy, Software Carpentry, SunPy, xonsh, scikit-bio, PyStan, Axelrod, osBrain, PyMeasure, rpy2, PyMC3, FEniCS, An Introduction to Applied Bioinformatics, music21, QIIME, Altair, gala, cual-id, CIS.
The generation of multiple conformations is an important step in a number of operations, from input to ab initio calculations to providing input files for docking studies. A recent paper (DOI) compared seven freely available conformer ensemble generators: Balloon (two different algorithms), the RDKit standard conformer ensemble generator, the Experimental-Torsion basic Knowledge Distance Geometry (ETKDG) algorithm, Confab, Frog2 and Multiconf-DOCK, and also provided a dataset of ligand conformations taken from the PDB.
A recent twitter discussion involving Greg Landrum and David Koes prompted Greg to publish a blog post describing conformation generation within RDKit. The post compares using distance geometry to select diverse conformations versus an approach that combines the distance geometry approach with experimental torsion-angle preferences obtained from small-molecule crystallographic data (ETKDG). He also looks at the impact of force-field minimisation.
A really interesting read with code provided.
There has been an interesting discussion about installing rdkit-postgresql95 on Mac OS X on the rdkit mailing list and I thought it might be of wider interest.
Here's the resolution of the difficulties I was having installing rdkit-postgresql95 on Mac OS X. The problem turned out to be that the package originally posted used Py3.5, and I'm still using 2.7. I may change to 3.5 at some point, but Greg was kind enough to add a 2.7 version of the package.
So, the following invocations work to set up rdkit with the cartridge in a new env on Mac OS X. I'm on El Capitan, by the way, and for clarity, I've not tested the installation, but only checked that it completed successfully.
conda create -n rdk1 -c rdkit rdkit
. activate rdk1
conda install -c greglandrum rdkit-postgresql95
(The last command also installs postgresql 9.5.4-0.)
I’ve just been made aware of an issue with one of the Calculated properties IPython Notebooks.
The latest update to Pandas restructured the relevant piece of the pandas API: for 0.18.1 the "format" module was moved from pandas.core to pandas.formats.
The consequence is that PandasTools now raises an error on attempting to import molecules into a data frame.
from rdkit.Chem import PandasTools
df = PandasTools.LoadSDF("demo.sdf")

AttributeError                            Traceback (most recent call last)
/Users/philopon/mysrc/python/mordred/.direnv/python-3.5.1/lib/python3.5/site-packages/IPython/core/formatters.py in __call__(self, obj)
    341             method = _safe_get_formatter_method(obj, self.print_method)
    342             if method is not None:
--> 343                 return method()
    344             return None
    345         else:

/Users/philopon/mysrc/python/mordred/.direnv/python-3.5.1/lib/python3.5/site-packages/pandas/core/frame.py in _repr_html_(self)
    566
    567             return self.to_html(max_rows=max_rows, max_cols=max_cols,
--> 568                                 show_dimensions=show_dimensions, notebook=True)
    569         else:
    570             return None

/usr/local/Cellar/rdkit-python/2016.03.1/lib/python3.5/site-packages/rdkit/Chem/PandasTools.py in patchPandasHTMLrepr(self, **kwargs)
    129     Patched default escaping of HTML control characters to allow molecule image rendering dataframes
    130     '''
--> 131     formatter = pd.core.format.DataFrameFormatter(self,buf=None,columns=None,col_space=None,colSpace=None,header=True,index=True,
    132                                                   na_rep='NaN',formatters=None,float_format=None,sparsify=None,index_names=True,
    133                                                   justify = None, force_unicode=None,bold_rows=True,classes=None,escape=False)

AttributeError: module 'pandas.core' has no attribute 'format'
At the moment the only solution is to make sure you are using Pandas version 0.18.0:
pip uninstall pandas
pip install pandas==0.18.0
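Code that needs the formatter module could, in principle, try both locations rather than rely on a pinned pandas. The helper below is a minimal sketch of that idea and is not the fix adopted in PandasTools: `import_first_available` is a hypothetical name, and the two module paths are the ones implied by the description above (`pandas.core.format` before 0.18.1, `pandas.formats.format` from 0.18.1). Other pandas releases may place the module elsewhere.

```python
import importlib


def import_first_available(module_names):
    """Import and return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # This location does not exist in the installed version; try the next.
            continue
    raise ImportError(
        "none of these modules could be imported: %s" % ", ".join(module_names)
    )


# Candidate locations of the pandas "format" module, oldest layout first.
# Assumption: these two paths are taken from the description of the change.
PANDAS_FORMAT_CANDIDATES = ["pandas.core.format", "pandas.formats.format"]

if __name__ == "__main__":
    try:
        fmt = import_first_available(PANDAS_FORMAT_CANDIDATES)
        print("Using", fmt.__name__)
    except ImportError as err:
        print(err)
```

The helper returns whichever module imports first, so the calling code can use it without caring which pandas layout is installed.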
One of the issues for machine learning models in helping understand structure-activity relationships (SAR) is providing a nice chemist-friendly visualisation. This excellent blog post provides a description of how to colour code the parts of molecules that are predicted to contribute to an activity.
RDKit has been updated.
If you used Homebrew to install RDKit as described here, updating is very simple:
brew update
brew upgrade rdkit
You can check which version you have installed using:
MacPro> python
Python 2.7.11 (default, Dec 23 2015, 16:11:50)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from rdkit import rdBase
>>> print rdBase.rdkitVersion
2016.03.1
>>>
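Once the version string is in hand, it can be compared against the release a notebook or script expects. The sketch below is a small illustration with hypothetical helper names (`parse_rdkit_version`, `is_at_least`). It assumes the purely numeric year.month.patch form shown above, so a string carrying a development or beta suffix would need extra handling.

```python
def parse_rdkit_version(version):
    """Split a release string such as '2016.03.1' into a tuple of integers."""
    # Assumption: every dot-separated part is a plain integer.
    return tuple(int(part) for part in version.split("."))


def is_at_least(installed, required):
    """Return True if the installed release is the same as or newer than required."""
    return parse_rdkit_version(installed) >= parse_rdkit_version(required)


if __name__ == "__main__":
    # The string reported by rdBase.rdkitVersion in the session above.
    print(parse_rdkit_version("2016.03.1"))
    print(is_at_least("2016.03.1", "2015.09.2"))
```

Comparing tuples of integers avoids the pitfalls of comparing the raw strings character by character.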
I've been making increasing use of IPython notebooks, both as a way to perform calculations and as a way of cataloguing the work that I've been doing. One thing I seem to be doing quite regularly is calculating physicochemical properties for libraries of compounds and then creating a trellis of plots to show each of the calculated properties. In the past I've done this with a series of AppleScripts using several applications. This seemed an ideal task to try out using an IPython notebook.
MongoDB (from "humongous") is an open-source, document-oriented database.
Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster.
As you might expect, chemical searching is not something that is traditionally supported, but there have been a couple of blog articles describing initial efforts, and there is now a detailed step-by-step description available. The post describes the implementation of chemical similarity searching using MongoDB and RDKit fingerprints. It also has some initial comparisons with the more traditional SQL implementation using the RDKit PostgreSQL cartridge.
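To give a feel for what a fingerprint similarity search involves, here is a minimal sketch in plain Python. It is not the implementation from the post: sets of on-bit indices stand in for RDKit fingerprints, a list of dictionaries stands in for a MongoDB collection, and the names `tanimoto` and `similarity_search` are chosen for this illustration only. Tanimoto is assumed as the similarity measure because it is the usual choice for 2D fingerprints.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 0.0
    shared = len(bits_a & bits_b)
    return shared / float(len(bits_a) + len(bits_b) - shared)


def similarity_search(query_bits, documents, threshold=0.5):
    """Return (name, score) pairs scoring at or above threshold, best first."""
    hits = []
    for doc in documents:
        score = tanimoto(query_bits, doc["bits"])
        if score >= threshold:
            hits.append((doc["name"], score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy "collection": each document holds a name and a set of on-bit indices.
documents = [
    {"name": "mol_a", "bits": {1, 4, 9, 16}},
    {"name": "mol_b", "bits": {1, 4, 9, 25}},
    {"name": "mol_c", "bits": {2, 3, 5}},
]

hits = similarity_search({1, 4, 9, 16}, documents, threshold=0.5)
for name, score in hits:
    print(name, round(score, 2))
```

A database-backed version would push as much of the filtering as possible into the query itself so that only promising candidates are scored, which is where the choice of database starts to matter.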
Andrew Dalke has just released fmcs-1.0. It finds a maximum common substructure of two or more structures. Some of the features are:
- handles 1,000s of structures
- several different atom and bond comparison schemes
- modifiers to require ring bonds only match ring bonds, or that incomplete rings are not allowed in the MCS
- user-defined atom class typing through isotope labels (SMILES) or through an SD tag field
- uses an exact solution to find a maximum common substructure
- reports the current best solution if the timeout is reached
The software is distributed under the 2-clause BSD license and available for no charge from https://bitbucket.org/dalke/fmcs/downloads/fmcs-1.0.tar.gz
You must have the Python bindings to RDKit in order to run fmcs.
Usage details are in the README, shown also in the project page at: https://bitbucket.org/dalke/fmcs/