datamol.io
In a recent post Pat Walters highlighted the use of molfeat in a Google Colab notebook https://colab.research.google.com/github/PatWalters/practicalcheminformaticstutorials/blob/main/mlmodels/QSARin8lines.ipynb.
I thought I'd also mention other tools available from Datamol.io https://datamol.io/#datamol.
datamol.io is an open-source toolkit that simplifies molecular processing and featurization workflows for ML scientists in drug discovery.
Cheminformatics support is all built upon the open-source toolkit RDKit https://rdkit.org. It can be installed using conda
conda install -c conda-forge datamol
Or pip
pip install datamol
The latest version (0.9) appears to need Python 3.9 and RDKit version [2022.03, 2022.09]
There is a comprehensive series of tutorials https://docs.datamol.io/stable/tutorials/The_Basics.html and extensive documentation.
License is Apache version 2.0.
If you would like to contribute details are on GitHub https://github.com/datamol-io/datamol.
Jazzy a Python library to calculate a set of atomic/molecular descriptors
Just spotted a very interesting paper "Fast calculation of hydrogen-bond strengths and free energy of hydration of small molecules" DOI.
Jazzy is a Python library that allows you to calculate a set of atomic/molecular descriptors which include the Gibbs free energy of hydration (kJ/mol), its polar/apolar components, and the hydrogen-bond strength of donor and acceptor atoms using either SMILES or MOL/SDF inputs. Jazzy is easy to use, does not require expensive hardware, and produces accurate estimations within milliseconds to seconds for drug-like molecules. The library also exposes functionalities to depict molecules with atomistic hydrogen-bond strengths in two or three dimensions.
Code is on GitHub https://github.com/AstraZeneca/jazzy
And there is a really useful cookbook with examples. https://jazzy.readthedocs.io/en/latest/cookbook.html.
MayaChemTools Updated
More updates to the superb MayaChemTools. A new command line script named PyMOLExtractSelection.py to extract an arbitrary PyMOL selection from a macromolecule and write it out to a file. In addition, the Psi4CalculateEnergy.py and Psi4PerformMinimization.py scripts have been updated to perform these calculations in solution using domain-decomposition-based continuum solvation models. These scripts rely on Psi4 interface to the DDX module to perform the calculations. Two solvation models are supported: COnductor-like Screening MOdel (COSMO) and Polarizable Continuum Model (PCM). A number of enhancements have been made to the PyMOLVisualizeMacromolecules.py script including identification of arbitrary distance contacts between heavy atoms in pocket residues and docked poses, visualization of solvents and inorganics in the pocket around docked poses, and visualization of B factor values for chains.
Structure-based searching SQLite
I've been experimenting with SQLite, a software library that provides a relational database management system. It is self-contained, serverless, and requires little or no admin.
SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. SQLite is the most used database engine in the world. SQLite is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day.
In the first tutorial I looked at using it for very fast exact lookup of chemical structures. This tutorial https://www.macinchem.org/reviews/exactsearch/exactsearch.php takes you through setting up the database, storing chemical structures as SMILES strings and then accessing it using a Jupyter Notebook.
The second tutorial https://www.macinchem.org/reviews/exactsearch/usingexactsearch.php shows how to create a python script to access from the command line, and using AppleScript to access it from ChemDraw. This allows you to get the structure for a specific identifier or check for the identifier for a drawn structure.
The third tutorial https://www.macinchem.org/reviews/exactsearch/substructuresearch.php shows how to use the fabulous Chemicalite to support high performance chemical structure-based searching of a SQLite database of over 2 million structures.
Using SQLite for exact search
I've been experimenting with SQLite, a software library that provides a relational database management system. It is self-contained, serverless, and requires little or no admin.
SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. SQLite is the most used database engine in the world. SQLite is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day.
In particular I've been looking at using it for very fast exact lookup of chemical structures. This tutorial https://www.macinchem.org/reviews/exactsearch/exactsearch.php takes you through setting up the database, storing chemical structures as SMILES strings and then accessing it using a Jupyter Notebook.
The second tutorial https://www.macinchem.org/reviews/exactsearch/usingexactsearch.php shows how to create a python script to access from the command line, and using AppleScript to access it from ChemDraw. This allows you to get the structure for a specific identifier or check for the identifier for a drawn structure.
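The exact-lookup idea can be sketched with nothing more than the sqlite3 module from the Python standard library. This is a minimal illustration rather than the code from the tutorials: the table name, column names, identifiers and records are all hypothetical, and the SMILES are assumed to have been canonicalised beforehand (for example with RDKit), since an exact string match only works if every structure is written the same way.

```python
import sqlite3

# Hypothetical example records: (identifier, SMILES). The SMILES are assumed
# to be canonical already; this sketch does no chemistry of its own.
RECORDS = [
    ("CMPD-0001", "CCO"),
    ("CMPD-0002", "c1ccccc1"),
    ("CMPD-0003", "CC(=O)O"),
]


def build_database(path=":memory:"):
    """Create the table, load the records and index the SMILES column."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS compounds ("
        "identifier TEXT PRIMARY KEY, smiles TEXT NOT NULL)"
    )
    conn.executemany("INSERT OR REPLACE INTO compounds VALUES (?, ?)", RECORDS)
    # The index is what keeps structure -> identifier lookups fast.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_smiles ON compounds (smiles)")
    conn.commit()
    return conn


def smiles_for_identifier(conn, identifier):
    """Return the stored SMILES for an identifier, or None if it is unknown."""
    row = conn.execute(
        "SELECT smiles FROM compounds WHERE identifier = ?", (identifier,)
    ).fetchone()
    return row[0] if row else None


def identifiers_for_smiles(conn, smiles):
    """Return every identifier registered for exactly this SMILES string."""
    rows = conn.execute(
        "SELECT identifier FROM compounds WHERE smiles = ? ORDER BY identifier",
        (smiles,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_database()
    print(smiles_for_identifier(conn, "CMPD-0002"))
    print(identifiers_for_smiles(conn, "CCO"))
```

The two query functions mirror the two directions mentioned above: getting the structure for an identifier, and checking for the identifier of a given structure.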
RDKit updated
The latest version of RDKit has been released. https://github.com/rdkit/rdkit/releases/tag/Release202209_1.
- The new RegistrationHash module provides one of the last pieces required to build a registration system with the RDKit.
- This release includes an initial version of a C++ implementation of the xyz2mol algorithm for assigning bonds and bond orders based on atomic positions. This work was done as part of the 2022 Google Summer of Code.
- A collection of new functionality has been added to minimallib and is now accessible from JavaScript and other programming languages.
MayaChemTools update
The awesome MayaChemTools http://www.mayachemtools.org/index.html has a couple of new additions and updates.
Two new command line scripts:
- RDKitEnumerateTautomers.py http://www.mayachemtools.org/docs/scripts/html/RDKitEnumerateTautomers.html
- RDKitStandardizeMolecules.py http://www.mayachemtools.org/docs/scripts/html/RDKitStandardizeMolecules.html
In addition, the Psi4PerformTorsionScan.py and RDKitPerformTorsionScan.py scripts have been updated to optionally filter matched torsions by atom indices for performing torsion scans. A number of enhancements have been made to PyMOLVisualizeMacromolecules.py script including visualization of docked poses.
All scripts are listed here http://www.mayachemtools.org/docs/scripts/html/index.html.
RSC CICAG Open Source Tools for Chemistry :- Scoring of shape and ESP similarity (Ester Heid)
The latest of the RSC CICAG workshops is now online https://youtu.be/Ka08REoGYvI.
Electrostatic effects along with volume restrictions play a major role in enzyme and receptor recognition. Evaluating electrostatic and shape similarities of pairs of molecules such as proposed versus known ligands can therefore be valuable indicators of prospective binding affinities. This workshop will demonstrate how to compute electrostatic and shape similarities using the open-source tool ESP-Sim github.com/hesther/espsim, doi.org/10.26434/chemrxiv-2021-sqvv9-v3. Available options for comparing electrostatics will be discussed interactively on selected examples of public datasets, along with advice on embedding and aligning molecules prior to computing similarities.
Whilst comparing molecules using 1D or 2D descriptors is well known, most molecules are three dimensional, as are biomolecule binding sites. The comparison of molecular shapes and electrostatics is particularly challenging and this workshop is a perfect introduction. Come along and you have a chance to ask questions directly.
All materials are available on GitHub https://github.com/hesther/espsim/tree/master/workshop
Matched molecular pair database generation and analysis
Matched molecular pair analysis (MMPA) is a popular structure-activity method in cheminformatics that compares the properties of two molecules that differ only by a single chemical transformation (e.g. substitution of a hydrogen atom by a chlorine atom). Because the structural difference between the two molecules is small, any experimentally observed change in a physical or biological property between the matched molecular pair could be associated with this particular molecular transformation.
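To make the idea concrete, here is a small Python sketch of the bookkeeping behind a matched-pair analysis: group pairs by transformation and average the property change. It is not how mmpdb works internally; the transformation labels and property values are invented purely for illustration, and finding the pairs in the first place (the hard part) is assumed to have been done already.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical matched pairs: (transformation, property of molecule A,
# property of molecule B). All values are made up for illustration.
PAIRS = [
    ("[H]>>Cl", 5.0, 5.5),
    ("[H]>>Cl", 6.0, 6.75),
    ("[H]>>Cl", 4.5, 4.75),
    ("[H]>>F", 5.0, 5.25),
]


def summarise_transformations(pairs):
    """Return the pair count and mean property change for each transformation."""
    deltas = defaultdict(list)
    for transformation, prop_a, prop_b in pairs:
        deltas[transformation].append(prop_b - prop_a)
    return {
        transformation: {"count": len(values), "mean_delta": mean(values)}
        for transformation, values in deltas.items()
    }


if __name__ == "__main__":
    for transformation, stats in summarise_transformations(PAIRS).items():
        print(transformation, stats)
```

A transformation seen in many pairs with a consistent shift is the kind of rule a medicinal chemist might act on; one seen only once is just an anecdote.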
Andrew Dalke has recently published open source code to support this methodology https://github.com/adalke/mmpdb/tree/v3-dev.
To install
python -m pip install mmpdb
The package has been tested on Python 3.9.
You will need a copy of the RDKit cheminformatics toolkit, available from http://rdkit.org/ , which in turn requires NumPy. You will also need SciPy, peewee, and click. The latter three are listed as dependencies in setup.cfg and should be installed automatically.
Full details are described in this publication.
A. Dalke, J. Hert, C. Kramer. mmpdb: An Open-Source Matched Molecular Pair Platform for Large Multiproperty Data Sets. J. Chem. Inf. Model., 2018, 58 (5), pp 902–910. DOI.
Additions to MayaChemTools
A couple of new scripts have been added to the excellent MayaChemTools growing collection of Perl and Python scripts, modules, and classes to support a variety of day-to-day computational discovery needs.
RDKitFilterTorsionLibraryAlerts.py - Filter torsion library alerts Direct Link
And
Psi4CalculateInteractionEnergy.py - Calculate interaction energy Direct Link.
Open Source Antibiotics Structures
The OpenSourceAntibiotics project is a consortium of researchers interested in open ways to discover and develop new, inexpensive medicines for bacterial infections. All data is in the open and anyone can contribute. Whilst all data is on the wiki, it can sometimes be tricky to link a structure to its identifier. In an effort to make these more accessible, and hopefully indexed by search engines, a page containing structures, identifiers, SMILES and InChIKey has been created.
You can view the page here https://opensourceantibiotics.github.io/murligase/CompChemTools/ForIndexing/OSA_data.html.
This page is updated nightly via a cron job. This calls a shell script that runs a Python script that reads the data from the master spreadsheet, uses RDKit to generate the images of the structures and create the html page. The shell script then uploads the html file to GitHub.
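The Python step in that chain can be sketched roughly as follows. This is an assumption-laden illustration, not the actual script: the column names and identifiers are hypothetical, the spreadsheet is represented by a small CSV string, and the RDKit image generation is left out so that the example needs only the standard library.

```python
import csv
import html
import io

# Hypothetical stand-in for the master spreadsheet, exported as CSV.
# The identifiers are invented; the column names are assumptions.
SPREADSHEET_CSV = """identifier,smiles,inchikey
OSA-EXAMPLE-1,CCO,LFQSCWFLJHTTHZ-UHFFFAOYSA-N
OSA-EXAMPLE-2,c1ccccc1,UHOVQNZJYSORNB-UHFFFAOYSA-N
"""


def build_page(csv_text):
    """Turn the spreadsheet rows into a plain HTML table that can be indexed."""
    rows = csv.DictReader(io.StringIO(csv_text))
    lines = [
        "<html><body><table>",
        "<tr><th>Identifier</th><th>SMILES</th><th>InChIKey</th></tr>",
    ]
    for row in rows:
        cells = (row["identifier"], row["smiles"], row["inchikey"])
        # Escape every value so characters such as < or & cannot break the page.
        lines.append(
            "<tr>" + "".join(f"<td>{html.escape(cell)}</td>" for cell in cells) + "</tr>"
        )
    lines.append("</table></body></html>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_page(SPREADSHEET_CSV))
```

Keeping the identifiers, SMILES and InChIKeys as plain text in the page, rather than only inside images, is what gives search engines something to index.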
Hopefully the html page will be indexed by search engines which will allow anyone to search for the structures. Please feel free to share.
New additions to MayaChemTools
There have been a couple of new additions to the fabulous list of tools and scripts on MayaChemTools.
MayaChemTools is a growing collection of Perl and Python scripts, modules, and classes to support a variety of day-to-day computational discovery needs.
o Psi4GenerateConstrainedConformers.py http://www.mayachemtools.org/docs/scripts/html/Psi4GenerateConstrainedConformers.html.
o Psi4PerformConstrainedMinimization.py http://www.mayachemtools.org/docs/scripts/html/Psi4PerformConstrainedMinimization.html.
o Psi4PerformTorsionScan.py http://www.mayachemtools.org/docs/scripts/html/Psi4PerformTorsionScan.html.
These scripts rely on the presence of Psi4 https://psicode.org/ and RDKit in your environment. In addition, the script RDKitPerformTorsionScan.py
MayaChemTools is free software; you can redistribute it and/or modify it under the terms of the GNU LGPL as published by the Free Software Foundation.
RDKit UGM 2021 save the date: 14-15 October
A message from Greg Landrum
This year's RDKit UGM is going to take place October 14 and 15. It will, unfortunately, once again be a purely virtual event. Hopefully next year we will be able to travel again and all get together in one physical location, but this year it's not possible to really plan an in-person meeting.
Since it seemed to work well last time, we'll do a combination of zoom and either discord or some other text-based chat functionality and will have two sessions per meeting day: one earlier in the day which is easier for people in Asia to attend and one later in the day which is easier for people in the Americas.
Watch out for registration link in the next week or so.
RDKit blog
If you are an RDKit user then you should bookmark Greg Landrum's RDKit blog https://greglandrum.github.io/rdkit-blog/about/. This is a new site and all the old content will be migrated in due course.
AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning
This looks very interesting DOI.
We present the open-source AiZynthFinder software that can be readily used in retrosynthetic planning. The algorithm is based on a Monte Carlo tree search that recursively breaks down a molecule to purchasable precursors. The tree search is guided by an artificial neural network policy that suggests possible precursors by utilizing a library of known reaction templates. The software is fast and can typically find a solution in less than 10 s and perform a complete search in less than 1 min.
Source code is on GitHub https://github.com/MolecularAI/aizynthfinder.
Tested under macOS Catalina
Requires RDKit, Tensorflow, graphviz
Can then be installed using PIP.
The software is licensed under the MIT license
Jupyter Notebook for docking either locally or using Colab
Here are two variations of a Jupyter Notebook to help with docking experiments. The first version runs locally and requires the user to install RDKit, OpenBabel, SMINA and py3Dmol; the second version can be run using Google Colab, so all you require is a web browser.
Molecular Similarity Search Benchmark (MssBenchmark)
This looks like it could be a very useful resource.
The Molecular Similarity Search Benchmark (MssBenchmark) is on GitHub https://github.uconn.edu/mldrugdiscovery/MssBenchmark and the benchmarks can be run on your local machine or on an HPC.
Currently supports
- Balltree
- Bruteforce/Exhaustive search
- Chemfp 1.6.1
- the standard modulo-OR-compression algorithm, or folding
- Min-Hash
- DivideSkip
- Hnsw
- Onng
- Panng
- Pynndescent
- Risc
- SW-graph
- VPtree
They also have ChEMBL and Molport as test datasets.
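For readers unfamiliar with two of the simpler entries in that list, here is a rough Python sketch of a brute-force Tanimoto search and of fingerprint folding by modulo-OR compression. It is not taken from MssBenchmark; the 8-bit fingerprints and molecule names are hypothetical, and fingerprints are stored as plain Python integers used as bit sets.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints stored as Python ints (bit sets)."""
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 0.0
    return bin(fp_a & fp_b).count("1") / union


def fold(fp, n_bits, target_bits):
    """Modulo-OR compression: bit i of the input sets bit (i % target_bits)."""
    folded = 0
    for i in range(n_bits):
        if (fp >> i) & 1:
            folded |= 1 << (i % target_bits)
    return folded


def brute_force_search(query, database, k=3):
    """Score the query against every entry and return the k best (score, name)."""
    scored = [(tanimoto(query, fp), name) for name, fp in database.items()]
    # Highest similarity first; ties are broken alphabetically by name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:k]


# Hypothetical 8-bit fingerprints; real fingerprints are typically 1024 or 2048 bits.
DATABASE = {
    "mol_a": 0b10110010,
    "mol_b": 0b10110000,
    "mol_c": 0b01001101,
}

if __name__ == "__main__":
    print(brute_force_search(0b10110010, DATABASE, k=2))
    print(bin(fold(0b10110010, n_bits=8, target_bits=4)))
```

Brute force is exact but scales linearly with database size, which is the baseline the approximate methods in the list are trying to beat. Folding trades accuracy for memory, because distinct bits can collide once they share a position.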
Requires
- ansicolors==1.1.8
- docker==2.6.1
- h5py==2.7.1
- matplotlib==2.1.0
- numpy==1.13.3
- pyyaml==3.12
- psutil==5.4.2
- scipy==1.0.0
- scikit-learn==0.19.1
- jinja2==2.10
- h5sparse==0.1.0
Online Events
The current global pandemic means that more events are moving online. Here are details of a few that have been sent to me.
Dotmatics User Symposium | Cambridge 2020 14th & 15th October Details and Registration.
KNIME Introduction to Working with Chemical Data October 12 - 16, 2020 details and registration.
Virtual RDKit UGM 6-8 October 2020 details and registration.
16th German Conference on Cheminformatics and EuroSAMPL Satellite Workshop 2-3 November 2020 details
Open Chemical Science 9 - 13 November 2020 details.
An open source chemical structure curation pipeline using RDKit
I just thought I'd flag a recent paper in Journal of Cheminformatics entitled "An open source chemical structure curation pipeline using RDKit" DOI. As anyone who has had to curate a molecular dataset knows, standardising the chemical structures is one of the absolutely key elements of the process.
A chemical curation pipeline has been developed using the open source toolkit RDKit. It comprises three components: a Checker to test the validity of chemical structures and flag any serious errors; a Standardizer which formats compounds according to defined rules and conventions and a GetParent component that removes any salts and solvents from the compound to create its parent. This pipeline has been applied to the latest version of the ChEMBL database as well as uncurated datasets from other sources to test the robustness of the process and to identify common issues in database molecular structures.
The ChEMBLStructurePipeline is freely available on GitHub https://github.com/chembl/ChEMBLStructurePipeline/releases/tag/1.0.0.
Or using Anaconda
conda install -c conda-forge chembl_structure_pipeline
New addition to MayaChemTools
I've just heard about a new addition to the superb MayaChemTools.
MayaChemTools is a growing collection of Perl and Python scripts, modules, and classes to support a variety of day-to-day computational discovery needs.
The latest addition RDKitPerformTorsionScan.py
Perform torsion scan for molecules around torsion angles specified using SMILES/SMARTS patterns. A molecule is optionally minimized before performing a torsion scan. A set of initial 3D structures are generated for a molecule by scanning the torsion angle across the specified range and updating the 3D coordinates of the molecule. A conformation ensemble is optionally generated for each 3D structure representing a specific torsion angle. The conformation with the lowest energy is selected to represent the torsion angle. An option is available to skip the generation of the conformation ensemble and simply calculate the energy for the initial 3D structure for a specific torsion angle.
The torsions are specified using SMILES or SMARTS patterns. A substructure match is performed to select torsion atoms in a molecule. The SMILES pattern match must correspond to four torsion atoms. The SMARTS patterns containing atom indices may match more than four atoms.
The beta of the 2020.03 RDKit released
The beta of the 2020.03 RDKit is now available on GitHub https://github.com/rdkit/rdkit/releases/tag/Release202003_1b1.
Backwards incompatible changes:
- Searches for equal molecules (i.e. mol1 @= mol2) in the PostgreSQL cartridge now use the do_chiral_sss option. So if do_chiral_sss is false (the default), the molecules CC(F)Cl and C[C@H](F)Cl will be considered to be equal. Previously these molecules were always considered to be different.
- Attempting to create a MolSupplier from a filename pointing to an empty file, a file that does not exist or something that is not a standard file (i.e. something like a directory) now generates an exception.
- The cmake option RDK_OPTIMIZE_NATIVE has been renamed to RDK_OPTIMIZE_POPCNT
Highlights:
- The drawings generated by the MolDraw2D objects are now significantly improved and can include simple atom and bond annotations
- An initial implementation of a modified scaffold network algorithm is now available
- A few new descriptor/fingerprint types are available - BCUTs, Morse atom fingerprints, Coulomb matrices, and MHFP and SECFP fingerprints
Plus lots of bug fixes.
Google Summer of Code Open Chemistry
The Open Chemistry Google Summer of Code will be open for proposals on March 16 2020.
Just enough time to have a look at the GSoC Ideas 2020; there are lots of opportunities to contribute and learn.
The Open Chemistry project is a collection of open source, cross platform libraries and applications for the exploration, analysis and generation of chemical data. The organization is an umbrella of projects developed by long-time collaborators and innovators in open chemistry such as the Avogadro, Open Babel, and cclib projects. These three alone have been downloaded over 1,000,000 times and cited in over 2,000 academic papers. Our goal is to improve the state of the art, and facilitate the open exchange of chemical data and ideas while utilizing the best technologies from quantum chemistry codes, molecular dynamics, informatics, analytics, and visualization.
If you are interested in contributing why not download the source code for one of the projects, have a play around and get familiar with the code.
ChemRPS a Chemical Registration and Publishing System
Whilst there are many commercial packages for creating structure searchable chemical databases, there is little in the way of Open Source packages, in particular a solution that provides a web front end. There is the RDKit PostgreSQL cartridge; however, installing PostgreSQL and building the database is probably a step too far for those unfamiliar with the use of the command line.
I recently came across ChemRPS. Whilst this uses the same RDKit PostgreSQL cartridge, together with a search engine (API) and a preconfigured webserver with register/search web pages including the structure editor Ketcher from EPAM, the installation comes as a Docker image, which should make things much easier.
The system had not been tested on a Mac so I've detailed the instructions in this review…
ChEMBL Compound Curation Pipeline
With the imminent release of ChEMBL 26 I was interested to hear about the new chemical curation pipeline that had been developed.
The pipeline includes three functions:
Check: Identifies and validates problem structures before they are added to the database
Standardize: Standardises chemical structures according to a set of predefined ChEMBL business rules
GetParent: Generates parent structures of multi-component compounds based on a set of rules and defined list of salts and solvents
The code is all on GitHub https://github.com/chembl/ChEMBLStructurePipeline and notebooks are available.
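To give a feel for what the GetParent step is doing, here is a deliberately crude, string-level Python sketch. It is emphatically not the ChEMBL implementation, which works on parsed molecules with curated rules. The exclusion list below is a tiny hypothetical one, components are compared as literal SMILES strings, and nothing is neutralised or standardised. It also assumes that no ring-closure bond spans a dot in the input SMILES.

```python
# Hypothetical, deliberately tiny exclusion list of salt and solvent components
# written as SMILES. The real pipeline uses its own curated lists and rules.
SALTS_AND_SOLVENTS = {"Cl", "[Na+]", "[K+]", "[Cl-]", "O"}


def get_parent_smiles(smiles):
    """Drop listed salt/solvent components from a multi-component SMILES.

    Components are separated by "." in SMILES. If every component is on the
    exclusion list, the input is returned unchanged so nothing is lost.
    """
    components = smiles.split(".")
    kept = [part for part in components if part not in SALTS_AND_SOLVENTS]
    if not kept:
        return smiles
    return ".".join(kept)


if __name__ == "__main__":
    # Phenethylamine hydrochloride written as two components.
    print(get_parent_smiles("NCCc1ccccc1.Cl"))
    # Sodium chloride: everything is on the list, so it is left as it is.
    print(get_parent_smiles("[Na+].[Cl-]"))
```

Literal string matching is exactly why the Standardize step matters: the same counter-ion can be drawn in several ways, and a real pipeline needs to recognise all of them.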
Interactive plots in Jupyter Notebooks updated
I've been using Jupyter notebooks for a while for a wide variety of projects.
I've been looking at ways to produce interactive plots within a Jupyter notebook, and after trying a couple of options I can now produce interactive data frames, in addition to 2D and 3D scatterplots including structures on tooltips.
Full review and the Jupyter notebook are here.
Interactive plots in Jupyter notebooks
I've been looking at ways to produce interactive plots within a Jupyter notebook and after trying a couple of options I used Plotly. This seems fairly straight-forward to use and I can produce interactive data frames, in addition to 2D and 3D scatterplots.
More details are shown here together with the jupyter notebook. It is very much a work in progress and suggestions are welcome. In particular, whilst I can get text to appear when hovering over a data point I'd be interested in ideas of how to get the structure displayed when you mouse over a point.
Ensemble learning in Cheminformatics
Yet another invaluable post on cheminformatics and machine learning: "Python package for Ensemble learning #Chemoinformatics #Scikit learn".
Ensemble learning sometimes outperforms a single model, so it is worth trying. Fortunately, ensemble learning is now very easy to use via a Python package named 'mlens'.
Install using PIP
pip install mlens
ML-Ensemble (mlens) is an open-source high performance ensemble learning package written in Python, code is available on GitHub https://github.com/flennerhag/mlens.
ML-Ensemble combines a Scikit-learn high-level API with a low-level computational graph framework to build memory efficient, maximally parallelized ensemble networks in as few lines of codes as possible.
RDKit 2019_09_1 (Q3 2019) Release
A new version of RDKit has been released https://github.com/rdkit/rdkit/releases/tag/Release201909_1.
Highlights:
- The substructure matching code is now about 30% faster. This also improves the speed of reaction matching and the FMCS code.
- A minimal JavaScript wrapper has been added as part of the core release.
- It's now possible to get information about why molecule sanitization failed.
- A flexible new molecular hashing scheme has been added.
There are however a number of backward incompatible changes detailed in the documents.
Also the old MolHash code should be considered deprecated. This release introduces a more flexible alternative.
Binaries have been uploaded to anaconda.org (https://anaconda.org/rdkit). The available conda binaries for this release are:
- Linux 64bit: python 3.6, 3.7
- Mac OS 64bit: python 3.6, 3.7
- Windows 64bit: python 3.6, 3.7
Some things that will be finished over the next couple of days:
- The conda build scripts will be updated to reflect the new version
- The homebrew script
Installing RDKit using Homebrew
I just saw this message on the RDKit users message board which offers a method to install RDKit using Homebrew. I use Anaconda to install RDKit so I've not tested it.
Recently, I updated the brew install recipe for rdkit on Mac. The biggest change is that boost and boost-python's versions were pinned down, so that the brew install recipe should be much more reproducible than before. Here is a fail-safe way to install rdkit with it (with Python wrappers, and InChI support):
I've added the instructions to the Cheminformatics on a Mac page as an alternative to using Anaconda to install RDKit.
The RDKit is an open source toolkit for cheminformatics, 2D and 3D molecular operations, descriptor generation for machine learning, etc.
Crowdfunding software development
Some time ago I wrote a piece on my thoughts on scientific software development. I got a lot of very positive feedback, and one of the comments about not knowing about available cheminformatics toolkits led me to create a page on open source toolkits. However, this did not really address the underlying problem of how to fund specialist scientific software.
Which is why I was intrigued to hear about Andrew Dalke's efforts to crowdfund the development of open source cheminformatics software.
This is an experiment to see if a crowdfunding consortium can be used to fund the matched molecular pair program “mmpdb”. The deadline to join is 1 February 2020!
The project is mmpdb; initial work was described in an article in JCIM "mmpdb: An Open-Source Matched Molecular Pair Platform for Large Multiproperty Data Sets" DOI.
Here we present mmpdb, an open-source matched molecular pair (MMP) platform to create, compile, store, retrieve, and use MMP rules. mmpdb is suitable for the large data sets typically found in pharmaceutical and agrochemical companies and provides new algorithms for fragment canonicalization and stereochemistry handling. The platform is written in Python and based on the RDKit toolkit.
Go over to the project page http://mmpdb.dalkescientific.com to find out more and if you can contribute please do, and also please share the link. He will be talking at the RDKit UGM #rdkitugm2019 and the presentation will probably be online later.
Determining the Amino Acids in a collection of peptides
I've recently become interested in comparing the amino-acid composition of peptides, for example cyclic versus linear peptides, or brain penetrant versus non-penetrant. I had a look around but could not find any tools that did this; in particular I wanted to include any non-proteinogenic amino-acids.
This tutorial provides a means to analyse many thousands of peptides using Vortex.
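For small sets the same kind of comparison can be roughed out in plain Python. The sketch below is not the Vortex script from the tutorial. It assumes each peptide has already been broken down into a list of residue names, which is what allows non-proteinogenic residues such as Orn or Aib to be counted alongside the standard ones, and the two peptide sets are hypothetical.

```python
from collections import Counter

# Hypothetical peptide sets. Each peptide is a list of residue names, so
# non-proteinogenic residues (for example "Orn" or "Aib") need no special handling.
PEPTIDES = {
    "linear": [["Ala", "Gly", "Orn", "Ala"], ["Gly", "Gly", "Leu"]],
    "cyclic": [["Aib", "Leu", "Leu", "Ala"]],
}


def composition(peptides):
    """Return the fraction of all residues in the set contributed by each residue."""
    counts = Counter(residue for peptide in peptides for residue in peptide)
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}


if __name__ == "__main__":
    for label, peptides in PEPTIDES.items():
        fractions = composition(peptides)
        # Print residues from most to least common within each set.
        for residue, fraction in sorted(fractions.items(), key=lambda kv: -kv[1]):
            print(f"{label}\t{residue}\t{fraction:.2f}")
```

Working with fractions rather than raw counts makes sets of different sizes directly comparable.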
OraRdkitCart an Oracle data cartridge
OraRdkitCart is an Oracle data cartridge/extensible index to allow substructure and similarity searching using SQL queries on tables which contain indexed chemical structures.
It uses a Java RMI server and RDKit wrappers for chemical structure handling.
The cartridge has been tested on Oracle 12C and Oracle 18C. It would be expected to run on Oracle 19C, but has not yet been tested.
Full details on GitHub https://jones-gareth.github.io/OraRdKitCart/index.html
Greg Landrum's ACS talk on RDKit
Novartis Open Source tools for Drug Discovery
I'm sure most readers of this site are aware of the Open-Source cheminformatics toolkit RDKit that was first developed in Novartis. However I wonder how many are aware of the other Open-Source tools that Novartis have supported.
You can read more about them here
The Novartis Institutes for BioMedical Research (NIBR) is pioneering new informatics tools for drug discovery. We believe in the power of open-sourced, global collaboration for the greater good. Join us to help patients worldwide.
They are available on GitHub here.
They include Habitat, an object management system; OntoBrowser, a tool to manage ontologies and controlled terminologies; YAP, an extensible parallel framework written in Python using OpenMPI libraries; and GridVar, a jQuery plugin that visualises multi-dimensional datasets as layers organised in a row-column format.
An interactive RDKit widget for Jupyter: a first pass
This looks like it could be very interesting.
A blog post by Greg Landrum describes a widget for displaying molecules in a Jupyter Notebook, where you can select atoms and have the selection propagated to Python.
This is basic, but I think it's a decent start towards something that could be really useful. Interested? Have suggestions (ideally accompanied by code!) on how to improve it? If it looks like this is actually likely to be used, I will figure out how to create a standalone nbwidget out of this and create a separate github repo for it.
Looks like a useful tool for selecting bonds for conformational analysis, selecting bonds for creating a Ramachandran plot, selecting groups for bioisosteric replacement……
Sounds like Greg is looking for input.
Jupyter notebook to look at molecular similarity
I was recently asked for a tool to compare the similarity of a list of molecules with every other molecule in the list. I suspect there may be commercial tools to do this but for small numbers of compounds it is easy to visualise in a Jupyter notebook using RDKit.
Read more here, MolecularSimilarityNotebook
Openforcefield
The Open Force Field Initiative is an open source, open science, and open data approach to better force fields. All the code is on GitHub and they also provide highly curated datasets.
The idea is to enable molecular mechanics on small and macromolecules jointly using open and freely available software.
A recent blog post from Peter Schmidtke caught my eye.
Recently a few updates of the openforcefield toolkit came out … a game changer, as you’ll see.
The work investigated whether the 768 fragments from the XChem fragment library at Diamond can be parametrised with the new version of Open Force Field (0.4) and how they behave after a simple minimisation.
In short, all fragments technically pass the parametrisation and minimisation step, and this was supported by visual inspection.
All the code is on GitHub.
NextMove open source MolHash
MolHash is a command-line application and programming library for generating hashes from molecular structures. This section gives an overview of each of the most useful hash functions in turn. The user should find it straightforward to add additional hash functions, or tweak the existing ones.
The source code is available on GitHub https://github.com/nextmovesoftware/molhash.
CMAKE, RDKit and Boost are required.
There are detailed compilation and installation instructions on GitHub, but I got several errors asking where RDKit was etc.
Fortunately, thanks to Matt, you can now install using conda
conda install -c mcs07 -c conda-forge molhash
Once installed you can check it is working by typing this in the Terminal
MacPro:username$ molhash -help
usage: molhash [options] <infile> [<outfile>]
Use a hyphen for <infile> to read from stdin
options:
-a Process all the molecule (and not just the single largest component)
-sa Suppress atom stereo
-sb Suppress bond stereo
-sh Suppress explicit hydrogens
-si Suppress isotopes
-sm Suppress atom maps
-t Store titles only
hash type:
-g anonymous graph [default]
-e element graph
-s canonical smiles
-m Murcko scaffold
-mf molecular formula
-ab atom and bond counts
-dv degree vector
-me mesomer
-ht hetatom tautomer
-hp hetatom protomer
-rp redox-pair
-ri regioisomer
-nq net charge
An example of usage
MacPro:username$ echo "c1ccccc1C(=O)Cl" | molhash -mf -
C7H5ClO c1ccc(cc1)C(=O)Cl
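One obvious use for these hashes is grouping structures that share a hash value. Here is a short Python sketch of that step, assuming output lines of the form shown above (the hash, some whitespace, then the SMILES). Only the first sample line comes from the run above; the other two were written by hand for illustration and are not real molhash output.

```python
from collections import defaultdict

# The first line is the output shown above. The other two are hand-written,
# hypothetical lines in the same assumed "hash <whitespace> SMILES" layout.
SAMPLE_OUTPUT = """C7H5ClO c1ccc(cc1)C(=O)Cl
C7H5ClO c1ccc(c(c1)C=O)Cl
C2H6O CCO
"""


def group_by_hash(lines):
    """Collect the SMILES that share each hash value."""
    groups = defaultdict(list)
    for line in lines:
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        hash_value, smiles = parts
        groups[hash_value].append(smiles.strip())
    return dict(groups)


if __name__ == "__main__":
    for hash_value, members in group_by_hash(SAMPLE_OUTPUT.splitlines()).items():
        print(hash_value, len(members), members)
```

With the molecular formula hash this simply groups isomers, but the same grouping works for any of the hash types listed in the help text.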
End of the line for Python 2
Just a reminder that support for Python 2.7 will end on Jan 1 2020 (there will be no 2.8). All major scientific packages now support Python 3.x and there will be no further updates to the Python 2.x versions.
An increasing number of projects have pledged to drop support for Python 2.7 no later than 2020. These include pandas, RDKit, IPython, Matplotlib, NumPy, SciPy, BioPython, Psi4, scikit-learn, Tensorflow, Jupyter notebook and many more.
Time to update those old scripts and Jupyter notebooks.
HELM notation in Jupyter Notebook
I was recently asked for a way to visualise HELM notation.
HELM (Hierarchical Editing Language for Macromolecules) enables the representation of a wide range of biomolecules such as proteins, nucleotides, antibody drug conjugates etc. whose size and complexity render existing small-molecule and sequence-based informatics methodologies impractical or unusable.
RDKit provides limited support for HELM notation (currently peptides only), and a simple Jupyter Notebook provides an easy interface, as shown here.
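For readers who just want to try the RDKit side of this, the following is a minimal sketch rather than the notebook itself. The tripeptide HELM string is an arbitrary example of mine, and it assumes a reasonably recent RDKit build.

```python
from rdkit import Chem

# Arbitrary example: the tripeptide Ala-Gly-Cys written in HELM notation.
helm = "PEPTIDE1{A.G.C}$$$$"

mol = Chem.MolFromHELM(helm)    # returns None if the string cannot be parsed
print(Chem.MolToSmiles(mol))    # the same peptide as a SMILES string
print(Chem.MolToSequence(mol))  # one-letter sequence read back from the molecule
```

In a Jupyter Notebook, leaving mol as the last expression in a cell should display the 2D depiction.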
A review of alvaDesc
alvaDesc is a desktop tool for the calculation of a wide range of molecular descriptors and a number of molecular fingerprints from https://www.alvascience.com. alvaDesc can be used to determine over 5000 different descriptors (the full list is here).
It can be accessed via the command line or via a GUI.
A Quick look at Flare and Python
I recently wrote a review of Flare Version 2, a recent extension to the Cresset portfolio that introduced Electrostatic Complementarity (EC), i.e. a comparison of electrostatics on both the small-molecule ligand and the target protein. In addition, Flare version 2 includes a new Python API that allows users to automate tasks by scripting and to integrate with other Python packages such as the RDKit cheminformatics toolkit, Python modules for graphing and statistics (NumPy, SciPy, Matplotlib), and Jupyter notebooks. It is this aspect of Flare that is the subject of this review.
Chembience updated
Update to RDKit 2018.09.2 and Postgres 10.7.
Chembience is a Docker based platform supporting the fast development of chemoinformatics-centric web applications and microservices. It creates a clean separation between your scientific web service implementation and any host-specific or infrastructure-related configuration requirements.
Update to MayaChemTools
I just heard that the following command line scripts, available as part of the MayaChemTools package, now implement multiprocessing functionality.
- RDKitCalculateMolecularDescriptors.py
- RDKitCalculatePartialCharges.py
- RDKitGenerateConformers.py
- RDKitFilterChEMBLAlerts.py
- RDKitFilterPAINS.py
- RDKitPerformMinimization.py
- RDKitRemoveSalts.py
- RDKitSearchSMARTS.py
New release of MayaChemTools
A new release of MayaChemTools is now available. It comprises a fantastic collection of Perl and Python scripts, modules, and classes to support a variety of day-to-day computational discovery needs.
The core set of command line Perl scripts available in the current release of MayaChemTools has no external dependencies and provides functionality for the following tasks:
- Manipulation and analysis of data in SD, CSV/TSV, sequence/alignments, and PDB files
- Listing information about data in SD, CSV/TSV, Sequence/Alignments, PDB, and fingerprints files
- Calculation of a key set of physicochemical properties, such as molecular weight, hydrogen bond donors and acceptors, logP, and topological polar surface area
- Generation of 2D fingerprints corresponding to atom neighborhoods, atom types, E-state indices, extended connectivity, MACCS keys, path lengths, topological atom pairs, topological atom triplets, topological atom torsions, topological pharmacophore atom pairs, and topological pharmacophore atom triplets
- Generation of 2D fingerprints with atom types corresponding to atomic invariants, DREIDING, E-state, functional class, MMFF94, SLogP, SYBYL, TPSA and UFF
- Similarity searching and calculation of similarity matrices using available 2D fingerprints
- Listing properties of elements in the periodic table, amino acids, and nucleic acids
- Exporting data from relational database tables into text files
The command line Python scripts based on RDKit provide functionality for the following tasks:
- Calculation of molecular descriptors and partial charges
- Comparison of 3D molecules based on RMSD and shape
- Conversion between different molecular file formats
- Enumeration of compound libraries and stereoisomers
- Filtering molecules using SMARTS, PAINS, and names of functional groups
- Generation of graph and atomic molecular frameworks
- Generation of images for molecules
- Performing structure minimization and conformation generation based on distance geometry and forcefields
- Performing R group decomposition
- Picking and clustering molecules based on 2D fingerprints and various clustering methodologies
- Removal of duplicate molecules and salts from molecules
The command line Python scripts based on PyMOL provide functionality for the following tasks:
- Aligning macromolecules
- Splitting macromolecules into chains and ligands
- Listing information about macromolecules
- Calculation of physicochemical properties
- Comparison of macromolecules based on RMSD
- Conversion between different ligand file formats
- Mutating amino acids and nucleic acids
- Generating Ramachandran plots
- Visualizing X-ray electron density and cryo-EM density
- Visualizing macromolecules in terms of chains, ligands, and ligand binding pockets
- Visualizing cavities and pockets in macromolecules
- Visualizing macromolecular interfaces
- Visualizing surface and buried residues in macromolecules
GuacaMol, benchmarking models.
Comparison of different algorithms is an under-researched area; this publication looks like a useful starting point.
GuacaMol: Benchmarking Models for De Novo Molecular Design
De novo design seeks to generate molecules with required property profiles by virtual design-make-test cycles. With the emergence of deep learning and neural generative models in many application areas, models for molecular design based on neural networks appeared recently and show promising results. However, the new models have not been profiled on consistent tasks, and comparative studies to well-established algorithms have only seldom been performed. To standardize the assessment of both classical and neural models for de novo molecular design, we propose an evaluation framework, GuacaMol, based on a suite of standardized benchmarks. The benchmark tasks encompass measuring the fidelity of the models to reproduce the property distribution of the training sets, the ability to generate novel molecules, the exploration and exploitation of chemical space, and a variety of single and multi-objective optimization tasks. The benchmarking framework is available as an open-source Python package.
Source code: https://github.com/BenevolentAI/guacamol.
The easiest way to install guacamol is with pip:
pip install git+https://github.com/BenevolentAI/guacamol.git#egg=guacamol --process-dependency-links
guacamol requires the RDKit library (version 2018.09.1.0 or newer).
An automated framework for NMR chemical shift calculations of small organic molecules
A recent paper in the Journal of Cheminformatics describes "An automated framework for NMR chemical shift calculations of small organic molecules" DOI.
As an alternative, we introduce the in silico Chemical Library Engine (ISiCLE) NMR chemical shift module to accurately and automatically calculate NMR chemical shifts of small organic molecules through use of quantum chemical calculations. ISiCLE performs density functional theory (DFT)-based calculations for predicting chemical properties—specifically NMR chemical shifts in this manuscript—via the open source, high-performance computational chemistry software, NWChem.
ISiCLE is available from GitHub https://github.com/pnnl/isicle or can be installed using Conda (with required dependencies):
conda create -n isicle -c bioconda -c openbabel -c rdkit -c ambermd python=3.6.1 openbabel rdkit ambertools snakemake numpy pandas yaml statsmodels
In addition, ensure the following third-party software is installed and added to your PATH:
cxcalc (license required from ChemAxon, Marvin)
NWChem http://www.nwchem-sw.org/index.php/Download.
ISiCLE is implemented using the Snakemake workflow management system, enabling scalability, portability, provenance, fault tolerance, and automatic job restarting. Snakemake provides a readable Python-based workflow definition language and execution environment that scales, without modification, from single-core workstations to compute clusters through as-available job queuing based on a task dependency graph.
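The "task dependency graph" idea is easy to see in a few lines of plain Python. The sketch below is not Snakemake and not how ISiCLE is written; it simply uses the standard library graphlib module (Python 3.9+) to work out which tasks could run together. The task names are hypothetical, not actual ISiCLE rules.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks: each one maps to the set of tasks it depends on.
dependencies = {
    "prepare_geometry": set(),
    "shielding_solvent_a": {"prepare_geometry"},
    "shielding_solvent_b": {"prepare_geometry"},
    "collect_shifts": {"shielding_solvent_a", "shielding_solvent_b"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

batches = []
while sorter.is_active():
    # Tasks whose dependencies have all finished; these could run in parallel.
    ready = sorted(sorter.get_ready())
    batches.append(ready)
    sorter.done(*ready)  # a real scheduler would execute the tasks before this

print(batches)
```

The two shielding tasks land in the same batch because neither depends on the other, which is the property a workflow engine exploits when it spreads jobs over a cluster.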
There are more details on Snakemake here.
I've added ISiCLE to the Spectroscopy Page.
How to contribute to RDKit
I just noticed that Greg Landrum has posted a page on how to contribute to RDKit. https://github.com/rdkit/rdkit/wiki/HowToContribute.
There are many ways to contribute; you don't have to be a Python or C++ developer. Simply being an active user, asking questions and contributing solutions helps other users. Improving the documentation is always a great place for newcomers to start, particularly highlighting things that are not as clear as they could be.
I've also added the link to the Toolkits page.
Install RDKit using Conda
Just highlighted on the RDKit email list, you can install RDKit using conda.
https://anaconda.org/conda-forge/rdkit
RDKit is a collection of cheminformatics and machine-learning software written in C++ and Python.
Other cheminformatics toolkits are described here, and details on how to install a wide range of cheminformatics tools on a Mac are given here.
Installing Cheminformatics packages on a Mac
A while back I wrote a very popular page describing how to install a wide variety of cheminformatics packages on a Mac. Since then there have been some changes to Homebrew which mean that a few of the scientific applications are no longer available, so I've decided to rewrite the page to cover installing the missing packages using Anaconda.
I've also included a list of quick demos so you can check everything is working as expected.
Packages include:
- OpenBabel
- RDKit
- cdk
- chemspot
- indigo
- inchi
- opsin
- osra
- pymol
- oddt
The page also covers gfortran and a selection of developer tools.
Chembience
Chembience is a Docker-based platform intended for the fast development of chemoinformatics-centric web applications and micro-services based on RDKit. It supports a clean separation of your scientific web service implementation work from any infrastructure-related configuration requirements.
At its current development stage, Chembience supports three base types of application (App) containers: (1) a Django/Django REST framework-based App container which is specifically suited for the development of web-based Python applications, (2) a Python shell-based App container which allows for the execution of script-based Python applications, and (3) a Jupyter-based App container which lets you run Jupyter notebooks (currently only a Python kernel is supported).
Updated Conda
I've been checking a few things since I updated. One thing that was immediately apparent is that the similarity maps in RDKit are much nicer, as you can see from the output of the hERG prediction.
Feel like I got something for free.
Accessing a Jupyter Notebook hERG model from Vortex
A recent paper "The Catch-22 of Predicting hERG Blockade Using Publicly Accessible Bioactivity Data" DOI described a classification model for hERG activity. I was delighted to see that all the datasets used in the study, including the training and external datasets, and the models generated using these datasets were provided as individual data files (CSV) and Python Jupyter notebooks, respectively, on GitHub (https://github.com/AGPreissner/Publications).
The models were downloaded and the Random Forest Jupyter Notebooks (using RDKit) were modified to store the generated predictive model using pickle. Another Jupyter notebook was then created to load the stored model, so it does not need to be rebuilt each time. This notebook was exported as a Python script to allow command-line access, and Vortex scripts were created that allow the user to run the model within Vortex, import the results and view the most significant features.
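The save-once, reload-later pattern looks roughly like this. It is a minimal sketch using only the standard library: the dictionary is a stand-in for the fitted Random Forest (any picklable object can take its place), and the file name and predict helper are hypothetical, not taken from the published notebooks.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a trained model; in practice this would be the fitted classifier.
model = {"kind": "stand-in", "cutoff": 0.5}


def predict(stored_model, values):
    """Toy prediction: 1 if the value reaches the stored cutoff, else 0."""
    return [int(v >= stored_model["cutoff"]) for v in values]


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"  # hypothetical file name

    # First notebook: build the model once and store it.
    with open(model_path, "wb") as handle:
        pickle.dump(model, handle)

    # Second notebook or command-line script: reload without rebuilding.
    with open(model_path, "rb") as handle:
        restored = pickle.load(handle)

print(predict(restored, [0.2, 0.7]))
```

One caution: unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.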
All models and scripts are available for download.
How Do You Build and Validate 1500 Models and What Can You Learn from Them?
Greg Landrum's ICCS 2018 presentation on slideshare
Google Summer of Code, Open Chemistry Projects
The details of some of the projects taking part in the Google Summer of Code are now online here https://summerofcode.withgoogle.com/organizations/6513013473935360/ under the Open Chemistry header.
Really interesting work includes 3-D coordinate generation, standardising fingerprint APIs, a framework for molecular validation and standardization, and molecular dynamics in Avogadro.
Good luck to all that are taking part!!
RDKit code changes
I just saw this on the RDKit email circulation list and since I know a number of readers use RDKit I thought I'd mention it.
When we do the beta for the 2018.03.1 release we're going to switch the C++ backend to use modern C++ (=C++11). For people who can't switch to use that code, we will continue to provide bug fixes for the 2017.09 release for at least another 6 months.
This should only affect people who need to build the RDKit C++ code themselves. If you use a binary version of the RDKit like the ones available inside of Anaconda Python or KNIME, this change should have no impact upon you.
It looks like we're almost there. Hopefully we will be able to do a beta of the 2018.03 release by the end of the week.
RDKit in SAMSON
I've posted about Samson a couple of times and it just keeps getting better and better.
SAMSON is a novel software platform for computational nanoscience. Rapidly build models of nanotubes, proteins, and complex nanosystems. Run interactive simulations to simulate chemical reactions, bend graphene sheets, (un)fold proteins. SAMSON's generic architecture makes it suitable for material science, life science, physics, electronics, chemistry, and even education. SAMSON is developed by the NANO-D group at INRIA, and means "Software for Adaptive Modeling and Simulation Of Nanosystems".
A recent blog post highlights the use of RDKit in Samson.
In this post I will present you the RDKit-SMILES Manager module that I integrated in the SAMSON platform. As some of you know, RDKit is an open source toolkit for cheminformatics which is widely used in the bioinformatics research. One of its features is the conversion of molecules from their SMILES code to a 2D and 3D structures. Thanks to the new SAMSON Element, it is now possible to use these features in the SAMSON platform. SMILES code files (.smi) or text files (.txt) containing several SMILES codes can be read using the import button.
The new module allows you to import a file containing SMILES strings and generate 2D depictions. By right-clicking on these images you can open or generate the 3D structure in SAMSON, or save the image as PNG or SVG.
It is also possible to run substructure searching using SMARTS.
mmpdb: An Open Source Matched Molecular Pair Platform for Large Multi-Property Datasets
An interesting paper on ChemRxiv DOI
Matched Molecular Pair Analysis (MMPA) enables the automated and systematic compilation of medicinal chemistry rules from compound/property datasets. Here we present mmpdb, an open source Matched Molecular Pair (MMP) platform to create, compile, store, retrieve, and use MMP rules. mmpdb is suitable for the large datasets typically found in pharmaceutical and agrochemical companies and provides new algorithms for fragment canonicalization and stereochemistry handling. The platform is written in Python and based on the RDKit toolkit. It is freely available from https://github.com/rdkit/mmpdb
Google Summer of Code: Open Chemistry
There are a number of interesting projects being undertaken in this year's Google Summer of Code.
If you know of any students that might be interested then perhaps point them to the Open Chemistry Project.
The Open Chemistry project is a collection of open source, cross platform libraries and applications for the exploration, analysis and generation of chemical data. The organization is an umbrella of leading projects developed by long-time collaborators and innovators in open chemistry such as the Avogadro, Open Babel, and cclib projects. These three alone have been downloaded over 700,000 times and cited in over 2,000 academic papers. Our goal is to improve the state of the art, and facilitate the open exchange of chemical data and ideas while utilizing the best technologies from quantum chemistry codes, molecular dynamics, informatics, analytics, and visualization.
There is a list of the GSoC Ideas 2018 here, but of course students can add their own.
MayaChemTools
MayaChemTools is a fabulous collection of Perl and Python scripts, modules, and classes to support a variety of day-to-day computational discovery needs.
The core set of command line Perl scripts available in the current release of MayaChemTools has no external dependencies and provides functionality for the following tasks:
- Manipulation and analysis of data in SD, CSV/TSV, sequence/alignments, and PDB files
- Listing information about data in SD, CSV/TSV, Sequence/Alignments, PDB, and fingerprints files
- Calculation of a key set of physicochemical properties, such as molecular weight, hydrogen bond donors and acceptors, logP, and topological polar surface area
- Generation of 2D fingerprints corresponding to atom neighborhoods, atom types, E-state indices, extended connectivity, MACCS keys, path lengths, topological atom pairs, topological atom triplets, topological atom torsions, topological pharmacophore atom pairs, and topological pharmacophore atom triplets
- Generation of 2D fingerprints with atom types corresponding to atomic invariants, DREIDING, E-state, functional class, MMFF94, SLogP, SYBYL, TPSA and UFF
- Similarity searching and calculation of similarity matrices using available 2D fingerprints
- Listing properties of elements in the periodic table, amino acids, and nucleic acids
- Exporting data from relational database tables into text files
The command line Python scripts based on RDKit provide functionality for the following tasks:
- Calculation of molecular descriptors
- Comparison of 3D molecules based on RMSD and shape
- Conversion between different molecular file formats
- Enumeration of compound libraries and stereoisomers
- Filtering molecules using SMARTS, PAINS, and names of functional groups
- Generation of graph and atomic molecular frameworks
- Generation of images for molecules
- Performing structure minimization and conformation generation based on distance geometry and forcefields
- Picking and clustering molecules based on 2D fingerprints and various clustering methodologies
- Removal of duplicate molecules
These invaluable scripts can be used in other applications; I've written a Vortex script that uses them.
“Found in Translation”: Predicting Outcomes of Complex Organic Chemistry Reactions
An interesting paper uses 1,808,938 reactions from the patent literature as a training set to build a model to predict reactions.
There is an intuitive analogy of an organic chemist's understanding of a compound and a language speaker's understanding of a word. Consequently, it is possible to introduce the basic concepts and analyze potential impacts of linguistic analysis to the world of organic chemistry. In this work, we cast the reaction prediction task as a translation problem by introducing a template-free sequence-to-sequence model, trained end-to-end and fully data-driven. We propose a novel way of tokenization, which is arbitrarily extensible with reaction information. With this approach, we demonstrate results superior to the state-of-the-art solution by a significant margin on the top-1 accuracy. Specifically, our approach achieves an accuracy of 80.1% without relying on auxiliary knowledge such as reaction templates. Also, 66.4% accuracy is reached on a larger and noisier dataset.
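To make the "reaction as a sentence" idea concrete, here is a rough sketch of splitting a reaction SMILES into tokens with a regular expression. The pattern is a simplified illustration and should not be taken as the tokenization scheme used in the paper; the example reaction (acetyl chloride plus ethanol giving ethyl acetate) is also just an illustration.

```python
import re

TOKEN_PATTERN = re.compile(
    r"\[[^\]]+\]"   # bracket atoms such as [NH4+] or [C@@H]
    r"|Br|Cl"       # two-letter elements that must not be split
    r"|%\d{2}"      # two-digit ring closures such as %12
    r"|."           # anything else is a single-character token
)


def tokenize(smiles):
    """Split a (reaction) SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


reaction = "CC(=O)Cl.OCC>>CC(=O)OCC"
tokens = tokenize(reaction)
print(" ".join(tokens))
```

Joining the tokens without spaces gives back the original string, so nothing is lost; a sequence-to-sequence model would then be trained to "translate" the reactant tokens into the product tokens.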
There is also a brief video describing the work.
RDKit conformer generation script
At Pharmacelera we have written a Python script to generate conformations with RDKit and made it available here.
Conformer generation is one of the first and most important steps in most ligand based experiments, particularly when the ligand’s 3D structure is unknown. For example, the quality of the conformers could affect the results of virtual screening experiments.
RDKit warning
I just saw this message on the RDKit mailing list and I thought I'd flag it.
I've noticed a problem with anaconda python on the Mac. This may also be a problem on linux, but I haven't tested that yet.
Due to some changes in the way the anaconda team is doing python builds, the most recent conda python builds seem to no longer work with the RDKit. The symptom is an error message like "Fatal Python error: PyThreadState_Get: no current thread" when you try to import the rdkit.
I've observed this for the newest 3.5 (3.5.4-hf91e95415) and 3.6 (3.6.2-hd0bf7f115) builds. A workaround is to downgrade to 3.5.3 (conda install python=3.5.3) or 3.6.1 (conda install python=3.6.1).
RDKit and Python3
Greg Landrum posted the following to the RDKit users list, and since a couple of the Jupyter Notebooks I've published make extensive use of RDKit I thought I'd flag it.
As many of you are no doubt aware, the Python community plans to discontinue support for Python 2 in 2020. A growing number of projects in the Scientific Python stack are making the same transition and have made that explicit here: http://www.python3statement.org/
I will be adding the RDKit to this list. The RDKit will switch to support only Python 3 by 2020. At some point between now and then - likely during the 2018.09 release cycle - we will create a maintenance branch for Python 2 that will continue to get bug fixes but will no longer have new Python features added. This branch will be maintained, and we will keep doing Python 2 builds, until 2020 when official Python 2 support ends.
Additionally, starting during the 2018.03 release cycle we will accept contributions for new features that are not compatible with Python 2 as long as those features are implemented in such a way that they don't break existing Python 2 code (more on this later). This will allow members of the RDKit community who have made the switch to Python 3 to start making use of the new features of the language in their RDKit contributions.
If you have not made the switch yet to Python 3: please read the web page I link to above and take a look at the list of projects that have committed to transition. The switch from Python 2 to Python 3 isn't always easy, but it's not getting any easier with time and you have a few years to complete it. There are a lot of online resources available to help.
Best Regards, -greg
The list of projects that will be making the transition so far includes: IPython, Jupyter notebook, pandas, Matplotlib, SymPy, Astropy, Software Carpentry, SunPy, xonsh, scikit-bio, PyStan, Axelrod, osBrain, PyMeasure, rpy2, PyMC3, FEniCS, An Introduction to Applied Bioinformatics, music21, QIIME, Altair, gala, cual-id, CIS.
Conformer generation
The generation of multiple conformations is an important step in a number of operations from input to ab initio calculations to providing input files for docking studies. A recent paper compared seven freely available conformer ensemble generators: Balloon (two different algorithms), the RDKit standard conformer ensemble generator, the Experimental-Torsion basic Knowledge Distance Geometry (ETKDG) algorithm, Confab, Frog2 and Multiconf-DOCK DOI, and also provided a dataset of ligand conformations taken from the PDB.
A recent twitter discussion involving Greg Landrum and David Koes prompted Greg to publish a blog post describing conformation generation within RDKit. The post compares using distance geometry to select diverse conformations versus an approach that combines the distance geometry approach with experimental torsion-angle preferences obtained from small-molecule crystallographic data (ETKDG). He also looks at the impact of force-field minimisation.
A really interesting read with code provided.
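For anyone who wants to try ETKDG before reading the post, the following is a minimal sketch, not the code from Greg's post. The molecule, the number of conformers and the random seed are arbitrary choices of mine.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Arbitrary example molecule; hydrogens are added before embedding.
mol = Chem.AddHs(Chem.MolFromSmiles("c1ccccc1C(=O)NCCO"))

params = AllChem.ETKDG()  # distance geometry plus experimental torsion preferences
params.randomSeed = 42    # fixed seed so the run is repeatable
conf_ids = list(AllChem.EmbedMultipleConfs(mol, 10, params))

# Optional force-field clean-up: one (not_converged, energy) pair per conformer.
results = AllChem.MMFFOptimizeMoleculeConfs(mol, maxIters=500)

print(len(conf_ids), "conformers embedded")
```

Whether the minimisation step helps is exactly the sort of question the blog post examines, so treat it here as optional.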
RDKit and Conda install of postgres cartridge on Mac OS
There has been an interesting discussion about installing rdkit-postgresql95 on Mac OS X on the rdkit mailing list and I thought it might be of wider interest.
Here's the resolution of the difficulties I was having installing rdkit-postgresql95 on Mac OS X. The problem turned out to be that the package originally posted used Py3.5, and I'm still using 2.7. I may change to 3.5 at some point, but Greg was kind enough to add a 2.7 version of the package.
So, the following invocations work to set up rdkit with the cartridge in a new env on Mac OS X. I'm on El Capitan, by the way, and for clarity, I've not tested the installation, but only checked that it completed successfully.
conda create -n rdk1 -c rdkit rdkit
. activate rdk1
conda install -c greglandrum rdkit-postgresql95
(The last command also installs postgresql 9.5.4-0.)
iPython Notebook issue
I’ve just been made aware of an issue with one of the Calculated properties iPython Notebooks.
The cause is the latest update to Pandas:
the respective piece of the pandas API got restructured for 0.18.1 and the "format" module got moved from pandas.core to pandas.formats.
The consequence is that PandasTools now raises an error on attempting to import molecules into a data frame.
from rdkit.Chem import PandasTools
df = PandasTools.LoadSDF("demo.sdf")
AttributeError Traceback (most recent call last)
/Users/philopon/mysrc/python/mordred/.direnv/python-3.5.1/lib/python3.5/site-packages/IPython/core/formatters.py in __call__(self, obj)
341 method = _safe_get_formatter_method(obj, self.print_method)
342 if method is not None:
--> 343 return method()
344 return None
345 else:
/Users/philopon/mysrc/python/mordred/.direnv/python-3.5.1/lib/python3.5/site-packages/pandas/core/frame.py in _repr_html_(self)
566
567 return self.to_html(max_rows=max_rows, max_cols=max_cols,
--> 568 show_dimensions=show_dimensions, notebook=True)
569 else:
570 return None
/usr/local/Cellar/rdkit-python/2016.03.1/lib/python3.5/site-packages/rdkit/Chem/PandasTools.py in patchPandasHTMLrepr(self, **kwargs)
129 Patched default escaping of HTML control characters to allow molecule image rendering dataframes
130 '''
--> 131 formatter = pd.core.format.DataFrameFormatter(self,buf=None,columns=None,col_space=None,colSpace=None,header=True,index=True,
132 na_rep='NaN',formatters=None,float_format=None,sparsify=None,index_names=True,
133 justify = None, force_unicode=None,bold_rows=True,classes=None,escape=False)
AttributeError: module 'pandas.core' has no attribute 'format'
At the moment the only solution is to make sure you are using Pandas version 0.18.0:
pip uninstall pandas
pip install pandas==0.18.0
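If you are not sure whether a given environment is affected, a quick check on the version string will tell you. This is a minimal sketch using only the standard library; the helper names are made up, and the cut-off comes from the report above that the module moved in 0.18.1.

```python
def version_tuple(version):
    """Turn '0.18.1' into (0, 18, 1), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)  # pad short versions such as '0.18'
    return tuple(parts)


def format_module_moved(pandas_version):
    """True for pandas releases newer than 0.18.0."""
    return version_tuple(pandas_version) > (0, 18, 0)


print(format_module_moved("0.18.1"))
print(format_module_moved("0.18.0"))
```

In a notebook you would pass in pandas.__version__ rather than a literal string.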
SAR visualization with RDKit
One of the issues for machine learning models in helping understand structure-activity relationships (SAR) is providing a nice chemist-friendly visualisation. This excellent blog post provides a description of how to colour code the parts of molecules that are predicted to contribute to an activity.
RDKit updated
RDKit has been updated.
If you used Homebrew to install RDKit as described here, updating is very simple:
brew update
brew upgrade rdkit
You can check which version you have installed using
MacPro> python
Python 2.7.11 (default, Dec 23 2015, 16:11:50)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from rdkit import rdBase
>>> print rdBase.rdkitVersion
2016.03.1
>>>
iPython Notebook to calc physicochemical properties
I've been making increasing use of iPython notebooks, both as a way to perform calculations and as a way of cataloguing the work that I've been doing. One thing I seem to be doing quite regularly is calculating physicochemical properties for libraries of compounds and then creating a trellis of plots to show each of the calculated properties. In the past I've done this with a series of AppleScripts using several applications. This seemed an ideal task to try out using an iPython notebook.
Chemical similarity search in MongoDB
MongoDB (from "humongous") is an open-source document-oriented database.
Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster.
As you might expect, chemical searching is not something that is traditionally supported, but there have been a couple of blog articles describing initial efforts, and there is now a detailed step-by-step description available. The post describes the implementation of chemical similarity searching using MongoDB and RDKit fingerprints; it also has some initial comparisons with the more traditional SQL implementation using the RDKit PostgreSQL cartridge.
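The core of any fingerprint similarity search is small enough to show in plain Python. The sketch below is not the implementation from the post and does not touch MongoDB: it assumes each document stores the "on" bits of a fingerprint as a list and scores them with the Tanimoto coefficient, the usual measure for this job. The documents and field names are made up.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two sets of 'on' fingerprint bits."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def similarity_search(query_bits, collection, threshold=0.7):
    """Return (name, score) pairs at or above the threshold, best first."""
    hits = []
    for doc in collection:
        score = tanimoto(query_bits, set(doc["bits"]))
        if score >= threshold:
            hits.append((doc["name"], score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up documents; a real collection would hold bits generated by RDKit.
collection = [
    {"name": "mol_a", "bits": [1, 5, 9, 12]},
    {"name": "mol_b", "bits": [1, 5, 9, 40]},
    {"name": "mol_c", "bits": [2, 3]},
]
hits = similarity_search({1, 5, 9, 12}, collection, threshold=0.5)
print(hits)
```

A database-backed version would push as much of this filtering as possible into the query so that only plausible candidates are scored.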
FMCS 1.0 - Find Maximum Common Substructure
Andrew Dalke has just released fmcs-1.0. It finds a maximum common substructure of two or more structures. Some of the features are:
- handles 1,000s of structures
- several different atom and bond comparison schemes
- modifiers to require ring bonds only match ring bonds, or that incomplete rings are not allowed in the MCS
- user-defined atom class typing through isotope labels (SMILES) or through an SD tag field
- uses an exact solution to find a maximum common substructure
- reports the current best solution if the timeout is reached
The software is distributed under the 2-clause BSD license and available for no charge from https://bitbucket.org/dalke/fmcs/downloads/fmcs-1.0.tar.gz
You must have the Python bindings to RDKit in order to run fmcs.
Usage details are in the README, shown also in the project page at: https://bitbucket.org/dalke/fmcs/