Macs in Chemistry

Insanely Great Science

Cluster mols

 

cluster_mols is a PyMOL plugin that allows the user to quickly select compounds from a virtual screen to be purchased or synthesized.

The most up-to-date version (recommended) of cluster_mols is available through BitBucket at: https://bitbucket.org/mpb21/clustermols_py/overview

This plugin has a number of required dependencies, and it is currently supported only on Linux and OS X.

Baumgartner, Matthew (2016) IMPROVING RATIONAL DRUG DESIGN BY INCORPORATING NOVEL BIOPHYSICAL INSIGHT. Doctoral Dissertation, University of Pittsburgh.


Comments

FreeSASA

 

FreeSASA is a command line tool, C-library and Python module for calculating solvent accessible surface areas (SASA).

The Read Me gives download, build and installation instructions. In addition, it details how to build the Python interface.

Simon Mitternacht (2016) FreeSASA: An open source C library for solvent accessible surface area calculations. F1000Research 5:189. DOI


Comments

Scripting PubMed searches

 

PubMed comprises more than 24 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites. NCBI also provides a number of programming tools that allow access to the information: the E-utilities are a set of server-side programs that provide a stable interface into the Entrez query and database system.

To access these data, a piece of software first posts an E-utility URL to NCBI, then retrieves the results of this posting, after which it processes the data as required. The software can thus use any computer language that can send a URL to the E-utilities server and interpret the XML response; examples of such languages are Perl, Python, Java, and C++.

A while back I wrote a Vortex script that helps with this sort of search if you have multiple terms you want to search for. I've updated this script to incorporate the changes requiring API keys to allow multiple requests to the E-utilities API, and I've highlighted where you need to add your own API key in the script. I've also tried to ensure that any query string is encoded to make it URL safe.
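As a rough illustration of the two points above (adding an API key and making the query URL safe), here is a minimal Python sketch. It is not the Vortex script itself: the function name `build_esearch_url`, the search terms, and the placeholder key are assumptions made for this example. It only builds the ESearch URLs and does not send any requests.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, api_key, retmax=20):
    """Return a URL-safe ESearch query against PubMed for one search term."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": retmax,
        "api_key": api_key,  # replace with your own NCBI API key
    }
    # urlencode percent-encodes spaces, quotes and brackets in the query.
    return ESEARCH_URL + "?" + urlencode(params)


# Hypothetical search terms and a placeholder key, for illustration only.
terms = ["kinase inhibitor", "PAINS[Title]"]
urls = [build_esearch_url(t, api_key="YOUR_API_KEY") for t in terms]
for url in urls:
    print(url)
```

Each URL can then be fetched with any HTTP client and the XML response parsed as required.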

The update is detailed more fully here….


Comments

Downloading from the RCSB Protein Data Bank using Python

 

The RCSB Protein Data Bank is an absolutely invaluable resource that provides archive-information about the 3D shapes of proteins, nucleic acids, and complex assemblies that helps scientists understand all aspects of biomedicine and agriculture, from protein synthesis to health and disease. Currently the PDB contains over 134,000 data files containing structural information on 42547 distinct protein sequences of which 37600 are human sequences. They also provide a series of tools to search, view and analyse the data.

Downloading an individual PDB file is pretty trivial and can be done from the web page as shown in the image below. They also provide a Download Tool launched as a stand-alone application using the Java Web Start protocol. The tool is downloaded locally and must then be opened. I've found this a little temperamental and had issues with Java versions and security settings.

Since I've been making extensive use of the web services to interact with RCSB I decided to explore the use of Python to download multiple files. I started off creating a Jupyter notebook using the web services provided by RCSB.

I've also used variations on this code to create a Python script and a Vortex script.
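For readers who just want the general idea, downloading several files can be sketched as below. This is not the notebook code: the helper names `pdb_download_url` and `download_pdb_files` and the output folder are assumptions for this example, and it uses the plain file download address on files.rcsb.org rather than the search web services.

```python
import os
from urllib.request import urlretrieve


def pdb_download_url(pdb_id):
    """Return the file download address for a four-character PDB ID."""
    return "https://files.rcsb.org/download/{}.pdb".format(pdb_id.upper())


def download_pdb_files(pdb_ids, out_dir="pdb_files"):
    """Download each structure into out_dir and return the local file paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for pdb_id in pdb_ids:
        target = os.path.join(out_dir, pdb_id.upper() + ".pdb")
        urlretrieve(pdb_download_url(pdb_id), target)
        paths.append(target)
    return paths


print(pdb_download_url("1crn"))
# To fetch the files (needs network access):
# download_pdb_files(["1crn", "4hhb"])
```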

Full details are here …


Comments

Interacting with the RCSB Protein Data Bank

 

The RCSB Protein Data Bank is an absolutely invaluable resource that provides archive-information about the 3D shapes of proteins, nucleic acids, and complex assemblies that helps scientists understand all aspects of biomedicine and agriculture, from protein synthesis to health and disease. Currently the PDB contains over 134,000 data files containing structural information on 42547 distinct protein sequences of which 37600 are human sequences. They also provide a series of tools to search, view and analyse the data.

The latest addition to the Hints and Tutorials page is a couple of Vortex scripts for interacting with the RCSB Protein Data Bank. Specifically, they search for PDB structures associated with a list of UniProt codes, and then search for associated information. Read more here…

Comments

Bioconda: A sustainable and comprehensive software distribution for the life sciences

 

Bioconda is a channel for the conda package manager specializing in bioinformatics software.

Bioconda supports only 64-bit Linux and Mac OSX.

Bioconda offers a collection of over 2900 software tools, which are continuously maintained, updated, and extended by a growing global community of more than 250 contributors. Bioconda improves analysis reproducibility by allowing users to define isolated environments with defined software versions, all of which are easily installed and managed without administrative privileges.

The conda package manager has recently made installing software a vastly more streamlined process. Conda is a combination of other package managers you may have encountered, such as pip, CPAN, CRAN, Bioconductor, apt-get, and homebrew. Conda is both language- and OS-agnostic, and can be used to install C/C++, Fortran, Go, R, Python, Java, etc.

You can read more details in this publication "Bioconda: A sustainable and comprehensive software distribution for the life sciences", doi.

Whilst there are a number of compilations of bioinformatics software, Bioconda looks to be by far the most comprehensive.

After installing Conda, the first step is to set up the Bioconda channel:

conda config --add channels conda-forge
conda config --add channels bioconda

Packages can then be installed using

conda install cnvkit

This installs CNVkit plus the appropriate Python and R dependencies.


Comments

SAMSON, Software for Adaptive Modeling and Simulation Of Nanosystems

 

SAMSON is a novel software platform for computational nanoscience. Rapidly build models of nanotubes, proteins, and complex nanosystems. Run interactive simulations to simulate chemical reactions, bend graphene sheets, (un)fold proteins. SAMSON's generic architecture makes it suitable for material science, life science, physics, electronics, chemistry, and even education. SAMSON is developed by the NANO-D group at INRIA, and the name stands for "Software for Adaptive Modeling and Simulation Of Nanosystems".

SAMSON has an open architecture which allows anyone to extend it - and adapt it to their needs - by downloading SAMSON Elements (modules). SAMSON Elements come in many flavors: apps, editors, controllers, models, parsers, etc., and are adapted to different application domains. SAMSON Elements help users build new models, perform calculations, run interactive or offline simulations, visualize and interpret results, and more. Add new SAMSON Elements to SAMSON straight from SAMSON Connect.

In the latest news, Python scripting is coming to SAMSON 0.7.0. Most of the SAMSON API is now exposed in Python, and this will allow you to create models and run simulations, generate movies, perform analysis and reporting, etc., directly from scripts. Python will make it even easier to integrate and pipeline SAMSON and SAMSON Elements with well-known packages from diverse fields, e.g. TensorFlow, PyRosetta, RDKit and ASE, to name a few.


Comments

RDKit conformer generation script

 

At Pharmacelera we have written a Python script to generate conformations with RDKit and made it available here.

Conformer generation is one of the first and most important steps in most ligand-based experiments, particularly when the ligand’s 3D structure is unknown. For example, the quality of the conformers could affect the results of virtual screening experiments.


Comments

RDKit warning

 

I just saw this message on the RDKit mailing list and I thought I'd flag it.

I've noticed a problem with anaconda python on the Mac. This may also be a problem on linux, but I haven't tested that yet.

Due to some changes in the way the anaconda team is doing python builds, the most recent conda python builds seem to no longer work with the RDKit. The symptom is an error message like "Fatal Python error: PyThreadState_Get: no current thread" when you try to import the rdkit.

I've observed this for the newest 3.5 (3.5.4-hf91e95415) and 3.6 (3.6.2-hd0bf7f115) builds. A workaround is to downgrade to 3.5.3 (conda install python=3.5.3) or 3.6.1 (conda install python=3.6.1).

Comments

Scoria: a Python module for manipulating 3D molecular data

 

Just catching up on reading the literature and came across this interesting Python paper in the Journal of Cheminformatics. DOI.

Scoria is useful for both analyzing molecular dynamics (MD) trajectories and molecular modeling. For example, we have used beta-version Scoria functions to create large-scale lipid-bilayer models, to construct small-molecule models with improved predicted binding affinities, to measure MD-sampled binding-pocket shapes and volumes, and to develop neural-network docking scoring functions, among other applications. As an additional example, in this manuscript we describe a trajectory-analysis Scoria script that colors the atoms of one protein chain by the frequency of their contacts with a second chain.


Comments

Predicting sites of metabolism Vortex script

 

It is really useful to have two site-of-metabolism tools available that use contrasting methodologies. FAME 2 uses a curated dataset of experimentally determined metabolism data to build a machine learning model using simple descriptors. In contrast, SMARTCyp uses precomputed activation energies from density functional theory (DFT) calculations of model compounds.

I previously wrote a script displaying the results of a SMARTCyp calculation in a webview. The first part of that script imports the smartcyp.jar; however, with each update I was finding issues, so I thought it might be better to simply treat SMARTCyp as a command line application and use subprocess to access it.

Using a similar script we can also access FAME 2.
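The general pattern of treating a tool as a command line application can be sketched in plain Python as follows. This is not the Vortex script: the helper `run_tool` is a name made up for this example, and the commented SMARTCyp command line (jar location and input file) is an assumption, so check the tool's own documentation for the exact arguments. A stand-in command is used so that the sketch runs without Java installed.

```python
import subprocess
import sys


def run_tool(command):
    """Run a command-line tool and return its standard output as text."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Assumed invocation; the jar location and input file are placeholders:
# output = run_tool(["java", "-jar", "smartcyp.jar", "molecules.sdf"])

# Stand-in command so the sketch runs without Java or SMARTCyp installed.
output = run_tool([sys.executable, "-c", "print('tool finished')"])
print(output.strip())
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.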

More details here.


Comments

chemfp 1.3 released

 

Chemfp is a set of command-line tools and a Python library for working with cheminformatics fingerprints. It can use OEChem/OEGraphSim, RDKit, or Open Babel to create fingerprints in the FPS format, and it implements a high-speed Tanimoto search.
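For readers unfamiliar with the measure, the Tanimoto similarity of two bit fingerprints is the number of bits set in both divided by the number of bits set in either. The following is a minimal pure-Python sketch of that calculation, with fingerprints stored as integers; it is not chemfp's API or implementation, and the function names and example bit patterns are assumptions for illustration.

```python
def popcount(bits):
    """Count the number of bits set to 1 in a non-negative integer."""
    return bin(bits).count("1")


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity: shared on-bits divided by total on-bits."""
    union = popcount(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return popcount(fp_a & fp_b) / union


# Two toy 4-bit fingerprints, for illustration only.
query = 0b1011
target = 0b0011
print(tanimoto(query, target))
```

The bit counting is the step that hardware POPCNT instructions accelerate in optimised implementations.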

The software is available under the MIT license. For more information see http://chemfp.com/. Documentation is available from http://chemfp.readthedocs.io/en/chemfp-1.3/ .

There are many changes over chemfp 1.1, which was the last release of the public/no-cost version of chemfp. The biggest ones are:

  • Tested against the current version of all of the toolkits

  • Added support for the Avalon and pattern fingerprints in RDKit

  • In-memory Tanimoto searches for 166-bit MACCS keys on computers with the POPCNT instruction are about 30% faster.

  • FPS loading is about 40% faster. As a result, file-based searches are about 25% faster.

  • The in-memory search algorithms in version 1.1 were parallelized with OpenMP, but the NxM k-nearest search was left out. That case is now also parallelized.

  • Some of the APIs from the commercial version were backported to 1.3, including the fingerprint writer API and functions for substructure fingerprint screening.

  • Added and improved docstrings

This release supports Python 2.7 but it no longer supports Python 2.5 or Python 2.6. The commercial version supports Python 2.7 and Python 3.5+, handles more than 4GB of fingerprint data, and has a binary fingerprint format for fast loading.

It is available from http://dalkescientific.com/releases/chemfp-1.3.tar.gz.


Comments

Accessing Jupyter Notebook model from Vortex

 

I've become a great fan of Jupyter Notebooks as a way of modelling cheminformatics data, and I've published some of the notebooks here.

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.

In the predicting AMES activity notebook I also looked at the use of pickle to store the predictive model and then access it using a Jupyter notebook without the need to rebuild the model. Whilst a notebook is a nice way to access the predictive model it might also be useful to be able to access it from other applications or from the command line.

In this tutorial we look at providing command line access to the model and then incorporating it into a Vortex script.
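The idea of storing a model with pickle and reloading it without rebuilding can be sketched as below. A small dictionary stands in for the trained model so that the example needs only the standard library; the file name, the `predict` helper, and the threshold are all assumptions for this illustration and are not taken from the tutorial.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; a real one would be e.g. a fitted classifier.
model = {"name": "toy_ames_model", "threshold": 0.5}

model_path = os.path.join(tempfile.gettempdir(), "toy_model.pkl")

# Store the model once, after it has been built.
with open(model_path, "wb") as handle:
    pickle.dump(model, handle)

# Later, possibly from another script, load it without rebuilding.
with open(model_path, "rb") as handle:
    restored = pickle.load(handle)


def predict(loaded_model, score):
    """Classify a score using the threshold stored in the loaded model."""
    return "positive" if score >= loaded_model["threshold"] else "negative"


print(predict(restored, 0.8))
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.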

Scripting Vortex 38


Comments

Versions of python modules update

 

In the last post I asked about adding version numbers. Almost immediately I got a brilliant response.

Simply install version_information, using either

pip install version_information

or

conda install version_information

Then

versions

Comments

Versions of python modules

 

I'm in the process of updating the Jupyter notebooks to Python 3 and I'm looking at what I can do to make sure other people can reproduce the results. At the moment I annotate the imported Python modules with version numbers in the Jupyter notebook. Finding the versions is a bit tedious and I was wondering if there was some way to automate this.

from rdkit import Chem #rdkit 2016.03.5
from rdkit.Chem import PandasTools
import pandas as pd #pandas==0.17.1
import pandas_ml as pdml #pandas-ml==0.4.0
from rdkit.Chem import AllChem, DataStructs
import numpy #numpy==1.12.0
from sklearn.model_selection import train_test_split #scikit-learn==0.18.1
import subprocess
from StringIO import StringIO
import pickle
import os
%matplotlib inline
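
One possible way to automate this using only the standard library is sketched below. It assumes Python 3.8 or later, where `importlib.metadata` is available; the helper name `report_versions` and the package list are choices made for this example. Note that the lookup uses the installed distribution name (for example scikit-learn), which is not always the same as the import name (sklearn).

```python
from importlib.metadata import PackageNotFoundError, version


def report_versions(package_names):
    """Map each distribution name to its installed version string."""
    report = {}
    for name in package_names:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
    return report


packages = ["rdkit", "pandas", "numpy", "scikit-learn"]
for name, ver in report_versions(packages).items():
    print("{}=={}".format(name, ver))
```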
Comments

Python tutorials for OpenMM

 

This guide is a set of Jupyter notebooks intended to help researchers already familiar with molecular dynamics simulation learn how to use OpenMM in their research and software projects.

# For Mac OS X, substitute `MacOSX` for `Linux` below
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash ./Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda
export PATH=$HOME/miniconda/bin:$PATH


conda install --yes -c omnia -c conda-forge jupyter notebook openmm mdtraj nglview

There is a detailed document describing OpenMM here

OpenMM is a set of libraries that lets programmers easily add molecular simulation features to their programs, and an “application layer” that exposes those features to end users who just want to run simulations. Instructions for installation under MacOSX are here.

OpenMM works on Mac OS X 10.7 or later. OpenCL is supported on OS X 10.10.3 or later.


Comments

Data-driven Advice for Applying Machine Learning to Bioinformatics Problems

 

A very useful paper https://arxiv.org/abs/1708.05070

Here we contribute a thorough analysis of 13 state-of-the-art, commonly used machine learning algorithms on a set of 165 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers. We present a number of statistical and visual comparisons of algorithm performance and quantify the effect of model selection and algorithm tuning for each algorithm and dataset. The analysis culminates in the recommendation of five algorithms with hyperparameters that maximize classifier performance across the tested problems, as well as general guidelines for applying machine learning to supervised classification problems.

Good to see my preferred method Random Forest close to the top of the ranking based on performance over 165 datasets.

The rankings show the strength of ensemble-based tree algorithms in generating accurate models: The first, second, and fourth-ranked algorithms belong to this class of algorithms.

All 13 ML algorithms were used as implemented in scikit-learn, a popular ML library implemented in Python.


Comments

PAINS Vortex script

 

One of the great features of the latest version of Vortex (> build 29622) is the ability to script multiple sub-structure searches using SMARTS. There are many occasions when this sort of feature is useful, for example if you want to flag molecules that contain reactive functional groups, toxicophores, or PAINS functional groups that have been shown to interfere with high-throughput screens. Vortex tutorial 24 described how to do this multi-substructure searching.

There have now been a couple of new publications describing the identification of false positives in high-throughput screening campaigns in which the binding of glutathione S-transferase (GST) to glutathione (GSH) is used for detection of GST-tagged proteins.

  • Identification of Small-Molecule Frequent Hitters of Glutathione S-Transferase–Glutathione Interaction DOI
  • Identification of Small-Molecule Frequent Hitters from AlphaScreen High-Throughput Screens DOI

There have also been some suggestions as to how some of the motifs might be interfering with the assay, as shown below.

PAINS

I've now added the additional structural motif definitions taking the total to 550 SMARTS definitions. It is perhaps worth mentioning that some of these motifs may not be an issue when using alternative screening technologies, but it may be very worthwhile to double check any molecules flagged by this script before committing significant resources to follow up.

This comment in Nature is perhaps worth noting

Academic researchers, drawn into drug discovery without appropriate guidance, are doing muddled science. When biologists identify a protein that contributes to disease, they hunt for chemical compounds that bind to the protein and affect its activity. A typical assay screens many thousands of chemicals. ‘Hits’ become tools for studying the disease, as well as starting points in the hunt for treatments. These molecules — pan-assay interference compounds, or PAINS — have defined structures, covering several classes of compound. But biologists and inexperienced chemists rarely recognize them. Instead, such compounds are reported as having promising activity against a wide variety of proteins. Time and research money are consequently wasted in attempts to optimize the activity of these compounds. Chemists make multiple analogues of apparent hits hoping to improve the ‘fit’ between protein and compound. Meanwhile, true hits with real potential are neglected.

I've updated the tutorial and the scripts for download.


Comments

A workflow for docking/virtual screening part 2

 

In the previous workflow I described docking a set of ligands with known activity into a target protein. In this workflow we will be using a set of ligands from the ZINC dataset to search for novel ligands. Once docked, the workflow moves on to finding vendors and selecting subsets for purchase.


Comments

Weekend Reading

 

A couple of things for your weekend reading ;-)

When not to use deep learning

What makes Python super popular

Google's online Python Class

Machine Learning in Python tutorial


Comments

A workflow for docking/virtual screening (updated)

 

Whilst high-throughput screening (HTS) has been the starting point for many successful drug discovery programs, the cost of screening, the lack of access to a large diverse sample collection, or the low throughput of the primary assay may preclude HTS as a starting point, and identification of a smaller selection of compounds with a higher probability of being a hit may be desired. Directed or virtual screening is a computational technique used in drug discovery research to identify potential hits for evaluation in primary assays. It involves the rapid in silico assessment of large libraries of chemical structures in order to identify those structures that are most likely to be active against a drug target. The in silico screen can be based on similarity to known ligands or on docking ligands into the desired binding site.

In this workflow I'll be looking at using docking to identify potential hits.

I've updated the description to give more information about preparing the target protein.


Comments

A Functional Group Count Script

 

I recently wrote a review of Reaction Workflows, a web-based tool that allows users to build workflows from nodes that provide inputs and outputs or perform actions, including ones to perform reaction-, scaffold-, and transform-based enumeration, all done within a web browser interface using drag and drop. Whilst you can draw input structures, one of the real strengths is the ability to import pre-categorised reagent files, e.g. acid chlorides or secondary amines. This script is intended to help with this within Vortex.

This script is a variation of the high performance sub-structure search scripts described previously; however, instead of simply flagging the presence (or absence) of a SMARTS query, it provides a count of the number of times a SMARTS query is matched within a molecule. The script uses all available cores, so it can run multiple queries in parallel and handle very large datasets. The script currently contains around 70 different SMARTS queries for both functional groups and atom counts, and I'd be happy to add any suggestions.

Read more….


Comments

RDKit and Python3

 

Greg Landrum posted the following to the RDKit users mailing list, and since a couple of the Jupyter Notebooks I've published make extensive use of RDKit I thought I'd flag it.

As many of you are no doubt aware, the Python community plans to discontinue support for Python 2 in 2020. A growing number of projects in the Scientific Python stack are making the same transition and have made that explicit here: http://www.python3statement.org/

I will be adding the RDKit to this list. The RDKit will switch to support only Python 3 by 2020. At some point between now and then - likely during the 2018.09 release cycle - we will create a maintenance branch for Python 2 that will continue to get bug fixes but will no longer have new Python features added. This branch will be maintained, and we will keep doing Python 2 builds, until 2020 when official Python 2 support ends.

Additionally, starting during the 2018.03 release cycle we will accept contributions for new features that are not compatible with Python 2 as long as those features are implemented in such a way that they don't break existing Python 2 code (more on this later). This will allow members of the RDKit community who have made the switch to Python 3 to start making use of the new features of the language in their RDKit contributions.

If you have not made the switch yet to Python 3: please read the web page I link to above and take a look at the list of projects that have committed to transition. The switch from Python 2 to Python 3 isn't always easy, but it's not getting any easier with time and you have a few years to complete it. There are a lot of online resources available to help.

Best Regards, -greg

The list of projects that will be making the transition so far includes: IPython, Jupyter notebook, pandas, Matplotlib, SymPy, Astropy, Software Carpentry, SunPy, xonsh, scikit-bio, PyStan, Axelrod, osBrain, PyMeasure, rpy2, PyMC3, FEniCS, An Introduction to Applied Bioinformatics, music21, QIIME, Altair, gala, cual-id, CIS


Comments

Getting PDB information

 

A while back I published two scripts that use UniChem, a web resource provided by the EBI. It is a 'Unified Chemical Identifier' system, designed to assist in the rapid cross-referencing of chemical structures, and their identifiers, between multiple databases.

Chambers, J., Davies, M., Gaulton, A., Hersey, A., Velankar, S., Petryszak, R., Hastings, J., Bellis, L., McGlinchey, S. and Overington, J.P. UniChem: A Unified Chemical Structure Cross-Referencing and Identifier Tracking System. Journal of Cheminformatics 2013, 5:3 (January 2013). DOI: http://dx.doi.org/10.1186/1758-2946-5-3

The first script uses the ChEMBL ID to search for other identifiers; the second script allows more flexible searching using any of the identifiers available within UniChem. One of the identifiers returned is from the PDBe (Protein Data Bank Europe) and represents the ID of the ligand in the PDB. Whilst this is interesting, it would also be very useful to have the identity of the crystal structures that contain the ligand. Fortunately PDBe provide a series of web services that can be used to interrogate the database, together with a really useful page to help build the calls.

Full details of the script are here..

There is a comprehensive listing of scripts, tips, jupyter notebooks etc here.


Comments

Psi4 1.1: An Open-Source Electronic Structure Program

 

A recent paper describes Psi4 1.1: An Open-Source Electronic Structure Program Emphasizing Automation, Advanced Libraries, and Interoperability DOI

Psi4 is an ab initio electronic structure program providing methods such as Hartree–Fock, density functional theory, configuration interaction, and coupled-cluster theory. The 1.1 release represents a major update meant to automate complex tasks, such as geometry optimization using complete-basis-set extrapolation or focal-point methods. Conversion of the top-level code to a Python module means that Psi4 can now be used in complex workflows alongside other Python tools.

Psi4 1.1 can be downloaded from here with versions supporting Python 2.7, 3.5 and 3.6.

Note the installation instructions for Mac: install Xcode via the App Store, and make sure you open Xcode and accept the license agreement after you install it.


Comments

Conformer generation

 

The generation of multiple conformations is an important step in a number of operations from input to ab initio calculations to providing input files for docking studies. A recent paper compared seven freely available conformer ensemble generators: Balloon (two different algorithms), the RDKit standard conformer ensemble generator, the Experimental-Torsion basic Knowledge Distance Geometry (ETKDG) algorithm, Confab, Frog2 and Multiconf-DOCK DOI, and also provided a dataset of ligand conformations taken from the PDB.

A recent twitter discussion involving Greg Landrum and David Koes prompted Greg to publish a blog post describing conformation generation within RDKit. The post compares using distance geometry to select diverse conformations versus an approach that combines the distance geometry approach with experimental torsion-angle preferences obtained from small-molecule crystallographic data (ETKDG). He also looks at the impact of force-field minimisation.

A really interesting read with code provided.


Comments

aRMSD: A Comprehensive Tool for Structural Analysis

 

aRMSD is an open toolbox for structural comparison between two molecules with various capabilities to explore different aspects of structural similarity and diversity. Crystallographic data provided from cif files is fully supported and the results can be rendered with the help of the vtk package.

A. Wagner, H.-J. Himmel, J. Chem. Inf. Model, 2017, 57, 428-438 DOI

Comments

Working with MOL2 Structures in DataFrames

 

A great tutorial describing how to use 'Biopandas' MOL2 DataFrames to analyze molecules conveniently.

The Tripos MOL2 format is a common format for working with small molecules.


Comments

A webinar demonstrating using Jupyter, the free iPython notebook

 

This is a recording of the March 2017 Global Health Compound Design meeting. A webinar demonstrating using Jupyter, the free iPython notebook.

https://youtu.be/XqyWctQxhNs

How to get started

Accessing Open Source Malaria data

Calculating physicochemical properties and plotting

Predicting AMES activity.



Comments

Predicting AMES activity Jupyter Notebook

 

I've been experimenting with the use of Jupyter Notebooks (aka IPython Notebooks) not only as an electronic lab notebook but also as a means to share computational models. The aim is to see how easy it would be to share a model together with the associated training data and an explanation of how the model was built and how it can be used for novel molecules.

The Ames test is a widely employed method that uses bacteria to test whether a given chemical can cause mutations in the DNA of the test organism. More formally, it is a biological assay to assess the mutagenic potential of chemical compounds. PNAS. 70 (8): 2281–5. doi

In this first notebook a random forest model to predict AMES activity is described….


Comments

Molecular Design Toolkit

 

The Molecular Design Toolkit is an open source environment that aims to seamlessly integrate molecular simulation, visualization and cloud computing. It offers access to a large and still-growing set of computational modelling methods with a science-focused Python API that can be easily installed using pip. It is ideal for building into a Jupyter notebook. The API is designed to handle both small molecules and large biomolecular structures, molecular mechanics and QM calculations.

There is a series of YouTube videos describing some of the functionality in more detail, starting with this introduction.


Comments

Several ways of scripting Name to Structure

 

Too often I come across datasets that contain chemical names or identifiers but no actual molecular structure. Recently Dan at Dotmatics suggested I look at OPSIN. There are also several web services for converting names to structures, and I've highlighted a couple of options here and described three scripts that allow them to be used from within Vortex.
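As an example of what calling such a web service looks like, the sketch below builds a request URL for the NCI/CADD Chemical Identifier Resolver. This service is chosen here purely for illustration and is not necessarily one of the options used in the Vortex scripts; the function name `name_to_smiles_url` is also an assumption. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import quote

# Address pattern of the NCI/CADD Chemical Identifier Resolver.
RESOLVER_URL = "https://cactus.nci.nih.gov/chemical/structure/{}/smiles"


def name_to_smiles_url(name):
    """Return the resolver URL that converts a chemical name to SMILES."""
    # safe="" percent-encodes every reserved character, including "/".
    return RESOLVER_URL.format(quote(name, safe=""))


print(name_to_smiles_url("acetic acid"))
```

Encoding the name matters because systematic names often contain spaces, commas and brackets.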

There are many more scripts on the Hints and Tutorials Page.


Comments

Scripting Vortex 34, analysis of categorical information

 

I often need to tag individual molecules within a dataset with a specific property, perhaps the results of clustering algorithms, the results of PAINS filtering, or Liver toxicity filters. Alternatively if you have a drug discovery project with multiple chemotypes you might want to tag particular groups of compounds as belonging to a named series to aid analysis.

A question that might then arise is “How many molecules belong to each category?”. Whilst you can see the numbers in the sidebar there is not an easy way to export the results.

Hopefully this script can help.
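Outside Vortex, the same question can be answered in a few lines of plain Python. The sketch below is not the Vortex script: the list of tags is made-up example data standing in for one column of a dataset, and an in-memory buffer stands in for the exported file.

```python
import csv
import io
from collections import Counter

# Hypothetical category tags, one per molecule, e.g. the values of a column.
tags = ["series A", "series B", "series A", "PAINS", "series A", "PAINS"]

# Count how many molecules carry each tag.
counts = Counter(tags)

# Write the counts as CSV; an in-memory buffer stands in for a file here.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["category", "count"])
for category, count in counts.most_common():
    writer.writerow([category, count])

print(buffer.getvalue())
```

Replacing the buffer with an opened file gives a table that can be loaded into a spreadsheet.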


Comments

RDKit and Conda install of postgres cartridge on Mac OS

 

There has been an interesting discussion about installing rdkit-postgresql95 on Mac OS X on the rdkit mailing list and I thought it might be of wider interest.

Here's the resolution of the difficulties I was having installing rdkit-postgresql95 on Mac OS X. The problem turned out to be that the package originally posted used Py3.5, and I'm still using 2.7. I may change to 3.5 at some point, but Greg was kind enough to add a 2.7 version of the package.

So, the following invocations work to set up rdkit with the cartridge in a new env on Mac OS X. I'm on El Capitan, by the way, and for clarity, I've not tested the installation, but only checked that it completed successfully.

conda create -n rdk1 -c rdkit rdkit
. activate rdk1
conda install -c greglandrum rdkit-postgresql95

(The last command also installs postgresql 9.5.4-0.)


Comments

JupyterLab: the next generation of the Jupyter Notebook

 

An interesting development on the Jupyter Blog

It's been a long time in the making, but today we want to start engaging our community with an early (pre-alpha) release of the next generation of the Jupyter Notebook application, which we are calling JupyterLab.

Full presentation is here


Comments

Accessing ZINC supplier information

 

ZINC is a free database of commercially-available compounds for virtual screening. ZINC contains over 100 million purchasable compounds in ready-to-dock, 3D formats. Sterling and Irwin, J. Chem. Inf. Model, 2015. This is an invaluable resource for any type of virtual screening or for anyone looking to create a physical screening or fragment collection.

Once you have done the virtual screening you will rapidly realise that the really time-consuming and tedious part now lies ahead: finding out which vendors stock a particular molecule and then ordering them. Looking up the vendor details for individual compounds is extremely tedious and so this Vortex script may be very useful.

Many more scripts, iPython notebooks and tutorials can be found here.


Comments

17th annual KDnuggets Software Data Analysis Poll

 

The results of the annual data analysis poll are in and show some interesting trends, in particular the dramatic increase in Python use.

R remains the leading tool, with 49% share (up from 46.9% in 2015), but Python usage grew faster and it almost caught up to R with 45.8% share (up from 30.3%).

Actually looking down the list I notice there is also an entry for scikit-learn, which is Python based, and if you add that in Python is now the most commonly used data analysis tool.

There was a 10% drop in the use of KNIME, and a 36% drop in the use of TIBCO Spotfire, two products used in cheminformatics.

In terms of programming languages Python is by far the most extensively used.

Python 45.8% share (was 30.3%) 51% increase
Java 16.8% share (was 14.1%) 19% increase
Unix shell/awk/gawk 10.4% share (was 8.0%) 30% increase
C/C++ 7.3% share (was 9.4%) 23% decrease
Other programming languages 6.8% share (was 5.1%) 34.1% increase
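
The increase and decrease figures in the list are relative changes in share, not differences in percentage points. A minimal sketch of the arithmetic, using only the rounded shares quoted above (the function name `relative_change` is just for illustration):

```python
def relative_change(old_share, new_share):
    """Percentage change of new_share relative to old_share."""
    return (new_share - old_share) / old_share * 100.0


# (language, 2015 share %, 2016 share %) as quoted in the list above.
shares = [
    ("Python", 30.3, 45.8),
    ("Java", 14.1, 16.8),
    ("Unix shell/awk/gawk", 8.0, 10.4),
    ("C/C++", 9.4, 7.3),
    ("Other programming languages", 5.1, 6.8),
]

for name, old, new in shares:
    print(f"{name}: {relative_change(old, new):+.1f}%")
```

The rounded shares reproduce the quoted figures for Python, Java and Unix shell. The values for C/C++ and the other languages come out about a point different from the poll's, presumably because the poll worked from unrounded shares.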

In the Big Data area Hadoop (22.1%) and Spark (21.6%) dominate.

There is a listing of data analysis tools for MacOSX here.


Comments

jupyter-docker-pymol

 

I came across jupyter-docker-pymol recently and thought I'd give it a mention. It is a container-based installation of PyMol, with interaction through the browser via ipymol and Jupyter notebook (based on jupyter/notebook).

This project uses PyMol 1.8.2.0 and Python 3



Comments

RDkit updated

 

RDkit has been updated.

If you used Homebrew to install RDkit as described here, updating is very simple:

brew update
brew upgrade rdkit

You can check which version you have installed using

MacPro> python
Python 2.7.11 (default, Dec 23 2015, 16:11:50) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from rdkit import rdBase
>>> print rdBase.rdkitVersion
2016.03.1
>>>

Comments

Dealing with Greek characters in column names

 

This is just a very quick tip when dealing with Greek characters in Vortex column names when creating a script. It may be obvious to many, but I struggled for several hours before finding the problem and a solution.

Read more…


Comments

MCPB.py: A Python Based Metal Center Parameter Builder

 

MCPB.py, a python based metal center parameter builder, has been developed to build force fields for the simulation of metal complexes employing the bonded model approach.

Pengfei Li and Kenneth M. Merz, Jr., "MCPB.py: A Python Based Metal Center Parameter Builder." J. Chem. Inf. Model., 2016, Accepted, DOI.

There is an excellent and very detailed online page describing the use of MCPB.py http://ambermd.org/tutorials/advanced/tutorial20/mcpbpy.htm.


Comments

Flexible UniChem Search

 

UniChem is a web resource provided by the EBI. It is a 'Unified Chemical Identifier' system, designed to assist in the rapid cross-referencing of chemical structures, and their identifiers, between multiple databases. UniChem currently contains data from 27 different data sources and provides links to 108,941,995 structures.

Chambers, J., Davies, M., Gaulton, A., Hersey, A., Velankar, S., Petryszak, R., Hastings, J., Bellis, L., McGlinchey, S. and Overington, J.P. UniChem: A Unified Chemical Structure Cross-Referencing and Identifier Tracking System. Journal of Cheminformatics 2013, 5:3 (January 2013). DOI: http://dx.doi.org/10.1186/1758-2946-5-3

The previous script showed how to search using a ChEMBL ID; however, one of the attractions of UniChem is that you can search with any molecule identifier if you know the corresponding datasource. This script allows the user to use any molecule identifier and then search a specified datasource using a common web service.

Read more …


Comments

Getting UniChem data from ChEMBL

 

UniChem is a web resource provided by the EBI. It is a 'Unified Chemical Identifier' system, designed to assist in the rapid cross-referencing of chemical structures, and their identifiers, between multiple databases. UniChem currently contains data from 27 different data sources and provides links to 108,941,995 structures.

Chambers, J., Davies, M., Gaulton, A., Hersey, A., Velankar, S., Petryszak, R., Hastings, J., Bellis, L., McGlinchey, S. and Overington, J.P. UniChem: A Unified Chemical Structure Cross-Referencing and Identifier Tracking System. Journal of Cheminformatics 2013, 5:3 (January 2013). DOI: http://dx.doi.org/10.1186/1758-2946-5-3

ChEMBL also provide a RESTful Web service that users can use to retrieve data from the UniChem database in a programmatic fashion.
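
As a rough illustration of what using such a web service involves, here is a minimal Python sketch that builds a lookup URL and parses a JSON reply. The base URL, the `src_compound_id/<identifier>/<source id>` path and the reply shape (a list of objects with `src_id` and `src_compound_id`) are assumptions based on the legacy UniChem REST interface and should be checked against the current documentation. The sample reply is invented so that no live call is needed.

```python
import json
from urllib.parse import quote

# Assumed base URL of the legacy UniChem REST service; check the current docs.
UNICHEM_BASE = "https://www.ebi.ac.uk/unichem/rest"


def build_lookup_url(compound_id, source_id):
    """Build a src_compound_id lookup URL, percent-encoding the identifier."""
    encoded = quote(str(compound_id), safe="")
    return f"{UNICHEM_BASE}/src_compound_id/{encoded}/{int(source_id)}"


def parse_cross_references(payload):
    """Group the returned identifiers by source id: {src_id: [src_compound_id, ...]}."""
    grouped = {}
    for entry in json.loads(payload):
        grouped.setdefault(entry["src_id"], []).append(entry["src_compound_id"])
    return grouped


# Invented reply in the assumed shape, used instead of a live request.
sample = (
    '[{"src_id": "2", "src_compound_id": "DB00000"},'
    ' {"src_id": "22", "src_compound_id": "1234"}]'
)
print(build_lookup_url("CHEMBL12", 1))
print(parse_cross_references(sample))
```

In a real script the reply text would come from the URL, for example via `urllib.request.urlopen`, rather than from a hard-coded string.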

Read more…


Comments

Flagging potential aggregators in Vortex

 

Promiscuous inhibition caused by small molecule aggregation is a major source of false positive results in high-throughput screening. A particularly valuable recent publication, Irwin, Duan, Torosyan, Doak, Ziebart, Sterling, Tumanian and Shoichet, J Med Chem, 2015, 58(17), 7076-7087 DOI, has collated over 12,000 organic molecules known to act as aggregators at concentrations used in screening campaigns, and provides a resource, Aggregation Advisor, that can be used to try to predict possible false positives. However, in many instances it would be unwise to submit proprietary information to the public web service. Potential aggregators are flagged based on calculated LogP >3 and/or similarity >0.85 to a known aggregator (using a path-based fingerprint). This script calculates xLogP using the algorithm provided by Dotmatics and then uses OpenBabel fast search to find the closest similarity to a known aggregator.
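
The flagging rule itself (LogP >3 and/or similarity >0.85) is simple enough to sketch in plain Python. This is not the Vortex script: it assumes the LogP value and the fingerprints have already been calculated elsewhere, represents each fingerprint as a set of on-bit positions, and uses the Tanimoto coefficient as the similarity measure. The names `tanimoto` and `flag_aggregator` are illustrative.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not bits_a and not bits_b:
        return 0.0
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def flag_aggregator(logp, fingerprint, known_aggregators,
                    logp_cutoff=3.0, similarity_cutoff=0.85):
    """Return the reasons a molecule is flagged (an empty list means not flagged)."""
    reasons = []
    if logp > logp_cutoff:
        reasons.append("logP")
    # Closest similarity to any known aggregator; 0.0 if the reference list is empty.
    best = max((tanimoto(fingerprint, fp) for fp in known_aggregators), default=0.0)
    if best > similarity_cutoff:
        reasons.append("similarity")
    return reasons


# Toy fingerprints: the query shares 9 of 10 bits with the known aggregator.
known = [set(range(1, 11))]
query = set(range(1, 10))
print(flag_aggregator(3.5, query, known))
print(flag_aggregator(1.0, {1, 2}, known))
```

Returning the reasons rather than a single yes/no keeps the "and/or" visible, so a compound flagged on both criteria can be told apart from one flagged on lipophilicity alone.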

Full details of the Vortex script are here.


Comments

LSH-based similarity search in MongoDB is faster than postgres cartridge

 

There is a great blog article on ChEMBL-og, describing their work evaluating chemical structure based searching in MongoDB. MongoDB is a NoSQL database designed for scalability and performance that is attracting a lot of interest at the moment.

The article does a great job in explaining the logic behind improving the search performance.

They also provide an iPython notebook so you can try it yourself.

Comments

Finding Duplicate structures

 

It is always interesting to note which scripts attract the most attention; often it is scripts that aid with relatively simple tasks. Among the Applescripts it is the script to simply print the clipboard.

Recently I wrote a script to remove duplicate structures from within Vortex.

When working with multiple data sets of molecules, particularly if combining them from multiple sources, one of the most common tasks is removal of duplicates. This can be a time-consuming and error prone process if carried out manually and this script should hopefully make this a much easier task.

This seems to have attracted interest but I got a comment that it "works fine but is slow for larger data sets". So I've been looking at improving performance.

In order to test the performance I took around 150,000 random structures from ChEMBL and then duplicated 0.01% to give a test set of 160,146 molecules. The original version of the script took 95 mins; using the same test set, version 2 of the script took less than 3 mins! This increase in performance means that it is now practical to use the script on much larger datasets.
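
The post does not say how version 2 achieves its speed-up. One common way to get this kind of improvement, sketched here purely as an assumption, is to reduce each structure to a canonical text key (for example an InChIKey or canonical SMILES, taken as already calculated) and to track the keys seen so far in a set, so each molecule is checked once rather than compared against every other molecule. The record layout and key values below are invented.

```python
def remove_duplicates(records, key="inchikey"):
    """Keep the first record for each structure key; return (unique, duplicates)."""
    seen = set()
    unique, duplicates = [], []
    for record in records:
        identifier = record[key]
        if identifier in seen:
            duplicates.append(record)
        else:
            seen.add(identifier)
            unique.append(record)
    return unique, duplicates


# Invented records; the keys stand in for real canonical identifiers.
records = [
    {"id": "A1", "inchikey": "KEY-ONE"},
    {"id": "B7", "inchikey": "KEY-TWO"},
    {"id": "C3", "inchikey": "KEY-ONE"},
]
unique, duplicates = remove_duplicates(records)
print([r["id"] for r in unique], [r["id"] for r in duplicates])
```

Set membership tests take roughly constant time, so the run time grows linearly with the number of molecules instead of quadratically.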

You can read full details and download it here.

There are many more Hints, scripts and tutorials here.

Comments

Polyphony

 

Polyphony is an open source software suite written in Python. Its purpose is the superimposition-free analysis and comparison of multiple 3D structures of the same or closely related protein molecules.

Absolute Requirements

python 2.6 or later, scipy, numpy, Biopython, especially the Bio.PDB module

Highly recommended

All following documentation assumes that you have these installed.

ipython, for interactive python scripting; matplotlib, for graph plotting; PyMOL, for interactive 3D visualisation (open source version available on SourceForge).

William R Pitt, Rinaldo W Montalvão and Tom L Blundell, BMC Bioinformatics, 2014, 15:324 doi

Comments

ChEMBL python update.

 

Excellent blog post on the ChEMBL python update.

http://chembl.blogspot.co.uk/2015/07/chembl-python-client-update.html

Comments

Accessing Open Source Malaria Data using an iPython Notebook

 

The Open Source Malaria project is trying a different approach to curing malaria. Guided by open source principles, everything is open and anyone can contribute. To date a lot of people around the world have made contributions and the project is at a very exciting stage. Whilst everyone can see the compounds that have been made and the biological data, it is often spread over multiple web pages and can be tricky to link molecule with identifier with data. Over the last couple of months a significant effort has been put into populating a spreadsheet with all the information.

I recently published a Vortex script to access the information, and I've now published an iPython notebook that also shows how to import the data. Why not give it a try and then contribute your findings and suggestions to the Open Source Malaria project?

Comments

Script to remove duplicates in Vortex

 

When working with multiple data sets of molecules, particularly if combining them from multiple sources, one of the most common tasks is removal of duplicates. This can be a time-consuming and error prone process if carried out manually and this script should hopefully make this a much easier task.

http://macinchem.org/reviews/vortex/tut27/scripting_vortex27.php.

There are many more Hints, scripts and tutorials here.

Comments

Installing Open Drug Discovery Toolkit (ODDT)

 

A recent paper in J Cheminformatics, "Open Drug Discovery Toolkit (ODDT): a new open-source player in the drug discovery field" DOI, described a free and open source tool for both computer aided drug discovery (CADD) developers and researchers. Open Drug Discovery Toolkit is released under a permissive 3-clause BSD license for both academic and industrial use. ODDT’s source code, additional examples and documentation are available on GitHub.

To install ODDT on a Mac you first need to install the appropriate toolkits. The easiest way is to use Homebrew, and I've written a page detailing how to do this here.

Once the toolkits are installed you can install ODDT using pip as described here.

Comments

Poll on data analysis tools

 

The results of the 16th annual KDnuggets Software Poll on data analysis tools are in.

The top 10 tools by share of users were:

R, 46.9% share (38.5% in 2014, 37% in 2013)
RapidMiner, 31.5% (44.2% in 2014, 39% in 2013)
SQL, 30.9% (25.3% in 2014, NA in 2013)
Python, 30.3% (19.5% in 2014, 13% in 2013)
Excel, 22.9% (25.8% in 2014, 28% in 2013)
KNIME, 20.0% (15.0% in 2014, 6% in 2013)
Hadoop, 18.4% (12.7% in 2014, 9% in 2013)
Tableau, 12.4% (9.1% in 2014, NA in 2013)
SAS, 11.3% (10.9% in 2014, 10.7% in 2013)
Spark, 11.3% (2.6% in 2014, NA in 2013)

The results very much reflect my own interactions. Whilst R has a significant installed user base and of course a vast repository of open source packages, Python seems to be gaining traction, certainly in part because it seems to have become the lingua franca for scientific computing.

I've always thought of KNIME and Tableau as excellent tools for implementing workflows but looking at recent iterations it is clear there is now greater emphasis on interactive analysis.

There is a listing of data analysis tools for Mac OS X here.

Comments

Scripting Vortex 25

 

Whilst most of the Vortex scripts mentioned on this site to date involve chemical structures, we should not forget that Vortex is an excellent general data analytics tool and the data set does not have to include any molecular structures. Recently I was asked about the number of publications associated with a particular potential therapeutic target and it struck me that Vortex might actually be an excellent tool to investigate this.

Read More.


Comments

PYMOL under Yosemite

 

Reading through the discussion on Scientific Applications under Yosemite it seems some people are having problems with PYMOL. I thought I'd mention that installation of PYMOL using Homebrew is included on the page describing how to set up a Mac for Cheminformatics. The page also describes how to install a wide range of other useful tools.

Comments

Substructure searching very large compound collections.

 

I previously described how Vortex can be scripted to run multiple sub-structure searches using SMARTS. There are many occasions when this sort of feature is useful, for example if you want to flag molecules that contain reactive functional groups, toxicophores, or PAINS functional groups that have been shown to interfere with a variety of screens. Whilst the script worked fine, it was rather slow for larger datasets. In the latest tutorial you can see how to take advantage of some of the latest features in Vortex to substantially improve search speeds, allowing searching of 70 million compound collections on a desktop.

Scripting Vortex 24:- Substructure searching very large compound collections.

There are many more scripts listed on the Hints and Tutorials Page.

Comments

PyCharm

 

I must admit I’m a big fan of BBEdit for all my text editing, Markdown and python programming but I still keep an eye out for interesting alternatives. I was recently sent a link to PyCharm, a Python IDE. PyCharm's code editor provides support for Python, JavaScript, CoffeeScript, TypeScript, CSS, and a number of other languages. What caught my eye was the recently added support for iPython notebooks: with PyCharm 4 you can perform all the usual IPython Notebook actions with *.ipynb files. Everything you're used to doing with the ordinary IPython Notebook is now supported inside PyCharm.


Another very useful feature for scientific programming is the NumPy array viewer, which gives an easy graphical view of a NumPy array, together with support for matplotlib.

There is a really comprehensive support section that includes demos and screencasts.

Comments

Vortex scripts to access ChEMBL

 

ChEMBL is a manually curated chemical database of bioactive molecules. It is maintained by the European Bioinformatics Institute (EBI), of the European Molecular Biology Laboratory (EMBL), based at the Wellcome Trust Genome Campus, Hinxton, UK. The database currently contains over 1.4 million unique structures with the associated activity at 10,579 different targets. It also acts as a repository for Open Access primary screening and medicinal chemistry data directed at neglected diseases.

Whilst the database can be downloaded, the data can also be accessed via a web interface (shown below) and a series of web services. These Vortex scripts show how it is possible to pull data from ChEMBL into Vortex.

As usual I’ve written it as a tutorial to try and offer some explanation of how the script works: Scripting Vortex 23:- Accessing ChEMBL using Web Services

I think this rather nicely shows the power of web services and json.
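
To give a flavour of the web services plus JSON approach outside Vortex, here is a minimal Python sketch. The base URL and the field names (`molecule_chembl_id`, `pref_name`, `molecule_structures`, `canonical_smiles`) are assumptions based on the current ChEMBL data API, which may differ from the service the tutorial was written against. The sample reply is a heavily trimmed, hand-written stand-in, so no live request is made.

```python
import json

# Assumed base URL of the current ChEMBL data web services.
CHEMBL_BASE = "https://www.ebi.ac.uk/chembl/api/data"


def molecule_url(chembl_id):
    """URL that would return the JSON record for one molecule."""
    return f"{CHEMBL_BASE}/molecule/{chembl_id}.json"


def summarise_molecule(payload):
    """Pick a few fields out of a molecule record, tolerating missing ones."""
    record = json.loads(payload)
    structures = record.get("molecule_structures") or {}
    return {
        "chembl_id": record.get("molecule_chembl_id"),
        "name": record.get("pref_name"),
        "smiles": structures.get("canonical_smiles"),
    }


# Hand-written, trimmed stand-in for a reply (CHEMBL25 is aspirin).
sample = (
    '{"molecule_chembl_id": "CHEMBL25", "pref_name": "ASPIRIN",'
    ' "molecule_structures": {"canonical_smiles": "CC(=O)Oc1ccccc1C(=O)O"}}'
)
print(molecule_url("CHEMBL25"))
print(summarise_molecule(sample))
```

Because the reply is plain JSON, the same few lines of parsing work in any scripting environment that can fetch a URL, which is what makes the approach so convenient.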

There is a list of other Vortex scripts on the Hints and Tutorials page.

Comments

Scripting Vortex

 

One of the really neat features of the latest version of Vortex (> build 29622) is the ability to script multiple sub-structure searches using SMARTS. There are many occasions when this sort of feature is useful, for example if you want to flag molecules that contain reactive functional groups, toxicophores, or PAINS functional groups that have been shown to interfere with a variety of screens. Alternatively, if you have a drug discovery project with multiple chemotypes you might want to tag particular groups of compounds as belonging to a named series to aid analysis.

The latest Vortex tutorial/script shows how to do this.


Comments

Cheminformatics iPython notebook

 

George Papadatos, from the ChEMBL group, has produced a superb iPython notebook tutorial demonstrating the use of RDkit.


Comments

Sage mathematics software

 

Sage is a Python based free open-source mathematics software system licensed under the GPL. It builds on top of nearly 100 open-source packages: NumPy, SciPy, matplotlib, Sympy, Maxima, GAP, FLINT, R to provide a common unified interface, either as a notebook in a web browser or the command line.

In addition to a local installation it is also possible to use SageMathCloud, a free service with support from the University of Washington.

I’ve added Sage to the list of data analysis tools for Mac OS X.

Comments

Computable

 

I’ve just added Computable to the mobile science site. Computable brings iPython and SciPy to the iPad, allowing you to create, edit and run iPython notebooks on your iPad. Computable comes with a full-featured SciPy stack: NumPy, SciPy, SymPy, Pandas, Matplotlib. The free download includes a series of 32 example notebooks and lectures that allow you to evaluate the application. To create or edit your own notebooks you will have to make an in-app purchase to unlock the full feature set ($9.99).


Comments

Learning Python

 

It seems that Python is becoming the preferred language for scripting in science, so I wrote a getting started page for Chemists. Several people have since pointed out a couple of resources that may be useful, in particular Rosalind.

Rosalind is a platform for learning bioinformatics and programming through problem solving.

It looks like an excellent starting point for newcomers and more experienced programmers. Whilst it is focussed on bioinformatics, the exercises are useful for all disciplines.

For chemists chempython looks to be a very useful resource.

Comments

Vortex script to classify acid, base, neutral

 

I’ve written several Vortex scripts that use external tools to calculate physicochemical properties including the use of ChemAxon (e.g. charge, pKa, logP, logD). However I often need to simply classify molecules as acid, base, neutral or zwitterion, so I’ve updated the script to create another column containing a text annotation.
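
One simple way to turn calculated pKa values into such a text annotation is sketched below. This is not the Vortex script. It assumes the acidic and basic pKa values have already been calculated by an external tool, and it uses a deliberately crude rule: an acidic group counts as ionised if its pKa is below the chosen pH, and a basic group if its pKa is above it. The function name and the default pH of 7.4 are illustrative choices.

```python
def classify_ionisation(acidic_pkas, basic_pkas, ph=7.4):
    """Label a molecule acid, base, neutral or zwitterion at the given pH."""
    has_anion = any(pka < ph for pka in acidic_pkas)   # acidic group mostly deprotonated
    has_cation = any(pka > ph for pka in basic_pkas)   # basic group mostly protonated
    if has_anion and has_cation:
        return "zwitterion"
    if has_anion:
        return "acid"
    if has_cation:
        return "base"
    return "neutral"


# Illustrative pKa values only.
print(classify_ionisation([4.2], []))      # carboxylic acid type
print(classify_ionisation([], [9.5]))      # aliphatic amine type
print(classify_ionisation([2.3], [9.7]))   # amino acid type
print(classify_ionisation([10.0], []))     # weakly acidic group, not ionised at pH 7.4
```

A hard cut-off at the pH treats a group with a pKa of 7.3 very differently from one at 7.5, so a real script might prefer a margin of a unit or so either side.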


Comments

Scripting Vortex 21, displaying web pages

 

Well, things can change quickly at times. In the last tutorial I wrote:

Vortex has a limited capacity to render HTML, it is however a very limited ability so there is no support for javascript or CSS but you can introduce a number of useful extra features.

If you download the latest daily build of Vortex from the Dotmatics Support site there is a version that comes bundled with Java 8; if you download this version there are a host of new options for displaying plots. In particular you can now display web pages, follow links on pages, and there is support for javascript.

In Scripting Vortex 21 there is a demonstration of this feature and an example script that uses SMARTCyp to predict sites of metabolism.


There are many more scripts on the Hints and Tutorials Page.

Comments

Scripting Vortex 19 Updated

 

This is another Vortex script; this one is used to implement a central nervous system (CNS) penetration algorithm described in the literature.

It is clear from many publications that a number of physicochemical properties influence central nervous system (CNS) penetration and it is often possible to play off one property against another in an effort to improve CNS penetration. An interesting paper from Wager et al., Moving beyond Rules: The Development of a Central Nervous System Multiparameter Optimization (CNS MPO) Approach To Enable Alignment of Druglike Properties, describes an algorithm to score compounds with respect to CNS penetration.

The CNS MPO score was built based on six fundamental physicochemical properties: ClogP, ClogD, MW, TPSA, HBD, and pKa, each weighted from 0 to 1.0.
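
A score of this kind can be sketched in a few lines of Python: each property is mapped to a desirability between 0 and 1.0 and the six values are summed, giving a total between 0 and 6. This is not the Vortex script. The threshold values below are given as I recall them from the published CNS MPO paper and should be treated as assumptions to check against the publication; the helper names `falling` and `hump` are invented.

```python
def falling(value, full, zero):
    """Desirability of 1.0 up to `full`, falling linearly to 0.0 at `zero`."""
    if value <= full:
        return 1.0
    if value >= zero:
        return 0.0
    return (zero - value) / (zero - full)


def hump(value, low_zero, low_full, high_full, high_zero):
    """Desirability of 1.0 inside [low_full, high_full], ramping to 0.0 on both sides."""
    if value <= low_zero or value >= high_zero:
        return 0.0
    if value < low_full:
        return (value - low_zero) / (low_full - low_zero)
    if value <= high_full:
        return 1.0
    return (high_zero - value) / (high_zero - high_full)


def cns_mpo(clogp, clogd, mw, tpsa, hbd, pka):
    """Sum of six desirabilities, each 0 to 1.0 (thresholds are assumptions)."""
    return (
        falling(clogp, 3.0, 5.0)
        + falling(clogd, 2.0, 4.0)
        + falling(mw, 360.0, 500.0)
        + hump(tpsa, 20.0, 40.0, 90.0, 120.0)
        + falling(hbd, 0.5, 3.5)
        + falling(pka, 8.0, 10.0)
    )


# Hypothetical property values for a single compound.
print(cns_mpo(clogp=2.0, clogd=1.0, mw=300.0, tpsa=60.0, hbd=0, pka=7.0))
```

Because the components are simply added, a poor value for one property can be offset by good values elsewhere, which is the "play off one property against another" idea described above.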

Update

One of the popular features in Vortex is to colour code columns. This is done automatically, but sometimes you want to customise the colouring. For example, in one set of values smaller values might be better, while in another column (perhaps an off-target activity) larger numbers might be better. Chatting to Sune Askjær, the author of the Unichem Script, it seemed that this might be a nice addition to this script.

The updated script is here.

Comments

Scripting Vortex 19

 

This is another Vortex script; this one is used to implement a central nervous system (CNS) penetration algorithm described in the literature.

It is clear from many publications that a number of physicochemical properties influence central nervous system (CNS) penetration and it is often possible to play off one property against another in an effort to improve CNS penetration. An interesting paper from Wager et al., Moving beyond Rules: The Development of a Central Nervous System Multiparameter Optimization (CNS MPO) Approach To Enable Alignment of Druglike Properties, describes an algorithm to score compounds with respect to CNS penetration.

The CNS MPO score was built based on six fundamental physicochemical properties: ClogP, ClogD, MW, TPSA, HBD, and pKa, each weighted from 0 to 1.0. Full details of the script are here.

Comments

Python, Chemistry and a Mac 1

 

After I posted the page on setting up a Mac for Cheminformatics I was asked if I could do something similar for writing chemistry (or Science in general) Python scripts on a Mac. So I’ve written a “How to” page on setting up your Mac to use the iPython notebook and write simple scripts that use Pybel to access OpenBabel.

The page is here Python, Chemistry and a Mac 1, and I’ll probably add more pages/scripts in the future.

Comments

Cheminformatics on a Mac

 

I’ve recently needed to set up a new Mac and I realised that the current installation process for all the applications, tools, chemistry toolboxes, and associated dependencies was unmanageable. I have a mixture of apps that I have compiled myself, others for which I have simply used the precompiled binaries, others from MacPorts etc.

I decided to write a detailed account of the process of installing a number of toolkits and packages using Homebrew and pip.

You can read the full account here in the hints and tutorials.

I’d be delighted to hear of any comments or suggestions for addition.



Comments

Scripting Vortex 17 tutorial

 

In the tutorial Scripting Vortex 15 I showed how it is possible to create a contextual script for Vortex that downloads a specific PDB file. A FlexAlign Vortex script then identifies the structure column, gets the SMILES string of the selected molecule, generates a 3D structure and uses Flex Align to do a one-shot flexalign between the ligand in the system in MOE and the incoming ligand.

While this is useful if you have similar structures (perhaps analogues in a series) there will certainly be situations where it may be preferable to dock the new ligand into the binding site. The Scripting Vortex 17 tutorial describes how to achieve this.

Comments

PDBinout

 

Ever had problems with an unusually formatted PDB file? PDBinout is a file conversion tool for PDB files that might interest you. It was created by Tomasz Woźniak at the Laboratory of Structural Chemistry of Nucleic Acids, Institute of Bioorganic Chemistry, Polish Academy of Sciences.

PDB format is the most commonly used by various programs to define three-dimensional structure of biomolecules. Those programs however, often use different versions of this format. Therefore, it is often necessary to write own re-formatting scripts or change files manually, which makes PDB files less convenient to use. There are only few tools allowing to change one or two versions of PDB format into another and no comprehensive approach for unifying PDB format was developed. Here we present an open-source, Python-based tool PDBinout for processing and conversion of various versions of PDB file format for biostructural applications. Moreover, PDBinout allows to create one’s own PDB versions.

The download also includes a tutorial.

Reference Woźniak T. and Adamiak R.W. (2013) Personalization of structural PDB files, Acta Biochimica Polonica 60, Paper in Press

Comments

Scripting Vortex 16

 

OCHEM is a free open access site of annotated models and chemical data. OCHEM contains 1831772 experimental records for about 477 properties collected from 12457 sources. You are free to upload your own data and also to build predictive models using existing data or your own.

There are also a number of already built models that the public can access. These include:

  • Ames test
  • CYP1A2 inhibition
  • LogP and Solubility

You can run predictions on OCHEM using simple REST-like web services. These Vortex scripts submit tasks to the various models and then retrieve the resulting predictions.

Comments

Scripting Vortex and MOE

One of the new features in the latest version of MOE from Chemical Computing Group is the Listener. The MOE socket listener provides an alternative to MOE/web for executing functions remotely on a running instance of MOE.

The script will download the associated PDB structures from the rcsb Protein Data Bank, put them into a database then start the browser. It may take a few seconds to download the structure; this does rely on MOE having the right proxy settings to access the internet (use the Java console to set them). You can now transfer this to MOE and amend the display to highlight the ligand.

The MOEflexalign script takes the SMILES string of the selected row, generates a 3D structure and does a one-shot flexalign between the ligand in the system in MOE and the incoming ligand.

It is probably easier to see this in action. If the video appears rather small, click on the YouTube icon in the bottom right corner of the video.

Full details are here.




Comments

Scripting Vortex 14

In tutorial 4 we looked at using the command line tool sddesc from Chemical Computing Group to calculate a number of molecular descriptors and then import them into Vortex. However, there are a couple of issues with doing this, not least ensuring all the environment variables are set correctly. An alternative is to use MOE as a web service and access the tools using the SOAP protocol (Simple Object Access Protocol). This protocol provides a specification for exchanging structured information in the implementation of Web Services in computer networks. It relies on XML Information Set for its message format.

Full details are here...



Comments

Chocolat Mac text editor

I’m a long-time BBEdit user but I do keep an eye out for Mac text editors. Chocolat is a new text editor for the Mac that might be worth looking at. It supports split-window editing, code folding, and code completion, and it can be used for a wide range of programming and scripting languages.

Comments

KosmicTask

KosmicTask is an integrated scripting environment for Mac OS X. Whilst Mac OS X supports a number of scripting technologies, either via its UNIX roots (shell scripting, Perl etc.) or via Cocoa Framework Scripting using Apple’s scripting bridge (Applescript, Ruby, Python etc.), you can end up using a different script editor for each scripting language. KosmicTask allows you to script in a wide variety of languages from within a single editor. It uses a plugin architecture that allows it to support a range of scripting languages; details of the languages supported by KosmicTask are shown below:-

KosmicTask also supports another very capable means of achieving automation, appscript. Appscript is supported by both Ruby and Python as an alternative to the ScriptingBridge.

It also allows sharing of scripts with other KosmicTask users via the local shared network.

KosmicTask

I’ve also added it to the list of Applescript Resources.

Comments