FPSim2 is a new tool, being developed at ChEMBL, for fast similarity searching of large compound datasets (>100 million compounds). It is a Python 3 library that supports either in-memory or out-of-core searches on datasets of that size.
It is built using RDKit and can be installed using conda. It requires Python 3.6 and a recent version of RDKit.
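Fingerprint similarity searches of this kind are commonly scored with the Tanimoto coefficient. The sketch below is only an illustration of that idea in plain Python, not FPSim2's API or implementation: the function names, the toy 8-bit fingerprints and the threshold are all made up for the example.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two fingerprints stored as Python ints (bit vectors)."""
    common = bin(a & b).count("1")
    total = bin(a).count("1") + bin(b).count("1") - common
    return common / total if total else 0.0


def similarity_search(query: int, database: dict, threshold: float) -> list:
    """Return (id, score) pairs with score >= threshold, best first."""
    scored = [(mol_id, tanimoto(query, fp)) for mol_id, fp in database.items()]
    hits = [(mol_id, score) for mol_id, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy 8-bit fingerprints (hypothetical); real fingerprints are far longer.
toy_db = {"mol_a": 0b10110010, "mol_b": 0b10110011, "mol_c": 0b01001100}
results = similarity_search(0b10110010, toy_db, threshold=0.7)
print(results)
```

A linear scan like this is fine for a few thousand molecules; a dedicated library is needed at the dataset sizes mentioned above.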
I've written a couple of Jupyter notebooks to demonstrate its use.
You can read the full tutorial here, and download the notebooks.
Small molecules can potentially bind to a variety of biomolecular targets. Whilst counter-screening against a wide variety of targets is feasible, it can be rather expensive and is probably only realistic once a compound has been identified as being of particular interest. For this reason there is considerable interest in building computational models to predict potential interactions. With the advent of large data sets of well-annotated biological activity, such as ChEMBL and BindingDB, this has become possible.
ChEMBL 24 contains 15,207,914 activity data points covering 12,091 targets and 2,275,906 compound records; BindingDB contains 1,454,892 binding data points for 7,082 protein targets and 652,068 small molecules.
These predictions may aid understanding of the molecular mechanisms underlying a molecule's bioactivity and help predict potential side effects or cross-reactivity.
Whilst a number of sites can be used to predict bioactivity, I'm going to compare one site, Polypharmacology Browser 2 (PPB2) http://ppb2.gdb.tools, with two tools that can be downloaded to run the predictions locally: a set of Jupyter notebook models built on ChEMBL data by the ChEMBL group https://github.com/madgpap/notebooks/blob/master/targetpred21_demo.ipynb and a more recent random forest model, PIDGIN. If you are working with proprietary molecules it is unwise to use the online tools.
I'm constantly impressed by the expansion of Jupyter; it is rapidly becoming the first-choice platform for interactive computing.
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
Swift for TensorFlow is a new way to develop machine learning models. It gives you the power of TensorFlow directly integrated into the Swift programming language. With Swift, you can write the following imperative code, and Swift automatically turns it into a single TensorFlow Graph and runs it with the full performance of TensorFlow Sessions on CPU, GPU and TPU.
Requires macOS 10.13.5 or later, with Xcode 10.0 beta or later.
I always keep an eye out for the polls on KDnuggets. The latest one looks at Python editors and IDEs; over 1900 people took part, and the results are shown below (users could select up to 3). There is more detail on the linked page.
I've become a great fan of Jupyter, and not only for Python.
I've been using Jupyter notebooks for a little while but I only recently found out that you can embed LaTeX or MathML into a notebook!
This notebook is just a series of examples of what can be done. You can embed equations inline or on a separate line in a markdown text cell, or in a code cell by importing Math or invoking latex.
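In a markdown cell, inline equations go between single dollar signs and display equations between double dollar signs. For code cells, one further route alongside those mentioned above is to give an object a `_repr_latex_` method, which the notebook uses to render it as typeset maths. A minimal sketch, with a hypothetical class name and an equation chosen purely for illustration:

```python
class Arrhenius:
    """Toy object that a Jupyter notebook displays as a typeset equation."""

    def _repr_latex_(self) -> str:
        # The notebook's rich display looks for this hook and renders
        # the returned string as LaTeX.
        return r"$k = A\,e^{-E_a / (R T)}$"


equation = Arrhenius()
latex_source = equation._repr_latex_()
print(latex_source)
```

Leaving `equation` as the last expression of a notebook cell should show the rendered formula; outside a notebook the method simply returns the LaTeX source.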
Recently ChEMBL was updated to version 24. The update contains:
- 2,275,906 compound records
- 1,828,820 compounds (of which 1,820,035 have mol files)
- 15,207,914 activities
- 1,060,283 assays
- 12,091 targets
- 69,861 documents
In addition, today they released the predictive models built on the updated database; they can be downloaded from the ChEMBL FTP server ftp://ftp.ebi.ac.uk/pub/databases/chembl/target_predictions
There are 1569 models.
A recent paper, "The Catch-22 of Predicting hERG Blockade Using Publicly Accessible Bioactivity Data" (DOI), described a classification model for hERG activity. I was delighted to see that all the datasets used in the study, including the training and external datasets, and the models generated using these datasets were provided as individual data files (CSV) and Python Jupyter notebooks, respectively, on GitHub (https://github.com/AGPreissner/Publications).
The models were downloaded and the Random Forest Jupyter notebooks (using RDKit) were modified to save the generated model using pickle. Another Jupyter notebook was then created to load the saved model without the need to rebuild it each time. This notebook was exported as a Python script to allow command-line access, and Vortex scripts were created that allow the user to run the model within Vortex, import the results and view the most significant features.
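The save-once, load-later pattern looks roughly like this. It is a sketch only: a plain dict with made-up parameters stands in for the trained random forest, and the file name is hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: a plain dict of made-up parameters.
model = {"name": "toy-classifier", "weights": [0.4, -1.2, 0.7], "bias": 0.1}

model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Save once, after training.
with open(model_path, "wb") as handle:
    pickle.dump(model, handle)

# Load later, without rebuilding the model.
with open(model_path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == model)
```

A fitted model object can be pickled in the same way, but it should be loaded with the same library versions that created it, and pickle files should only be opened if they come from a source you trust.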
All models and scripts are available for download.
Well, after my last post about Swift and Jupyter, a reader sent me a link to the use of both the Julia and Fortran programming languages in a Jupyter Notebook.
There is more information in this lecture, "Project Jupyter: Architecture and Evolution of an Open Platform for Modern Data Science", by Fernando Perez.
Project Jupyter, evolved from the IPython environment, provides a platform for interactive computing that is widely used today in research, education, journalism and industry. The core premise of the Jupyter architecture is to provide tools for human-in-the-loop interactive computing. It provides protocols, file formats, libraries and user-facing tools optimized for the task of humans interactively exploring problems with the aid of a computer, combining natural and programming languages in a common computational narrative.
I'm a great fan of Jupyter Notebooks but I only ever use Python.
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.
A recent post by Ray Yamamoto Hilton caught my eye; he put together a little experiment to demonstrate using Swift 4.1 from within Jupyter Notebooks.
You can download a demo notebook here.
The RCSB Protein Data Bank is an absolutely invaluable resource that provides archive-information about the 3D shapes of proteins, nucleic acids, and complex assemblies that helps scientists understand all aspects of biomedicine and agriculture, from protein synthesis to health and disease. Currently the PDB contains over 134,000 data files containing structural information on 42547 distinct protein sequences of which 37600 are human sequences. They also provide a series of tools to search, view and analyse the data.
Downloading an individual PDB file is pretty trivial and can be done from the web page as shown in the image below. They also provide a Download Tool, launched as a stand-alone application using the Java Web Start protocol. The tool is downloaded locally and must then be opened. I've found this a little temperamental and had issues with Java versions and security settings.
Since I've been making extensive use of the web services to interact with RCSB I decided to explore the use of Python to download multiple files. I started off creating a Jupyter notebook using the web services provided by RCSB.
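As a rough idea of what a bulk download can look like, here is a minimal sketch using only the standard library. It is not the notebook itself: the function names are hypothetical, and the download URL pattern is an assumption that should be checked against the current RCSB documentation.

```python
import os
import urllib.request

# Assumed download endpoint; verify against the RCSB documentation.
DOWNLOAD_URL = "https://files.rcsb.org/download/{pdb_id}.pdb"


def pdb_url(pdb_id: str) -> str:
    """Build the download URL for one four-character PDB identifier."""
    return DOWNLOAD_URL.format(pdb_id=pdb_id.strip().upper())


def download_pdb_files(pdb_ids, out_dir="pdb_files"):
    """Download each entry into out_dir and return the list of saved paths."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for pdb_id in pdb_ids:
        target = os.path.join(out_dir, pdb_id.strip().upper() + ".pdb")
        urllib.request.urlretrieve(pdb_url(pdb_id), target)
        saved.append(target)
    return saved


print(pdb_url("1crn"))
```

Calling `download_pdb_files(["1CRN", "4HHB"])` would then fetch both entries over the network into the `pdb_files` folder.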
I've also used variations on this code to create a Python script and a Vortex script.
I've become a great fan of Jupyter Notebooks as a way of modelling cheminformatics data, and I've published some of the notebooks here.
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.
In the predicting AMES activity notebook I also looked at the use of pickle to store the predictive model and then access it using a Jupyter notebook without the need to rebuild the model. Whilst a notebook is a nice way to access the predictive model it might also be useful to be able to access it from other applications or from the command line.
In this tutorial we look at providing command line access to the model and then incorporating it into a Vortex script.
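A command-line wrapper around a pickled model might be sketched as follows. This is an illustration under stated assumptions rather than the tutorial's script: the toy linear `predict` function, the argument names and the demo model are all hypothetical stand-ins for the real classifier and its descriptors.

```python
import argparse
import os
import pickle
import tempfile


def predict(model: dict, features: list) -> int:
    """Toy linear classifier standing in for the real model's predict call."""
    score = model["bias"] + sum(w * x for w, x in zip(model["weights"], features))
    return 1 if score > 0 else 0


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Score one input with a pickled model.")
    parser.add_argument("model_path", help="path to the pickled model")
    parser.add_argument("features", nargs="+", type=float, help="feature values")
    args = parser.parse_args(argv)
    with open(args.model_path, "rb") as handle:
        model = pickle.load(handle)
    label = predict(model, args.features)
    print(label)
    return label


# Demonstration: write a toy model, then call main() as the shell would.
demo_path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(demo_path, "wb") as handle:
    pickle.dump({"weights": [1.0, -1.0], "bias": 0.0}, handle)
demo_label = main([demo_path, "2.0", "0.5"])
```

In a stand-alone script the demonstration lines would be replaced by a call to `main()` under `if __name__ == "__main__":`, so that the arguments come from the shell. Printing the result to standard output keeps it easy for another application to capture.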
I'm in the process of updating the Jupyter notebooks to Python 3 and I'm looking at what I can do to make sure other people can reproduce the results. At the moment I annotate the imported Python modules with version numbers in the Jupyter notebook. Finding the versions is a bit tedious and I was wondering if there was some way to automate this?
    from rdkit import Chem  #rdkit 2016.03.5
    from rdkit.Chem import PandasTools
    import pandas as pd  #pandas==0.17.1
    import pandas_ml as pdml  #pandas-ml==0.4.0
    from rdkit.Chem import AllChem, DataStructs
    import numpy  #numpy==1.12.0
    from sklearn.model_selection import train_test_split  #scikit-learn==0.18.1
    import subprocess
    from StringIO import StringIO
    import pickle
    import os
    %matplotlib inline
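One possible way to automate the annotation is to ask Python for the installed version of each package. A minimal sketch, assuming Python 3.8 or later (where `importlib.metadata` is in the standard library); the helper name and the package list are only examples:

```python
from importlib import metadata


def report_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as used by the installer, which can differ from the
# import name (for example "scikit-learn" is imported as "sklearn").
for name, version in report_versions(["numpy", "pandas", "scikit-learn"]).items():
    print(f"{name}=={version}")
```

Packages that were not installed with standard metadata will show up as `None`; many of them expose a `__version__` attribute that can be checked instead.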
This guide is a set of Jupyter notebooks intended to help researchers already familiar with molecular dynamics simulation learn how to use OpenMM in their research and software projects.
    # For Mac OS X, substitute `MacOSX` for `Linux` below
    wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash ./Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda
    export PATH=$HOME/miniconda/bin:$PATH
    conda install --yes -c omnia -c conda-forge jupyter notebook openmm mdtraj nglview
There is a detailed document describing OpenMM here.
OpenMM is a set of libraries that lets programmers easily add molecular simulation features to their programs, and an “application layer” that exposes those features to end users who just want to run simulations. Instructions for installation under MacOSX are here.
OpenMM works on Mac OS X 10.7 or later. OpenCL is supported on OS X 10.10.3 or later.
In the previous workflow I described docking a set of ligands with known activity into a target protein. In this workflow we will be using a set of ligands from the ZINC dataset to search for novel ligands. Once the ligands are docked, the workflow moves on to finding vendors and selecting subsets for purchase.
Whilst high-throughput screening (HTS) has been the starting point for many successful drug discovery programs, the cost of screening, the lack of access to a large diverse sample collection, or the low throughput of the primary assay may preclude HTS as a starting point, and identification of a smaller selection of compounds with a higher probability of being a hit may be desired. Directed or virtual screening is a computational technique used in drug discovery research to identify potential hits for evaluation in primary assays. It involves the rapid in silico assessment of large libraries of chemical structures in order to identify those structures that are most likely to be active against a drug target. The in silico screen can be based on similarity to known ligands or on docking ligands into the desired binding site.
I've updated the description to give more information about preparing the target protein.
This is a recording of the March 2017 Global Health Compound Design meeting, a webinar demonstrating the use of Jupyter, the free IPython notebook.
- How to get started
- Accessing Open Source Malaria data
- Calculating physicochemical properties and plotting
- Predicting AMES activity
I've now written a couple of Jupyter notebooks and one of the issues that has come up is how to share the notebooks in a way that ensures the results will be reproducible in an environment where updates to components occur regularly.
Binder is a collection of tools for building and executing version-controlled computational environments that contain code, data, and interactive front ends, like Jupyter notebooks. It's 100% open source.
At a high level, Binder is designed to make the following workflow as easy as possible
- Users specify a GitHub repository
- Repository contents are used to build Docker images
- Deploy containers on-demand in the browser on a cluster running Kubernetes
Common use cases include:
- sharing scientific work
- sharing journalism
- running tutorials and demos with minimal setup
- teaching courses
If you want to find out more have a look at this blog post by the developers.
I've been experimenting with the use of Jupyter Notebooks (aka IPython Notebooks) as an electronic lab notebook and also as a means to share computational models. The aim is to see how easy it would be to share a model together with the associated training data and an explanation of how the model was built and how it can be used for novel molecules.
The Ames test is a widely employed method that uses bacteria to test whether a given chemical can cause mutations in the DNA of the test organism. More formally, it is a biological assay to assess the mutagenic potential of chemical compounds. PNAS. 70 (8): 2281–5. doi
In this first notebook a random forest model to predict AMES activity is described….
The Molecular Design Toolkit is an open source environment that aims to seamlessly integrate molecular simulation, visualization and cloud computing. It offers access to a large and still-growing set of computational modelling methods through a science-focused Python API that can be easily installed using pip. It is ideal for building into a Jupyter notebook. The API is designed to handle both small molecules and large biomolecular structures, and both molecular mechanics and QM calculations.
There is a series of YouTube videos describing some of the functionality in more detail, starting with this introduction.
This blog post looks very interesting: a notebook environment for coding and data visualisation based on Jupyter (aka IPython) notebooks.
With nteract, you can create documents, that contain executable code, textual content, and images, and convey a computational narrative. Unlike Jupyter, your documents are stand-alone, cross-platform desktop applications, providing a seamless desktop experience and offline usage.
More details are on GitHub.