Modin is a library designed to accelerate pandas by automatically distributing the computation across all of the system's available CPU cores. Modin uses Ray to provide an effortless way to speed up your pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Modin provides seamless integration and compatibility with existing pandas code. Even using the DataFrame constructor is identical. Modin is a DataFrame designed for datasets from 1MB to 1TB+.
It can be installed using pip:
pip install modin
If you don't have Ray or Dask installed, you will need to install Modin with one of the targets:
pip install modin[ray]  # Install Modin dependencies and Ray to run on Ray
pip install modin[dask] # Install Modin dependencies and Dask to run on Dask
pip install modin[all]  # Install all of the above
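The drop-in compatibility described above can be sketched as follows; the try/except fallback is my own addition so the snippet also runs where Modin is not installed:

```python
# Modin is a drop-in replacement for pandas: only the import line changes.
# The fallback to plain pandas is my own addition for illustration.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
total = df["a"].sum()  # with Modin, computed in parallel across cores
print(total)
```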
Currently, Modin depends on pandas version 0.23.4.
I've added Modin to the Open Source Data Science Python Libraries.
I've recently become interested in comparing the amino-acid composition of peptides, to allow comparison of cyclic versus linear peptides, or brain-penetrant versus non-penetrant. I had a look around but could not find any tools that did this; in particular I wanted to include any non-proteinogenic amino-acids.
This tutorial provides a means to analyse many thousands of peptides using Vortex.
As a regular Jupyter/Python user, I find this publication (PLoS Comput Biol 15(7): e1007007, DOI) a great reminder of good practice, and as Jupyter becomes increasingly popular as a means to share code/data/results, writing the notebook in a manner that helps readers is increasingly important.
This ability to combine executable code and descriptive text in a single document has close ties to Knuth’s notion of “literate programming” and has convinced many researchers to switch to computational notebooks from other programming environments. Jupyter Notebooks in particular have seen widespread adoption: as of December 2018, there were more than 3 million Jupyter Notebooks shared publicly on GitHub (https://www.github.com), many of which document academic research.
There are of course many different ways to share Jupyter notebooks.
Whether you use notebooks to track preliminary analyses, to present polished results to collaborators, as finely tuned pipelines for recurring analyses, or for all of the above, following this advice will help you write and share analyses that are easier to read, run, and explore.
This looks like it could be very interesting.
A blog post by Greg Landrum describes a widget for displaying molecules in a Jupyter Notebook where you can select atoms and have the selection propagated back to Python.
This is basic, but I think it's a decent start towards something that could be really useful. Interested? Have suggestions (ideally accompanied by code!) on how to improve it? If it looks like this is actually likely to be used, I will figure out how to create a standalone nbwidget out of this and create a separate github repo for it.
Looks like a useful tool for selecting bonds for conformational analysis, selecting bonds for creating a Ramachandran plot, selecting groups for bioisosteric replacement…
Sounds like Greg is looking for input.
I was recently asked for a tool to compare the similarity of a list of molecules with every other molecule in the list. I suspect there may be commercial tools to do this but for small numbers of compounds it is easy to visualise in a Jupyter notebook using RDKit.
Read more here, MolecularSimilarityNotebook
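The underlying all-versus-all comparison can be sketched in plain Python; the bit-set "fingerprints" below are toy stand-ins for real RDKit fingerprints, and Tanimoto is the usual similarity measure:

```python
from itertools import combinations

# Toy illustration: Tanimoto similarity for every pair of molecules in a
# list. The bit sets stand in for real fingerprints (e.g. Morgan/ECFP).
def tanimoto(fp1, fp2):
    union = len(fp1 | fp2)
    return len(fp1 & fp2) / union if union else 0.0

fingerprints = {
    "mol_a": {1, 4, 7, 9},
    "mol_b": {1, 4, 8},
    "mol_c": {2, 3, 7, 9},
}

# Compare every molecule with every other molecule in the list
for (n1, fp1), (n2, fp2) in combinations(fingerprints.items(), 2):
    print(n1, n2, round(tanimoto(fp1, fp2), 2))
```

With RDKit itself, `DataStructs.BulkTanimotoSimilarity` does the pairwise step efficiently over real fingerprints.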
I'm a great fan of Jupyter notebooks and I'm always looking for ways to get more out of them. I came across this blog post recently which is packed with useful tips.
Whenever someone says ‘You can do that with an extension’ in the Jupyter ecosystem, it is often not clear what kind of extension they are talking about. The Jupyter ecosystem is very modular and extensible, so there are lots of ways to extend it. This blog post aims to provide a quick summary of the most common ways to extend Jupyter, and links to help you explore the extension ecosystem.
I've also published some notebooks under Tips and Tutorials, Jupyter notebooks
I've often wanted to try creating a word cloud and when Noel O'Boyle collected together all the tweets from the Sheffield Conf on Chemoinformatics this seemed a good opportunity.
Relive the Sheffield Conf on Chemoinformatics with these #shef2019 tweets I've pulled down from Twitter, link to tweet.
The Jupyter notebook used to create the word cloud is here; it uses the excellent word cloud generator word_cloud. You will need to download the text from the tweets from the link provided in the tweet.
If you use Binder to serve your Jupyter notebooks you will be interested in this.
Have a repository full of Jupyter notebooks? With Binder, open those notebooks in an executable environment, making your code immediately reproducible by anyone, anywhere
We flipped the switch on making mybinder.org a federation. This means that there are now two clusters that serve requests for mybinder.org. What changes for you as a user? Hopefully nothing. You will notice that if you visit mybinder.org (or any other link to it) you will be redirected to gke.mybinder.org or ovh.mybinder.org. Beyond that small change everything should keep working as before.
This should mean that Binder becomes more robust and less susceptible to outages. Now this is in place it should also be possible to add further server resources.
Just a reminder that support for Python 2.7 will end on January 1, 2020 (there will be no 2.8); all major scientific packages now support Python 3.x and there will be no further updates to the Python 2.x versions.
An increasing number of projects have pledged to drop support for Python 2.7 no later than 2020; these include pandas, RDKit, iPython, Matplotlib, NumPy, SciPy, BioPython, Psi4, scikit-learn, Tensorflow, Jupyter notebook and many more.
Time to update those old scripts and Jupyter notebooks.
CGRtools is a set of tools for processing reactions based on the Condensed Graph of Reaction (CGR) approach; details on GitHub https://github.com/cimm-kzn/CGRtools. Published in JCIM DOI.
- Read/write/convert formats: MDL .RDF and .SDF, SMILES, .MRV
- Standardize reactions and check structure validity.
- Produce CGRs.
- Perform subgraph searches.
- Build/correct molecules and reactions.
- Produce template-based reactions.
The stable version is available through PyPI:
pip install CGRtools
Install the CGRtools library DEV version for features that are not yet well tested:
pip install -U git+https://github.com/cimm-kzn/CGRtools.git@master#egg=CGRtools
There is also a tutorial using Jupyter notebook https://github.com/cimm-kzn/CGRtools/tree/master/tutorial
I was recently asked for a way to visualise HELM notation.
HELM (Hierarchical Editing Language for Macromolecules) enables the representation of a wide range of biomolecules such as proteins, nucleotides, antibody drug conjugates etc. whose size and complexity render existing small-molecule and sequence-based informatics methodologies impractical or unusable.
The RDKit provides limited support for HELM notation (currently peptide) and a simple Jupyter Notebook provides an easy interface as shown here
FPSim2 is a new tool for fast similarity search on big compound datasets (>100 million) being developed at ChEMBL. It was developed as a Python 3 library to support either in-memory or out-of-core fast similarity searches on such dataset sizes.
It is built using RDKit and can be installed using conda. It requires Python 3.6 and a recent version of RDKit.
I've written a couple of Jupyter notebooks to demonstrate its use.
You can read the full tutorial here, and download the notebooks.
Small molecules can potentially bind to a variety of biomolecular targets, and whilst counter-screening against a wide variety of targets is feasible it can be rather expensive and is probably only realistic once a compound has been identified as of particular interest. For this reason there is considerable interest in building computational models to predict potential interactions. With the advent of large data sets of well-annotated biological activity such as ChEMBL and BindingDB this has become possible.
ChEMBL 24 contains 15,207,914 activity data points on 12,091 targets and 2,275,906 compounds; BindingDB contains 1,454,892 binding data points for 7,082 protein targets and 652,068 small molecules.
These predictions may aid understanding of the molecular mechanisms underlying a molecule's bioactivity and predict potential side effects or cross-reactivity.
Whilst there are a number of sites that can be used to predict bioactivity data, I'm going to compare one site, Polypharmacology Browser 2 (PPB2) http://ppb2.gdb.tools, with two tools that can be downloaded to run the predictions locally: one based on Jupyter notebook models built by the ChEMBL group using ChEMBL data https://github.com/madgpap/notebooks/blob/master/targetpred21_demo.ipynb, and a more recent random forest model, PIDGIN. If you are using proprietary molecules it is unwise to use the online tools.
I'm constantly impressed by the expansion of Jupyter; it is rapidly becoming the first-choice platform for interactive computing.
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
Swift for TensorFlow is a new way to develop machine learning models. It gives you the power of TensorFlow directly integrated into the Swift programming language. With Swift, you can write the following imperative code, and Swift automatically turns it into a single TensorFlow Graph and runs it with the full performance of TensorFlow Sessions on CPU, GPU and TPU.
Requires MacOS 10.13.5 or later, with Xcode 10.0 beta or later
I always keep an eye out for the polls on KDnuggets. The latest one looks at Python editors and IDEs; over 1,900 people took part and the results are shown below (users could select up to 3). There is more detail in the linked page.
I've become a great fan of Jupyter, and not only for Python.
I've been using Jupyter notebooks for a little while but I only just recently found out that you can embed LaTeX or MathML into a notebook!
This notebook is just a series of examples of what can be done. You can embed equations inline or have them on a separate line in a Markdown text cell, or in a code cell by importing `Math` from `IPython.display` or invoking the `%%latex` cell magic.
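For example, in a Markdown cell the familiar LaTeX delimiters work directly:

```latex
Inline: $e^{i\pi} + 1 = 0$

Display:
$$\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}$$
```

The same equations can be rendered from a code cell with `from IPython.display import Math` and `Math(r"\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}")`.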
Recently ChEMBL was updated to version 24; the update contains:
- 2,275,906 compound records
- 1,828,820 compounds (of which 1,820,035 have mol files)
- 15,207,914 activities
- 1,060,283 assays
- 12,091 targets
- 69,861 documents
In addition, today they released the predictive models built on the updated database; they can be downloaded from the ChEMBL ftp server ftp://ftp.ebi.ac.uk/pub/databases/chembl/target_predictions
There are 1569 models.
A recent paper "The Catch-22 of Predicting hERG Blockade Using Publicly Accessible Bioactivity Data" DOI described a classification model for hERG activity. I was delighted to see that all the datasets used in the study, including the training and external datasets, and the models generated using these datasets were provided as individual data files (CSV) and Python Jupyter notebooks, respectively, on GitHub (https://github.com/AGPreissner/Publications).
The models were downloaded and the Random Forest Jupyter Notebooks (using RDKit) modified to save the generated model using pickle, and then another Jupyter notebook was created to access the stored model without the need to rebuild it each time. This notebook was exported as a Python script to allow command-line access, and Vortex scripts were created that allow the user to run the model within Vortex, import the results, and view the most significant features.
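The save-once/reload-later pattern is straightforward; in this sketch the dictionary is just a stand-in for a fitted scikit-learn model (any picklable object works the same way):

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model (e.g. a RandomForestClassifier); the
# pattern is identical for any picklable Python object.
model = {"name": "hERG_rf", "n_trees": 500}

path = os.path.join(tempfile.gettempdir(), "herg_model.pkl")
with open(path, "wb") as fh:
    pickle.dump(model, fh)      # save once, after training

with open(path, "rb") as fh:
    loaded = pickle.load(fh)    # reload later without rebuilding
print(loaded)
```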
All models and scripts are available for download.
Well after my last post about Swift and Jupyter a reader sent me link to the use of both Julia and Fortran programming languages in a Jupyter Notebook.
More information in this lecture Project Jupyter: Architecture and Evolution of an Open Platform for Modern Data Science by Fernando Perez.
Project Jupyter, evolved from the IPython environment, provides a platform for interactive computing that is widely used today in research, education, journalism and industry. The core premise of the Jupyter architecture is to provide tools for human-in-the-loop interactive computing. It provides protocols, file formats, libraries and user-facing tools optimized for the task of humans interactively exploring problems with the aid of a computer, combining natural and programming languages in a common computational narrative.
I'm a great fan of Jupyter Notebooks but I only ever use python.
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text
A recent post by Ray Yamamoto Hilton caught my eye who recently put together a little experiment to demonstrate using Swift 4.1 from within Jupyter Notebooks.
You can download a demo notebook here.
The RCSB Protein Data Bank is an absolutely invaluable resource that provides archive information about the 3D shapes of proteins, nucleic acids, and complex assemblies, helping scientists understand all aspects of biomedicine and agriculture, from protein synthesis to health and disease. Currently the PDB contains over 134,000 data files containing structural information on 42,547 distinct protein sequences, of which 37,600 are human sequences. They also provide a series of tools to search, view and analyse the data.
Downloading an individual PDB file is pretty trivial and can be done from the web page as shown in the image below. They also provide a Download Tool launched as a stand-alone application using the Java Web Start protocol. The tool is downloaded locally and must then be opened. I've found this a little temperamental and have had issues with Java versions and security settings.
Since I've been making extensive use of the web services to interact with RCSB I decided to explore the use of Python to download multiple files. I started off creating a Jupyter notebook using the web services provided by RCSB.
I've also used variations on this code to create a python script and a Vortex script.
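A minimal sketch of the idea; the helper names below are my own, and https://files.rcsb.org/download/ is the standard RCSB endpoint for individual entry files:

```python
import urllib.request

# Helper names are my own; the URL scheme is the standard RCSB
# file-download endpoint for individual entries.
def pdb_url(pdb_id):
    return "https://files.rcsb.org/download/{}.pdb".format(pdb_id.upper())

def fetch_pdb(pdb_id, path=None):
    """Download one PDB entry to a local file and return the path."""
    path = path or pdb_id.upper() + ".pdb"
    urllib.request.urlretrieve(pdb_url(pdb_id), path)
    return path

# Build download URLs for a small list of entries
for pdb_id in ["1crn", "4hhb"]:
    print(pdb_url(pdb_id))
```

Calling `fetch_pdb` in a loop over a list of IDs downloads multiple files in one go.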
I've become a great fan of Jupyter Notebooks as a way of modelling cheminformatics data, and I've published some of the notebooks here.
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.
In the predicting AMES activity notebook I also looked at the use of pickle to store the predictive model and then access it using a Jupyter notebook without the need to rebuild the model. Whilst a notebook is a nice way to access the predictive model it might also be useful to be able to access it from other applications or from the command line.
In this tutorial we look at providing command line access to the model and then incorporating it into a Vortex script.
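The command-line wrapper can be sketched as below; the argument names and the toy `predict` helper are my own, and a real script would featurise the SMILES (e.g. with RDKit) before calling the unpickled model's `predict`:

```python
import argparse
import pickle

def predict(model, smiles):
    # Toy stand-in: a real script would compute a fingerprint for
    # `smiles` and call model.predict() on it.
    return model.get(smiles, "unknown")

def build_parser():
    parser = argparse.ArgumentParser(description="Predict AMES activity")
    parser.add_argument("smiles", help="SMILES string of the query molecule")
    parser.add_argument("--model", default="ames_model.pkl",
                        help="path to the pickled model")
    return parser

def main(argv=None):
    args = build_parser().parse_args(argv)
    with open(args.model, "rb") as fh:
        model = pickle.load(fh)
    print(predict(model, args.smiles))

# Usage from the command line (hypothetical script name):
#   python predict_ames.py "CCO" --model ames_model.pkl
```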
I'm in the process of updating the Jupyter notebooks to Python 3 and I'm looking at what I can do to make sure other people can reproduce the results. At the moment I annotate the imported Python modules with version numbers in the Jupyter notebook. Finding the versions is a bit tedious and I was wondering if there was some way to automate this?
from rdkit import Chem #rdkit 2016.03.5
from rdkit.Chem import PandasTools
import pandas as pd #pandas==0.17.1
import pandas_ml as pdml #pandas-ml==0.4.0
from rdkit.Chem import AllChem, DataStructs
import numpy #numpy==1.12.0
from sklearn.model_selection import train_test_split #scikit-learn==0.18.1
import subprocess
from StringIO import StringIO
import pickle
import os
%matplotlib inline
This guide is a set of Jupyter notebooks intended to help researchers already familiar with molecular dynamics simulation learn how to use OpenMM in their research and software projects.
# For Mac OS X, substitute `MacOSX` for `Linux` below
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash ./Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda
export PATH=$HOME/miniconda/bin:$PATH
conda install --yes -c omnia -c conda-forge jupyter notebook openmm mdtraj nglview
There is a detailed document describing OpenMM here
OpenMM is a set of libraries that lets programmers easily add molecular simulation features to their programs, and an “application layer” that exposes those features to end users who just want to run simulations. Instructions for installation under MacOSX are here.
OpenMM works on Mac OS X 10.7 or later. OpenCL is supported on OS X 10.10.3 or later.
In the previous workflow I described docking a set of ligands with known activity into a target protein; in this workflow we will be using a set of ligands from the ZINC dataset to search for novel ligands. Once docked, the workflow moves on to finding vendors and selecting subsets for purchase.
Whilst high-throughput screening (HTS) has been the starting point for many successful drug discovery programs, the cost of screening, the lack of access to a large diverse sample collection, or the low throughput of the primary assay may preclude HTS as a starting point, and identification of a smaller selection of compounds with a higher probability of being a hit may be desired. Directed or virtual screening is a computational technique used in drug discovery research designed to identify potential hits for evaluation in primary assays. It involves the rapid in silico assessment of large libraries of chemical structures in order to identify those structures that are most likely to be active against a drug target. The in silico screen can be based on known ligand similarity or based on docking ligands into the desired binding site.
I've updated the description to give more information about preparing the target protein.
This is a recording of the March 2017 Global Health Compound Design meeting: a webinar demonstrating the use of Jupyter, the free iPython notebook, covering:
- How to get started
- Accessing Open Source Malaria data
- Calculating physicochemical properties and plotting
- Predicting AMES activity
I've now written a couple of Jupyter notebooks, and one of the issues that has come up is how to share the notebooks in a way that ensures the results will be reproducible in an environment where updates to components occur regularly.
Binder is a collection of tools for building and executing version-controlled computational environments that contain code, data, and interactive front ends, like Jupyter notebooks. It's 100% open source.
At a high level, Binder is designed to make the following workflow as easy as possible
- Users specify a GitHub repository
- Repository contents are used to build Docker images
- Containers are deployed on-demand in the browser on a cluster running Kubernetes
Common use cases include:
- sharing scientific work
- sharing journalism
- running tutorials and demos with minimal setup
- teaching courses
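In practice, a Binder-ready repository only needs the notebooks plus a dependency file such as `requirements.txt` or, for conda packages like RDKit, an `environment.yml`; the package list below is just an illustration:

```yaml
# environment.yml -- read by Binder when building the container image
name: notebook-env
channels:
  - conda-forge
dependencies:
  - python=3.6
  - rdkit
  - pandas
  - matplotlib
```

Pinning versions in this file is what makes the rebuilt environment, and hence the notebook results, reproducible.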
If you want to find out more have a look at this blog post by the developers.
I've been experimenting with the use of Jupyter Notebooks (aka iPython Notebooks) as an electronic lab notebook but also as a means to share computational models. The aim would be to see how easy it would be to share a model, together with the associated training data, and an explanation of how the model was built and how it can be used for novel molecules.
The Ames test is a widely employed method that uses bacteria to test whether a given chemical can cause mutations in the DNA of the test organism. More formally, it is a biological assay to assess the mutagenic potential of chemical compounds (PNAS 70(8): 2281–5, DOI).
In this first notebook a random forest model to predict AMES activity is described….
The Molecular Design Toolkit is an open source environment that aims to seamlessly integrate molecular simulation, visualization and cloud computing. It offers access to a large and still-growing set of computational modelling methods with a science-focused Python API that can be easily installed using pip. It is ideal for building into a Jupyter notebook. The API is designed to handle both small molecules and large biomolecular structures, molecular mechanics and QM calculations.
There are a series of Youtube videos describing some of the functionality in more details, starting with this introduction.
This blog post looks very interesting: a notebook environment for coding and data visualisation based on Jupyter (aka iPython) notebooks.
With nteract, you can create documents that contain executable code, textual content, and images, and convey a computational narrative. Unlike Jupyter, your documents are stand-alone, cross-platform desktop applications, providing a seamless desktop experience and offline usage.
More details are on GitHub.