Macs in Chemistry

Insanely great science


Open Source Python Data Science Libraries

When I wrote the article entitled A few thoughts on scientific software, one of the responses I got was that people did not know about the existence of open-source chemistry toolkits, so I thought I'd publish a page that would hopefully stop people reinventing the wheel. Here are a few open-source cheminformatics toolkits that I'm aware of.

As a follow-up I thought I'd put together a list of useful Python libraries for data science.

If you have installed Anaconda, a number of these packages will be preinstalled. However, the fastest way to obtain conda is to install Miniconda, a minimal version of Anaconda that includes only conda and its dependencies. You can then use

conda install

to install specific packages from the Anaconda repository. An alternative Python package manager is PIP; to install packages, use

pip install


pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. pandas provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. There are over 1300 contributors on GitHub.
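As a rough illustration of what working with "labeled" data looks like, here is a minimal sketch. The column names and values are made up for the example and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical measurements: names and numbers are invented for illustration.
df = pd.DataFrame(
    {
        "compound": ["aspirin", "aspirin", "caffeine", "caffeine"],
        "assay": ["A", "B", "A", "B"],
        "ic50_nM": [120.0, 95.0, 300.0, 410.0],
    }
)

# Group rows by a label column and average the values in each group.
mean_ic50 = df.groupby("compound")["ic50_nM"].mean()

# Reshape from long to wide form: one row per compound, one column per assay.
wide = df.pivot(index="compound", columns="assay", values="ic50_nM")

print(mean_ic50)
print(wide)
```

Rows and columns are addressed by label rather than by position, which is most of what makes this kind of table handling readable.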

It can be installed using conda or PIP

conda install pandas


pip install pandas

pandas requires: NumPy 1.9.0 or higher, python-dateutil 2.5.0 or higher, pytz 2011k or higher.

There is extensive documentation

License: Open source - BSD license
Source code:
Mailing list: !forum/pydata


Modin is a library designed to accelerate pandas by automatically distributing the computation across all of the system’s available CPU cores. Modin uses Ray to provide an effortless way to speed up your pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Modin provides seamless integration and compatibility with existing pandas code. Even using the DataFrame constructor is identical. Modin is a DataFrame designed for datasets from 1MB to 1TB+.

It can be installed using PIP

pip install modin

If you don't have Ray or Dask installed, you will need to install Modin with one of the targets:

pip install "modin[ray]" # Install Modin dependencies and Ray to run on Ray
pip install "modin[dask]" # Install Modin dependencies and Dask to run on Dask
pip install "modin[all]" # Install all of the above

Currently, Modin depends on pandas version 0.23.4.

License: Apache 2.0
Source code:
Mailing list: !forum/modin-dev


NumPy is the fundamental package for scientific computing with Python. It contains, among other things, a powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, and useful linear algebra, Fourier transform, and random number capabilities. There are over 700 contributors on GitHub.
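The capabilities listed above can be sketched in a few lines. This is only an illustrative example with made-up arrays, touching broadcasting, linear algebra, the Fourier transform and random numbers in turn.

```python
import numpy as np

# Broadcasting: subtract each row's mean from that row.
a = np.arange(9, dtype=float).reshape(3, 3)
row_means = a.mean(axis=1, keepdims=True)  # shape (3, 1)
centered = a - row_means                   # (3, 3) minus (3, 1) broadcasts across columns

# Linear algebra: solve the 2x2 system m @ x = b.
m = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(m, b)

# Fourier transform: a sine wave with 4 cycles over 64 samples.
signal = np.sin(2 * np.pi * 4 * np.arange(64) / 64)
spectrum = np.abs(np.fft.rfft(signal))
peak = int(np.argmax(spectrum))  # index of the strongest frequency bin

# Random numbers: a seeded generator gives reproducible draws.
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=1000)

print(x, peak, samples.mean())
```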

It can be installed using conda

conda install numpy

It can be installed using PIP

pip install --user numpy

There is extensive documentation

License: BSD
Source Code :
Mailing List:


SciPy (pronounced "Sigh Pie") is open-source software for mathematics, science, and engineering. It includes modules for statistics, optimization, integration, linear algebra, Fourier transforms, signal and image processing, ODE solvers, and more. SciPy depends on NumPy, which provides convenient and fast N-dimensional array manipulation. SciPy is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines such as routines for numerical integration and optimization. There are over 700 contributors on GitHub.
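To give a flavour of the numerical integration and optimization routines mentioned above, here is a small sketch. The functions being integrated and minimised are arbitrary choices for the example.

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: the integral of sin(x) from 0 to pi, which is exactly 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Optimization: find the minimum of a simple parabola centred at x = 2.
def parabola(x):
    return (x - 2.0) ** 2 + 1.0

result = optimize.minimize_scalar(parabola)

print(area, abs_error)
print(result.x, result.fun)
```

quad returns both the estimate and an estimate of the absolute error, which is handy for checking whether a result can be trusted.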

It can be installed using conda

conda install scipy

It can be installed using PIP

pip install --user scipy

SciPy requires the following software installed for your platform: Python 2.7 or >= 3.4, NumPy >= 1.8.2

There is extensive documentation

License: BSD
Source code:
Mailing list:

Scikit-learn is written in Python and is a library for machine learning built on NumPy, SciPy and matplotlib. It provides a very wide variety of tools for data mining and data analysis with a focus on machine learning. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN. There are over 12,000 contributors on GitHub. The project was started in 2007 by David Cournapeau as a Google Summer of Code project.

Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12 (2011) 2825-2830
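As a minimal sketch of the usual fit/score workflow, here is a random forest classifier trained on synthetic data. The dataset is generated on the fly, and the sizes and parameters are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data set: 200 samples, 8 features (illustrative only).
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# Hold back a quarter of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a random forest and measure accuracy on the held-out data.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print(f"Test accuracy: {accuracy:.2f}")
```

Most estimators in the library share the same fit, predict and score methods, so swapping in a different algorithm usually means changing only the line that creates the model.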

Scikit-learn requires: Python (>= 2.7 or >= 3.4), NumPy (>= 1.8.2), SciPy (>= 0.13.3).

It should be noted Scikit-learn 0.20 is the last version to support Python 2.7 and Python 3.4. Scikit-learn 0.21 will require Python 3.5 or newer.

It can be installed using conda

conda install scikit-learn

or PIP

pip install -U scikit-learn

There is extensive documentation and a number of tutorials.

Also worth looking at is sklearn-pandas, a bridge between Scikit-learn's machine learning methods and pandas-style DataFrames.

License: Open source, commercially usable - BSD license
Source code:
Mailing list:
Also Stack Overflow:


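PyTorch is a Python package that provides two high-level features: tensor computation (like NumPy) with strong GPU acceleration, and deep neural networks built on a tape-based autograd system.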

You can reuse your favorite Python packages such as NumPy, SciPy and Cython to extend PyTorch when needed. It is currently in an early-release beta, so expect some adventures and rough edges. There are over 800 contributors on GitHub.

It can be installed using conda

conda install pytorch torchvision -c pytorch

or PIP

pip3 install torch torchvision

Or built from source. You will need to build from source if you want CUDA support.

License: BSD-style license.
Source code:
Mailing list: https://discuss.pytorch.org


TensorFlow™ is an open source software library for high performance numerical computation. Its flexible architecture allows easy deployment of computation across a variety of platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge devices. There are over 1700 contributors on GitHub.

To install the current release for CPU-only:

pip install tensorflow

Use the GPU package for CUDA-enabled GPU cards:

pip install tensorflow-gpu

Docker images are also available

License: Apache License 2.0
Source code:
Mailing list: !forum/discuss

Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. There are over 700 contributors on GitHub.

It can be installed using PIP

pip install keras

Or built from source

There is extensive documentation.

You can use Sequential Keras models (single-input only) as part of your Scikit-learn workflow via the wrappers.

License: MIT license.
Source code:
Mailing list: !forum/keras-users


XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples. There are over 300 contributors on GitHub.

XGBoost: A Scalable Tree Boosting System DOI

It can be installed using PIP

First, obtain gcc-7 with Homebrew to enable multi-threading (i.e. using multiple CPU threads for training). The default Apple Clang compiler does not support OpenMP, so using the default compiler would disable multi-threading.

pip3 install xgboost

Or built from source

License: Apache-2 license.
Source code:
Mailing list:


statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration. There are 164 contributors on GitHub.

“Statsmodels: Econometric and statistical modeling with python.” Proceedings of the 9th Python in Science Conference. 2010. PDF

It can be installed using conda or PIP

conda install statsmodels


pip install --upgrade --no-deps statsmodels

Or you can build from source.

There is extensive Documentation

License: open source Modified BSD (3-clause)
Source code:
Mailing list: !forum/pystatsmodels


pyjanitor is a project that extends Pandas with a verb-based API, providing convenient data cleaning routines for repetitive tasks.

It can be installed using conda or PIP

conda install pyjanitor -c conda-forge


pip install pyjanitor

There is extensive Documentation, including a section on cleaning chemistry data.


Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits. There are nearly 800 contributors on GitHub. NOTE: The current master branch is now Python 3 only. Python 2 support is being dropped.
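For use in a plain Python script, a minimal sketch might look like the following. It assumes the non-interactive Agg backend so that it runs without a display, and writes the PNG to an in-memory buffer rather than a named file; both are choices made for this example.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no window is opened
import matplotlib.pyplot as plt
import numpy as np

# Two simple curves to plot.
x = np.linspace(0.0, 2.0 * np.pi, 200)

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.legend()

# Render the figure as a PNG into memory; pass a filename instead to save to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=150)
png_bytes = buffer.getvalue()
plt.close(fig)
```

In a Jupyter notebook the backend line is not needed and the figure is displayed inline.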

It can be installed using PIP

pip install -U matplotlib

Matplotlib requires the following dependencies:

Python (>= 3.5), FreeType (>= 2.3), libpng (>= 1.2), NumPy (>= 1.10.0), setuptools, cycler (>= 0.10.0), dateutil (>= 2.1), kiwisolver (>= 1.0.0), pyparsing

License: Python Software Foundation (PSF) license.
Source code:
Mailing list:


Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics and is closely integrated with pandas.

There is a really comprehensive set of tutorials

It can be installed using conda or PIP

conda install seaborn


pip install seaborn

Seaborn requires: numpy (>= 1.9.3), scipy (>= 0.14.0), matplotlib (>= 1.4.3), pandas (>= 0.15.2)

License: BSD 3-clause license
Source code:
Mailing list:


The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. While Jupyter runs code in many programming languages, Python is a requirement (Python 3.3 or greater, or Python 2.7) for installing the Jupyter Notebook itself.

It can be installed using conda or PIP

conda install jupyter


pip install jupyter

There is extensive documentation

License: modified BSD license

I thought I'd also mention collections. It is in the standard library, but I seem to use defaultdict regularly.
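For anyone who has not come across it, here is a short sketch of the two patterns defaultdict is most often used for, grouping and counting. The element and compound pairs are invented for the example.

```python
from collections import defaultdict

# Hypothetical (element, compound) pairs, purely for illustration.
pairs = [("C", "methane"), ("O", "water"), ("C", "ethanol"), ("O", "ethanol")]

# Grouping: a missing key starts as an empty list, so no existence check is needed.
by_element = defaultdict(list)
for element, compound in pairs:
    by_element[element].append(compound)

# Counting: a missing key starts at 0.
counts = defaultdict(int)
for element, _ in pairs:
    counts[element] += 1

print(dict(by_element))
print(dict(counts))
```

With a plain dict each loop would need an explicit check for the key, or a call to setdefault, before appending or incrementing.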

Last Updated 9 January 2020