Macs in Chemistry

Insanely great science

 

Open Source Python Data Science Libraries

When I wrote the article entitled A few thoughts on scientific software, one of the responses I got was that people did not know about the existence of open-source chemistry toolkits, so I thought I'd publish a page that would hopefully stop people reinventing the wheel. Here are a few open-source cheminformatics toolkits that I'm aware of.

As a follow-up, I thought I'd put together a list of useful Python libraries for data science.

If you have installed Anaconda, a number of these packages will be preinstalled. However, the fastest way to obtain conda is to install Miniconda, a minimal version of Anaconda that includes only conda and its dependencies. You can then use

conda install

to install specific packages from the Anaconda repository. An alternative Python package manager is PIP (https://pypi.org/project/pip/); to install packages use

pip install
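Once a few packages are installed, it can be handy to check which ones are actually available in the current environment. The following is only a minimal sketch using importlib.metadata from the standard library (Python 3.8 or newer); the package list is just an illustrative selection from this page, not a required set.

```python
from importlib import metadata

# Distribution names as pip knows them; this list is only an illustration.
PACKAGES = ["pandas", "numpy", "scipy", "scikit-learn", "matplotlib"]


def installed_versions(packages):
    """Return a dict mapping each name to its version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

Anything reported as not installed can then be added with conda install or pip install as described above.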

Pandas

https://pandas.pydata.org

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive, and it aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. There are over 1300 contributors on GitHub.
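To give a flavour of what working with "labeled" data looks like, here is a minimal sketch. The column names and values are made up purely for illustration.

```python
import pandas as pd

# Made-up measurements, for illustration only.
df = pd.DataFrame(
    {
        "compound": ["A", "A", "B", "B"],
        "assay": ["x", "y", "x", "y"],
        "value": [1.0, 3.0, 2.0, 6.0],
    }
)

# Group rows by the "compound" label and average the numeric column.
mean_by_compound = df.groupby("compound")["value"].mean()
print(mean_by_compound)

# Reshape into a compound-by-assay table.
table = df.pivot(index="compound", columns="assay", values="value")
print(table)
```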

It can be installed using conda

conda install pandas

Or PIP

pip install pandas

pandas requires: NumPy 1.9.0 or higher, python-dateutil 2.5.0 or higher, and pytz 2011k or higher.

There is extensive documentation

License: Open source, BSD license
Source code: https://github.com/pandas-dev/pandas
Mailing list: https://groups.google.com/forum/?fromgroups#!forum/pydata
Stackoverflow: https://stackoverflow.com/questions/tagged/pandas

NumPy

https://www.numpy.org

NumPy is the fundamental package for scientific computing with Python. It contains, among other things, a powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, and useful linear algebra, Fourier transform, and random number capabilities. There are over 700 contributors on GitHub.
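As a rough illustration of the array object, broadcasting, and the linear algebra routines mentioned above, here is a minimal sketch with made-up numbers.

```python
import numpy as np

# A small 2 x 3 array of made-up values.
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# Broadcasting: the shape-(3,) vector of column means is subtracted
# from every row of the shape-(2, 3) array without an explicit loop.
col_means = data.mean(axis=0)
centered = data - col_means
print(centered)

# Linear algebra: solve the system a @ x = b.
a = np.array([[2.0, 0.0],
              [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(a, b)
print(x)
```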

It can be installed using conda

conda install numpy

It can be installed using PIP

pip install --user numpy

There is extensive documentation

License: BSD
Source code: https://github.com/numpy/numpy
Mailing List: https://mail.python.org/mailman/listinfo/numpy-discussion

SciPy

https://www.scipy.org/index.html

SciPy (pronounced "Sigh Pie") is open-source software for mathematics, science, and engineering. It includes modules for statistics, optimization, integration, linear algebra, Fourier transforms, signal and image processing, ODE solvers, and more. SciPy depends on NumPy, which provides convenient and fast N-dimensional array manipulation. SciPy is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines such as routines for numerical integration and optimization. There are over 700 contributors on GitHub.
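The numerical integration and optimization routines mentioned above can be tried in a few lines. This is only a minimal sketch; the integrand and the function being minimized are arbitrary choices for illustration.

```python
import numpy as np
from scipy import integrate, optimize

# Numerically integrate sin(x) from 0 to pi (the exact answer is 2).
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)
print(area, abs_error)


def parabola(x):
    """A simple one-dimensional function with its minimum at x = 2."""
    return (x - 2.0) ** 2


# Find the minimum of the one-dimensional function.
result = optimize.minimize_scalar(parabola)
print(result.x)
```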

It can be installed using conda

conda install scipy

It can be installed using PIP

pip install --user scipy

SciPy requires the following software installed for your platform: Python 2.7 or >= 3.4, and NumPy >= 1.8.2.

There is extensive documentation

License: BSD
Source code: https://github.com/scipy/scipy
Mailing list: https://scipy.org/scipylib/mailing-lists.html

Scikit-learn

https://scikit-learn.org/stable/

Scikit-learn is written in Python and is a library for machine learning built on NumPy, SciPy and matplotlib. It provides a very wide variety of tools for data mining and data analysis with a focus on machine learning. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN. There are over 12,000 contributors on GitHub. The project was started in 2007 by David Cournapeau as a Google Summer of Code project.
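As an illustration of the typical fit and score workflow, here is a minimal sketch that trains a random forest classifier on the iris dataset bundled with scikit-learn. The split size, number of trees, and random seeds are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the small iris dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold back a quarter of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a random forest and report the accuracy on the held-out samples.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Most other scikit-learn estimators follow the same fit, predict, and score pattern, so swapping in a different classifier usually only means changing the line that creates the model.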

Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12 (2011) 2825-2830

Scikit-learn requires: Python (>= 2.7 or >= 3.4), NumPy (>= 1.8.2), SciPy (>= 0.13.3).

It should be noted Scikit-learn 0.20 is the last version to support Python 2.7 and Python 3.4. Scikit-learn 0.21 will require Python 3.5 or newer.

It can be installed using conda

conda install scikit-learn

or PIP

pip install -U scikit-learn

There is extensive documentation and a number of tutorials.

Also worth looking at is sklearn-pandas, a bridge between Scikit-Learn's machine learning methods and pandas-style DataFrames.

License: Open source, commercially usable, BSD license
Source code: https://github.com/scikit-learn/scikit-learn
Mailing list: https://mail.python.org/mailman/listinfo/scikit-learn
Stackoverflow: https://stackoverflow.com/questions/tagged/scikit-learn

PyTorch

https://pytorch.org

PyTorch is a Python package that provides two high-level features: tensor computation (like NumPy) with strong GPU acceleration, and deep neural networks built on a tape-based autograd system.

You can reuse your favorite Python packages such as NumPy, SciPy and Cython to extend PyTorch when needed. It is currently in an early-release beta, so expect some adventures and rough edges. There are over 800 contributors on GitHub.

It can be installed using conda

conda install pytorch torchvision -c pytorch

or PIP

pip3 install torch torchvision

Or it can be built from source. You will need to build from source if you want CUDA support.

License: BSD-style license
Source code: https://github.com/pytorch/pytorch
Mailing list: https://discuss.pytorch.org

TensorFlow

https://www.tensorflow.org

TensorFlow™ is an open source software library for high performance numerical computation. Its flexible architecture allows easy deployment of computation across a variety of platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge devices. There are over 1700 contributors on GitHub.

To install the current release for CPU-only:

pip install tensorflow

Use the GPU package for CUDA-enabled GPU cards:

pip install tensorflow-gpu

Docker images are also available https://hub.docker.com/r/tensorflow/tensorflow/.

License: Apache License 2.0
Source code: https://github.com/tensorflow/tensorflow
Mailing List: https://groups.google.com/a/tensorflow.org/forum/#!forum/discuss
Stackoverflow: https://stackoverflow.com/questions/tagged/tensorflow

Keras

https://keras.io

Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. There are over 700 contributors on GitHub.

It can be installed using PIP

pip install keras

Or built from source

There is extensive documentation.

You can use Sequential Keras models (single-input only) as part of your Scikit-Learn workflow via the wrappers https://keras.io/scikit-learn-api/.

License: MIT license
Source code: https://github.com/keras-team/keras
Mailing list: https://groups.google.com/forum/#!forum/keras-users

XGBoost

https://xgboost.ai

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples. There are over 300 contributors on GitHub.

XGBoost: A Scalable Tree Boosting System DOI

It can be installed using PIP

First, obtain gcc-7 with Homebrew (https://brew.sh/) to enable multi-threading (i.e. using multiple CPU threads for training). The default Apple Clang compiler does not support OpenMP, so using the default compiler would disable multi-threading.

pip3 install xgboost

Or built from source

License: Apache-2 license
Source code: https://github.com/dmlc/xgboost
Mailing list: https://discuss.xgboost.ai

statsmodels

https://www.statsmodels.org/stable/

statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration. There are 164 contributors on GitHub.

Statsmodels: Econometric and statistical modeling with Python. Proceedings of the 9th Python in Science Conference. 2010.

It can be installed using conda

conda install statsmodels

Or PIP

pip install --upgrade --no-deps statsmodels

Or you can build from source.

There is extensive documentation.

License: Open source, Modified BSD (3-clause)
Source code: https://github.com/statsmodels/statsmodels
Mailing list: https://groups.google.com/forum/#!forum/pystatsmodels

Matplotlib

https://matplotlib.org

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits. There are nearly 800 contributors on GitHub. NOTE: The current master branch is now Python 3 only. Python 2 support is being dropped.
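A minimal sketch of producing a figure from a script is shown below. It selects the non-interactive Agg backend so that it also runs on a machine without a display, and writes the PNG to an in-memory buffer; passing a filename to savefig instead would write the figure to disk.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 200)

# Draw two labeled curves on a single set of axes.
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Render the figure as PNG data held in memory.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=150)
plt.close(fig)
```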

It can be installed using PIP

pip install -U matplotlib

Matplotlib requires the following dependencies:

Python (>= 3.5), FreeType (>= 2.3), libpng (>= 1.2), NumPy (>= 1.10.0), setuptools, cycler (>= 0.10.0), dateutil (>= 2.1), kiwisolver (>= 1.0.0), pyparsing

License: Python Software Foundation (PSF) license.
Source code: https://github.com/matplotlib/matplotlib
Mailing list: matplotlib-users@python.org

Seaborn

https://seaborn.pydata.org

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics, and it is closely integrated with pandas.

There is a really comprehensive set of tutorials

It can be installed using conda

conda install seaborn

Or PIP

pip install seaborn

Seaborn requires: numpy (>= 1.9.3), scipy (>= 0.14.0), matplotlib (>= 1.4.3), pandas (>= 0.15.2)

License: BSD 3-clause license
Source code: https://github.com/mwaskom/seaborn
Stackoverflow: https://stackoverflow.com/questions/tagged/seaborn

Jupyter

http://jupyter.org

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. While Jupyter runs code in many programming languages, Python is a requirement (Python 3.3 or greater, or Python 2.7) for installing the Jupyter Notebook itself.

It can be installed using conda

conda install jupyter

Or PIP

pip install jupyter

There is extensive documentation

License: modified BSD license
Stackoverflow: https://stackoverflow.com/questions/tagged/jupyter

I thought I'd also mention collections. It is in the standard library, but I seem to use defaultdict regularly.
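For anyone who has not come across it, collections.defaultdict creates a default value the first time a missing key is accessed, which removes the need to check whether a key already exists. A minimal sketch, using a made-up list of names:

```python
from collections import defaultdict

# Made-up names, for illustration only.
names = ["carbon", "oxygen", "chlorine", "nitrogen", "neon"]

# Group names by first letter; missing keys start as an empty list.
by_letter = defaultdict(list)
for name in names:
    by_letter[name[0]].append(name)

# Count names per first letter; missing keys start at 0.
counts = defaultdict(int)
for name in names:
    counts[name[0]] += 1

print(dict(by_letter))
print(dict(counts))
```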

Last Updated 29 November 2018