Macs in Chemistry

Insanely Great Science

Machine learning

Scikit-LLM: Sklearn Meets Large Language Models

I just stumbled across this and I thought I’d share it.

Seamlessly integrate powerful language models like ChatGPT into scikit-learn for enhanced text analysis tasks.

It can be installed using pip:

pip install scikit-llm

Full details about the project are on GitHub

You might also be interested in:

FALCON: A Lightweight AutoML Library

Falcon is a lightweight Python library that allows you to train production-ready machine learning models in a single line of code.

Falcon is a simple and lightweight AutoML library designed for people who want to train a model on a custom dataset in an instant, even without specific data-science knowledge. Simply give Falcon your dataset and specify which feature you want the ML model to predict. Falcon will do the rest!

Falcon allows the trained models to be immediately used in production by saving them in the widely used ONNX format. No need to write custom code to save complicated models to ONNX anymore!

pip install falcon-ml

Full details are on GitHub


In a recent post Pat Walters highlighted the use of molfeat in a google colab notebook

I thought I'd also mention datamol, an open-source toolkit that simplifies molecular processing and featurization workflows for ML scientists in drug discovery.

Cheminformatics support is all built upon the open-source toolkit RDKit. It can be installed using conda

conda install -c conda-forge datamol

Or pip

pip install datamol

The latest version (0.9) appears to need Python 3.9 and an RDKit version between 2022.03 and 2022.09.

There is a comprehensive series of tutorials and extensive documentation.

License is Apache version 2.0.

If you would like to contribute details are on GitHub


6th RSC-BMCS/RSC-CICAG Artificial Intelligence in Chemistry

The dates for the 6th RSC-BMCS/RSC-CICAG Artificial Intelligence in Chemistry meeting are now set; be sure to put them in your calendar. The call for oral and poster abstracts will be opening very soon, as will registration. This will be a hybrid meeting with both in-person and virtual attendance options. #AIChem23.


Spread the word; this is always a fantastic meeting and there are already a number of great speakers in place.

Confirmed Speakers:

Andrew White, University of Rochester and VIAL, US
Kathryn Furnival, AstraZeneca, UK
Laksh Aithani, Charm Therapeutics, UK
Michael Bronstein, University of Oxford, UK
Michelle L. Gill, NVIDIA, US
Noor Shaker, Glamorous AI and X-Chem, UK

Conference website is



I first mentioned alvaScience in a review of alvaDesc I wrote in 2019. Since then I've mentioned new products as they have been launched, but only recently have I gone back to the website to look at things in more detail.

At alvaScience, we are constantly exploring and implementing the most promising and innovative technologies in our software tools, which makes them a leading choice for QSAR and other cheminformatics research.

alvaModel is an interesting software tool to create Quantitative Structure Activity/Property Relationship (QSAR/QSPR) models using the descriptors and fingerprints calculated in alvaDesc. It comes with a number of very useful tools for handling data and descriptors, such as feature reduction, together with a range of machine learning methods.

Regression models

  • Ordinary Least Squares (OLS) model
  • Partial Least Squares (PLS) model
  • KNN regression model
  • Support Vector Machine (SVM) model
  • Consensus model defined as the arithmetic mean of the values predicted by the selected models

Classification models

  • Linear and Quadratic Discriminant Analysis (LDA/QDA) model
  • Partial Least Squares Discriminant Analysis (PLS-DA) model
  • KNN classification model
  • Support Vector Machine (SVM) model
  • Consensus model defined by assigning the class based on the majority of the values predicted by the selected models

Whilst building models is one thing, being able to deploy them easily is something else; alvaRunner helps with this. alvaRunner can be accessed via the command line, but I suspect many users will use the graphical interface. Using the GUI, for every imported molecule you can see the predicted targets and whether the molecule is inside or outside the model's defined Applicability Domain, and you can sort and filter any column by right-clicking the corresponding column header.


alvaScience will be at the RSC-SCI Workshop on Computational Tools for Drug Discovery 2022; if you would like to try it out, why not come along?


Setting up ML and AI tools on Apple Silicon

I've had a number of questions about setting up a machine learning/artificial intelligence environment on an Apple Silicon Mac, so I've tried to write a step-by-step guide.

Setting up ML and AI tools on Apple Silicon, using Homebrew and conda to install and manage compatibility and dependencies.

I've also created a .yml file that you can use instead of going through all the steps.
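For illustration, such a file might look like the following; the package list here is my own guess at a minimal starting point, not the contents of the file mentioned above.

```yaml
# Hypothetical minimal environment.yml for an Apple Silicon ML setup;
# create the environment with: conda env create -f environment.yml
name: ml-apple-silicon
channels:
  - conda-forge
dependencies:
  - python=3.9
  - numpy
  - pandas
  - scikit-learn
  - jupyter
  - pip
  - pip:
      - tensorflow-macos
      - tensorflow-metal
```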

There are a couple of example Jupyter notebooks that give a starting point for trying things out.

I'm very much aware that this is a bit of a moving target at the moment so comments/suggestions are much appreciated.


1st EUOS/SLAS Joint Challenge: Compound Solubility

The latest kaggle challenge is up.

Develop new methods to predict compound solubility based on chemical structure.

EU-OPENSCREEN ERIC and SLAS challenge you to develop a reliable algorithm that can predict the solubility of a small molecule, an essential feature of all biologically active compounds. EU-OPENSCREEN ERIC provides a high-quality data set of experimentally measured aqueous solubility of about 100,000 small molecules, which was produced at an EU-OPENSCREEN ERIC high throughput screening partner site. 70,000 of these molecules will be available for download on Kaggle, and the remaining 30,000 compounds will be withheld for prediction.

Full details are here


PyTorch on Apple Silicon


Latest nightly build.


Performance of PyTorch on Apple Silicon


A really useful blog post on PyTorch on Apple Silicon


PyTorch on Apple Silicon


PyTorch is now available on Apple Silicon

In collaboration with the Metal engineering team at Apple, we are excited to announce support for GPU-accelerated PyTorch training on Mac. Until now, PyTorch training on Mac only leveraged the CPU, but with the upcoming PyTorch v1.12 release, developers and researchers can take advantage of Apple silicon GPUs for significantly faster model training. This unlocks the ability to perform machine learning workflows like prototyping and fine-tuning locally, right on Mac.

To get started, just install the latest Preview (Nightly) build on your Apple silicon Mac running macOS 12.3 or later with a native version (arm64) of Python.
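Once installed, selecting the new backend follows the usual PyTorch device pattern. A minimal sketch (API as documented for PyTorch 1.12), falling back to the CPU when MPS is unavailable:

```python
# Device-selection sketch for the MPS backend (PyTorch 1.12+), falling back
# to the CPU when MPS is not available on the current machine.
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

x = torch.rand(3, 3, device=device)  # allocated on the GPU when MPS is active
y = (x @ x).cpu()                    # move results back to the CPU when needed
print(device, y.shape)
```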


AI/ML on Apple Silicon


A GitHub repository giving details of how to set up an Apple M1 machine for data science, which includes a series of test scripts for benchmarking.

There is an M1 Max vs RTX 3070 TensorFlow performance comparison here.


AI3SD Autumn Seminar Series


AI3SD have just announced the Autumn Seminar Series!

The event on the 3rd November is now open for registration: AI & ML 4 Drugs & Materials - 13:00-15:45

This event consists of three talks: 

Combining robotics and Machine Learning for accelerated drug discovery – Dr Tom Fleming (Arctoris)

Abstract: Artificial intelligence has an increasing impact on drug discovery and development, offering opportunities to identify novel targets, hits, and lead-like compounds in accelerated timeframes. However, the success of any AI/ML model depends on the quality of the input data, and the speed with which in silico predictions can be validated in vitro. The talk will cover laboratory automation and robotics and how the benefits they offer in terms of quality and speed of data generation synergise with AI/ML-powered drug discovery approaches. The talk will cover some of the general trends in the industry, and also highlight successfully implemented case studies that show how the combination of robotics and AI/ML leads to accelerated project timelines and superior research outputs.

Bio: Tom Fleming MChem is the COO of biotech platform company Arctoris, which he co-founded in Oxford in 2016. Tom's background is in cancer research, having worked in academia as well as at leading CROs and pharmaceutical corporations. A chemical biologist by training, he has unique insights into preclinical drug discovery, including the critical steps from target identification and high-throughput screening up to lead optimization. Tom was a Fellow of the Royal Commission of 1851 at the University of Oxford, and is an SME Leader of the Royal Academy of Engineering.

Machine Learning and AI for Drug Design – Professor Ola Engkvist (AstraZeneca & Chalmers University)

Abstract: Artificial Intelligence has become impactful during the last few years in chemistry and the life sciences, pushing the scientific boundaries forward as exemplified by the recent success of AlphaFold2. In this presentation I will provide an overview of how AI has impacted drug design in the last few years, where we are now, and what progress we can reasonably expect in the coming years. The presentation will focus on deep learning based molecular de novo design; however, aspects of synthesis prediction, molecular property prediction and chemistry automation will also be covered.

Bio: Dr Ola Engkvist is head of Molecular AI in Discovery Sciences, AstraZeneca R&D. He did his PhD in computational chemistry at Lund University, followed by a postdoc at Cambridge University. After working for two biotech companies he joined AstraZeneca in 2004. He currently leads the Molecular AI department, where the focus is to develop novel methods for ML/AI in drug design, productionalize the methods and apply them to AstraZeneca's small-molecule drug discovery portfolio. His main research interests are deep learning based molecular de novo design, synthetic route prediction and large-scale molecular property prediction. He has published over 100 peer-reviewed scientific publications. He is adjunct professor in machine learning and AI for drug design at Chalmers University of Technology and a trustee of the Cambridge Crystallographic Data Centre.

Accelerating design of organic materials with machine learning and AI – Professor Olexandr Isayev (Carnegie Mellon University)

Abstract: Deep learning is revolutionizing many areas of science and technology, particularly in natural language processing, speech recognition, and computer vision. In this talk, we will provide an overview of the latest developments in machine learning and AI methods and their application to the problem of drug discovery and development at Isayev's Lab at CMU. We identify several areas where existing methods have the potential to accelerate materials research and disrupt more traditional approaches. First, we will present a deep learning model that approximates the solution of the Schrodinger equation. We introduce the AIMNet-NSE (Neural Spin Equilibration) architecture, which can predict molecular energies for an arbitrary combination of molecular charge and spin multiplicity. The AIMNet-NSE model allows one to fully bypass QM calculations and derive the ionization potential, electron affinity, and conceptual Density Functional Theory quantities like electronegativity, hardness, and condensed Fukui functions. We show that these descriptors, along with learned atomic representations, can be used to model chemical reactivity through an example of regioselectivity in electrophilic aromatic substitution reactions. Second, we propose a novel ML-guided materials discovery platform that combines synergistic innovations in automated flow synthesis and automated machine learning (AutoML) method development. A software-controlled, continuous polymer synthesis platform enables rapid iterative experimental–computational cycles that resulted in the synthesis of hundreds of unique copolymer compositions within a multi-variable compositional space. The non-intuitive design criteria identified by ML, accomplished by exploring less than 0.9% of the overall compositional space, upended conventional wisdom in the design of 19F MRI agents and led to the identification of >10 copolymer compositions that outperformed state-of-the-art materials.

Bio: Olexandr Isayev is an Assistant Professor at the Department of Chemistry at Carnegie Mellon University. In 2008, Olexandr received his Ph.D. in computational chemistry. He was a Postdoctoral Research Fellow at Case Western Reserve University and a scientist at a government research lab. During 2016-2019 he was a faculty member at the UNC Eshelman School of Pharmacy, the University of North Carolina at Chapel Hill. Olexandr received the "Emerging Technology Award" from the American Chemical Society (ACS) and the GPU computing award from NVIDIA. The research in his lab focuses on connecting artificial intelligence (AI) with chemical sciences.


Machine Learning in Chemistry meeting


This looks like a really impressive line-up of speakers.

Machine Learning in Chemistry: Friday, Oct 29, 11:00 AM - 1:00 PM EDT. Details and registration.

11:00 - 11:30 AM A Star Wars character beats Quantum Chemistry! A neural network accelerating molecular calculations. Adrian Roitberg, University of Florida

11:30 AM - 12:00 PM Machine learning energy gaps of molecules in the condensed phase for linear and nonlinear optical spectroscopy. Christine Isborn, UC Merced

12:00 - 12:30 PM Accelerated molecular design and synthesis for drug discovery. Connor Coley, MIT

12:30 - 1:00 PM More than mimicry? The challenges of teaching chemistry to deep models. Brett Savoie, Purdue University


AI3SD & RSC-CICAG Protein Structure Prediction Conference


Registration is now open for the AI3SD & RSC-CICAG Protein Structure Prediction Conference. This online event looks like it will be a brilliant meeting with a fantastic lineup of speakers. June 16 @ 9:45 am - June 17 @ 5:00 pm. Free.

Registration here Eventbrite Link:

The challenge of protein structure prediction has advanced significantly in recent years, yet translation into impact, particularly in drug discovery, remains open. Furthermore, while we as a community have advanced in predicting protein structures, these offer only static snapshots and do not yet effectively consider protein dynamics and conformational change. Bringing together scientists working in this area, and those who work with the resulting data, this conference is intended as a pulse check on the status of the field and where we will start seeing impact and improvements for human benefit. The two days will contain a number of talks from speakers who are key opinion leaders in the field, together with an opportunity to present short talks and posters to a wider audience. Day 1 will finish with an online social event (separate links will be sent out to register for this closer to the time) and Day 2 will close with a panel discussion by the speakers, which is intended to be provocative.

Current invited speakers include: Professor John Moult (University of Maryland), Dr Chris De Graaf (Sosei Heptares), Professor Debora Marks (Harvard University), Professor Cecilia Clementi (Freie Universität Berlin), Professor Aleksej Zelezniak (Chalmers University of Technology), Dr Oscar Méndez-Lucio (Janssen Pharmaceuticals), Professor Charlotte Deane (University of Oxford), Professor Tudor Oprea (University of New Mexico), Dr Derek Lowe (Novartis), Professor Stephen Burley (RCSB PDB, Rutgers University, UCSD).

Conference web page is here.


AI 4 Proteins Seminar Series 2021


The AI 4 Proteins Seminar Series 2021 is now in full swing; the first two presentations, by Lucy Colwell and Melanie Vollmar, were really brilliant and are now on the CICAG YouTube channel.

You can find out more about the forthcoming events in the series here

The final event is a two-day meeting on Protein Structure Prediction. This looks like it will be a great meeting with a fantastic lineup. Current invited speakers include: Professor Debora Marks (Harvard University), Professor Cecilia Clementi (Freie Universität Berlin), Professor Charlotte Deane (University of Oxford), Professor Tudor Oprea (University of New Mexico), Dr Derek Lowe (Novartis) and Professor Stephen Burley (RCSB PDB, Rutgers University, UCSD).

There is still time to submit abstracts for short talks and posters.

Short Talk Abstract Submission Form. Deadline: 29/04/2021. Notification of Acceptance: 06/05/2021.
Poster Abstract Submission Form. Deadline: 29/04/2021. Notification of Acceptance: 06/05/2021.
Poster & Video Guidelines for Accepted Posters

Full details are here


Mapping chemical reaction space


A really nice paper looking at reaction classification based on text descriptions, and visualisation using reaction fingerprints: Mapping the space of chemical reactions using attention-based neural networks. DOI

Can be installed using conda

All code is on GitHub



Ensemble learning in Cheminformatics


Yet another invaluable post on cheminformatics and machine learning: "Python package for ensemble learning" #Chemoinformatics #Scikit-learn.

Ensemble learning can sometimes outperform a single model, so it is a useful method to try. Fortunately, ensemble learning is now very easy to use via a Python package named 'mlens'.

Install using PIP

pip install mlens

ML-Ensemble (mlens) is an open-source, high-performance ensemble learning package written in Python; the code is available on GitHub.

ML-Ensemble combines a Scikit-learn high-level API with a low-level computational graph framework to build memory efficient, maximally parallelized ensemble networks in as few lines of codes as possible.


Autocompletion with deep learning


This looks really interesting

TabNine is an autocompleter that helps you write code faster by adding a deep learning model which significantly improves suggestion quality. You can see videos at the link above.

There has been a lot of hype about deep learning in the past few years. Neural networks are state-of-the-art in many academic domains, and they have been deployed in production for tasks such as autonomous driving, speech synthesis, and adding dog ears to human faces. Yet developer tools have been slow to benefit from these advances.

Deep TabNine is trained on around 2 million files from GitHub. During training, its goal is to predict each token given the tokens that come before it. To achieve this goal, it learns complex behaviour, such as type inference in dynamically typed languages.

An interesting idea, my only concern is the quality of code in the training set.


AI in Chemistry bursaries still available


The 2nd RSC-BMCS / RSC-CICAG Artificial Intelligence in Chemistry meetings is filling up fast, however there are still 6 bursaries unallocated. The closing date for applications is 15 July. The bursaries are available up to a value of £250, to support registration, travel and accommodation costs for PhD and post-doctoral applicants studying at European academic institutions.

You can find details here

Twitter hashtag - #AIChem19


Python leads the 11 top Data Science, Machine Learning platforms


The results of the latest KDnuggets poll, which is in its 20th year, are in. Python is clearly moving to become the dominant platform, with the votes for R slowly declining.


The blog post on KDnuggets gives far more detailed analysis and is well worth reading.


Can we trust Published data?

I posted a poll on twitter

Looking at abstracts for the AI in Chemistry Meeting … many mine published data. The quality of the public data is obviously critical for good models. Is this something the AI community should be concerned about or get involved with to improve the quality of the literature?

The results are now in and, interestingly, despite nearly 2.5K impressions only 28 people voted. Of those that voted, the overwhelming majority feel that AI scientists should help to improve the quality of the literature.


The comments associated with the tweet are interesting; certainly many machine learning models are robust enough to accommodate some poor data, but I think there is a deeper concern.

Elisabeth Bik has regularly flagged questionable publications; unfortunately these are not always detected before their influence has propagated through the literature.

For a very detailed example look at 5-HTTLPR: A POINTED REVIEW looking at an unusual version of the serotonin transporter gene 5-HTTLPR.

I've heard of many examples of scientists being unable to reproduce literature findings; usually little happens. However, Amgen were able to reproduce only 6 out of 53 'landmark' studies, and they published their findings.

How many times do scientists assume failure to reproduce published findings is their error?

There have been several studies looking at the possible causes of the failure to reproduce work. In 2011, an evaluation of 246 antibodies used in epigenetic studies found that one-quarter failed tests for specificity, meaning that they often bound to more than one target; four antibodies were perfectly specific, but to the wrong target (Reproducibility crisis: Blame it on the antibodies).

See also "The antibody horror show: an introductory guide for the perplexed" DOI

Colourful as this may appear, the outcomes for the community are uniformly grim, including badly damaged scientific careers, wasted public funding, and contaminated literature.

If you are mining literature data to predict novel drug targets then Caveat emptor.



Special Issue "Machine Learning with Python"


I was just sent details of a Special Issue "Machine Learning with Python" for the journal Information.

We live in this day and age where quintillions of bytes of data are generated and collected every day. Around the globe, researchers and companies are leveraging these vast amounts of data in countless application areas, ranging from drug discovery to improving transportation with self-driving cars. As we all know, Python evolved into the lingua franca of machine learning and artificial intelligence research over the last couple of years. What makes Python particularly attractive for us researchers is that it gives us access to a cohesive set of tools for scientific computing and is easy to teach and learn. Also, as a language that bridges many different technologies and different fields, Python fosters interdisciplinary collaboration. And besides making us more productive in our research, sharing tools we develop in Python has the potential to reach a wide audience and benefit the broader research community.

This special issue is now open for submission.


Can we trust published data


A Twitter poll: can we trust published data, and should the AI community be involved?

Looking at abstracts for … many mine published data. The quality of the public data is obviously critical for good models.


2nd RSC-BMCS / RSC-CICAG Artificial Intelligence in Chemistry


In June 2018 the first RSC-BMCS / RSC-CICAG Artificial Intelligence in Chemistry meeting was held in London. This proved to be enormously popular; there were more oral abstract and poster submissions than we had space for, and the meeting was so over-subscribed we could have filled a venue double the size.

Planning for the second meeting is now in full swing, and it will be held in Cambridge 2-3 September 2019.

Event : 2nd RSC-BMCS / RSC-CICAG Artificial Intelligence in Chemistry
Dates : Monday-Tuesday, 2nd to 3rd September 2019
Place : Fitzwilliam College, Cambridge, UK
Websites : Event website, and RSC website.

Twitter #AIChem19


Applications for both oral and poster presentations are welcomed. Posters will be displayed throughout the day, and applicants are asked if they wish to provide a two-minute flash oral presentation when submitting their abstract. The closing dates for submissions are:

  • 31st March for oral and
  • 5th July for poster

Full details can be found on the Event website.


A Jupyter Kernel for Swift


I'm constantly impressed by the expansion of Jupyter; it is rapidly becoming the first-choice platform for interactive computing.

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

The latest expansion is a Jupyter kernel for Swift, intended to make it possible to use Jupyter with the Swift for TensorFlow project.

Swift for TensorFlow is a new way to develop machine learning models. It gives you the power of TensorFlow directly integrated into the Swift programming language. With Swift, you can write the following imperative code, and Swift automatically turns it into a single TensorFlow Graph and runs it with the full performance of TensorFlow Sessions on CPU, GPU and TPU.

Requires macOS 10.13.5 or later, with Xcode 10.0 beta or later.


GuacaMol, benchmarking models.


Comparison of different algorithms is an under-researched area; this publication looks like a useful starting point.

GuacaMol: Benchmarking Models for De Novo Molecular Design

De novo design seeks to generate molecules with required property profiles by virtual design-make-test cycles. With the emergence of deep learning and neural generative models in many application areas, models for molecular design based on neural networks appeared recently and show promising results. However, the new models have not been profiled on consistent tasks, and comparative studies to well-established algorithms have only seldom been performed. To standardize the assessment of both classical and neural models for de novo molecular design, we propose an evaluation framework, GuacaMol, based on a suite of standardized benchmarks. The benchmark tasks encompass measuring the fidelity of the models to reproduce the property distribution of the training sets, the ability to generate novel molecules, the exploration and exploitation of chemical space, and a variety of single and multi-objective optimization tasks. The benchmarking framework is available as an open-source Python package.

Source code :

The easiest way to install guacamol is with pip:

pip install git+ --process-dependency-links

guacamol requires the RDKit library (version 2018.09.1.0 or newer).


Updated Conda


I've been checking a few things since I updated. One thing that was immediately apparent was that the similarity maps in RDKit are much nicer, as you can see from the output of the HERG prediction.


Feel like I got something for free.


Accessing a Jupyter Notebook HERG model from Vortex


A recent paper "The Catch-22 of Predicting hERG Blockade Using Publicly Accessible Bioactivity Data" DOI described a classification model for HERG activity. I was delighted to see that all the datasets used in the study, including the training and external datasets, and the models generated using these datasets were provided as individual data files (CSV) and Python Jupyter notebooks, respectively, on GitHub

The models were downloaded and the Random Forest Jupyter Notebooks (using RDKit) modified to save the generated model with pickle, and then another Jupyter notebook was created to access the stored model without the need to rebuild it each time. This notebook was exported as a Python script to allow command-line access, and Vortex scripts were created that allow the user to run the model within Vortex, import the results and view the most significant features.
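The save-once / load-many pattern described above can be sketched generically (with a scikit-learn random forest standing in for the published HERG model; the file name is hypothetical):

```python
# Generic save-once / load-many sketch of the workflow described above,
# with a scikit-learn random forest standing in for the published HERG
# model; the file name is hypothetical.
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# First notebook: train once and pickle the fitted model.
with open("model.pkl", "wb") as fh:
    pickle.dump(model, fh)

# Second notebook / command-line script: load and predict without retraining.
with open("model.pkl", "rb") as fh:
    loaded = pickle.load(fh)

print(loaded.predict(X[:5]))
```

Because the loading script never rebuilds the model, it can be exported as a plain Python script and called from a tool like Vortex on demand.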

All models and scripts are available for download.

Full details are here…



Deep Learning Cheat Sheet (using Python Libraries)


Just came across this really invaluable resource.

  • Deep Learning Cheat Sheet (using Python Libraries)
  • PySpark Cheat Sheet: Spark in Python
  • Data Science in Python: Pandas Cheat Sheet
  • Cheat Sheet: Python Basics For Data Science
  • A Cheat Sheet on Probability
  • Cheat Sheet: Data Visualization with R
  • New Machine Learning Cheat Sheet by Emily Barry
  • Matplotlib Cheat Sheet
  • One-page R: a survival guide to data science with R
  • Cheat Sheet: Data Visualization in Python
  • Stata Cheat Sheet
  • Common Probability Distributions: The Data Scientist’s Crib Sheet
  • Data Science Cheat Sheet
  • 24 Data Science, R, Python, Excel, and Machine Learning Cheat Sheets
  • 14 Great Machine Learning, Data Science, R , DataViz Cheat Sheets


A tutorial on KNIME Deeplearning4J Integration


An interesting blog post

The aim of this blog post is to highlight some of the key features of the KNIME Deeplearning4J (DL4J) integration, and help newcomers to either Deep Learning or KNIME to be able to take their first steps with Deep Learning in KNIME Analytics Platform.

