In June 2018 the First RSC-BMCS / RSC-CICAG Artificial Intelligence in Chemistry meeting was held in London. This proved to be enormously popular: there were more oral abstract and poster submissions than we had space for, and the meeting was so over-subscribed that we could have filled a venue double the size.
Planning for the second meeting is now in full swing, and it will be held in Cambridge on 2-3 September 2019.
Event : 2nd RSC-BMCS / RSC-CICAG Artificial Intelligence in Chemistry
Dates : Monday-Tuesday, 2nd to 3rd September 2019
Place : Fitzwilliam College, Cambridge, UK
Websites : Event website, and RSC website.
Applications for both oral and poster presentations are welcomed. Posters will be displayed throughout the day, and applicants are asked when submitting their abstract whether they wish to give a two-minute flash oral presentation. The closing dates for submissions are:
- 31st March for oral presentations and
- 5th July for posters
Full details can be found on the Event website.
I'm constantly impressed by the expansion of Jupyter; it is rapidly becoming the first-choice platform for interactive computing.
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
Swift for TensorFlow is a new way to develop machine learning models. It gives you the power of TensorFlow directly integrated into the Swift programming language. With Swift, you can write the following imperative code, and Swift automatically turns it into a single TensorFlow Graph and runs it with the full performance of TensorFlow Sessions on CPU, GPU and TPU.
Requires macOS 10.13.5 or later, with Xcode 10.0 beta or later.
Comparison of different algorithms is an under-researched area, and this publication looks like a useful starting point.
De novo design seeks to generate molecules with required property profiles by virtual design-make-test cycles. With the emergence of deep learning and neural generative models in many application areas, models for molecular design based on neural networks appeared recently and show promising results. However, the new models have not been profiled on consistent tasks, and comparative studies to well-established algorithms have only seldom been performed. To standardize the assessment of both classical and neural models for de novo molecular design, we propose an evaluation framework, GuacaMol, based on a suite of standardized benchmarks. The benchmark tasks encompass measuring the fidelity of the models to reproduce the property distribution of the training sets, the ability to generate novel molecules, the exploration and exploitation of chemical space, and a variety of single and multi-objective optimization tasks. The benchmarking framework is available as an open-source Python package.
Source code : https://github.com/BenevolentAI/guacamol.
The easiest way to install guacamol is with pip:
pip install git+https://github.com/BenevolentAI/guacamol.git#egg=guacamol --process-dependency-links
guacamol requires the RDKit library (version 2018.09.1.0 or newer).
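To make the idea of a distribution-learning benchmark a little more concrete, here is a minimal sketch of novelty (the ability to generate molecules not seen in training) together with a related measure, uniqueness. This is not the GuacaMol implementation: the function names are hypothetical, and the SMILES strings are compared exactly as written, whereas a real benchmark would first canonicalise them (for example with RDKit) and discard invalid molecules.

```python
from typing import Iterable, List


def uniqueness(generated: Iterable[str]) -> float:
    """Fraction of generated strings that are distinct."""
    items: List[str] = list(generated)
    if not items:
        return 0.0
    return len(set(items)) / len(items)


def novelty(generated: Iterable[str], training: Iterable[str]) -> float:
    """Fraction of distinct generated strings that are absent from the training set."""
    distinct = set(generated)
    if not distinct:
        return 0.0
    seen = set(training)
    return len(distinct - seen) / len(distinct)


if __name__ == "__main__":
    # Toy data for illustration only; strings are assumed to be canonical SMILES
    training_set = ["CCO", "CCN", "c1ccccc1"]
    samples = ["CCO", "CCC", "CCC", "CCCl"]
    print(f"uniqueness = {uniqueness(samples):.2f}")
    print(f"novelty    = {novelty(samples, training_set):.2f}")
```

With the toy data above, three of the four samples are distinct and two of those three are absent from the training set.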
A recent paper, "The Catch-22 of Predicting hERG Blockade Using Publicly Accessible Bioactivity Data" (DOI), described a classification model for hERG activity. I was delighted to see that all the datasets used in the study, including the training and external datasets, and the models generated using these datasets were provided as individual data files (CSV) and Python Jupyter notebooks, respectively, on GitHub (https://github.com/AGPreissner/Publications).
The models were downloaded and the Random Forest Jupyter notebooks (using RDKit) were modified to save the trained model with pickle. A second Jupyter notebook was then created to load the saved model, so that it does not need to be rebuilt each time. This notebook was exported as a Python script to allow command-line access, and Vortex scripts were created that allow the user to run the model within Vortex, import the results and view the most significant features.
All models and scripts are available for download.
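The save-once, load-many pattern described above can be sketched roughly as follows. This is an illustration rather than the code from the notebooks: the function names and hyperparameters are my own assumptions, and random bit vectors stand in for the RDKit fingerprints so that the example runs without a chemistry toolkit.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestClassifier


def make_toy_data(n_samples=200, n_bits=64, seed=0):
    """Random bit vectors standing in for molecular fingerprints (assumption)."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n_samples, n_bits))
    # Synthetic label: "active" when at least two of the first three bits are set
    y = (X[:, :3].sum(axis=1) >= 2).astype(int)
    return X, y


def train_model(X, y):
    """Fit a small random forest; the hyperparameters are illustrative only."""
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X, y)
    return model


def save_model(model, path: Path) -> None:
    """Serialise the fitted model to disk with pickle."""
    with path.open("wb") as handle:
        pickle.dump(model, handle)


def load_model(path: Path):
    """Load a previously saved model without retraining it."""
    # Only unpickle files you trust: pickle can execute arbitrary code on load
    with path.open("rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    X, y = make_toy_data()
    model = train_model(X, y)
    with tempfile.TemporaryDirectory() as tmp:
        model_path = Path(tmp) / "rf_model.pkl"
        save_model(model, model_path)
        restored = load_model(model_path)
    # The restored model should reproduce the original predictions
    print(np.array_equal(model.predict(X), restored.predict(X)))
    # Indices of the five input bits the forest considers most important
    print(np.argsort(restored.feature_importances_)[::-1][:5])
```

One caveat worth keeping in mind: the scikit-learn documentation advises loading a pickled model with the same library version that created it, so it is sensible to record the version alongside the model file.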
Just came across this really invaluable resource.
- Deep Learning Cheat Sheet (using Python Libraries)
- PySpark Cheat Sheet: Spark in Python
- Data Science in Python: Pandas Cheat Sheet
- Cheat Sheet: Python Basics For Data Science
- A Cheat Sheet on Probability
- Cheat Sheet: Data Visualization with R
- New Machine Learning Cheat Sheet by Emily Barry
- Matplotlib Cheat Sheet
- One-page R: a survival guide to data science with R
- Cheat Sheet: Data Visualization in Python
- Stata Cheat Sheet
- Common Probability Distributions: The Data Scientist’s Crib Sheet
- Data Science Cheat Sheet
- 24 Data Science, R, Python, Excel, and Machine Learning Cheat Sheets
- 14 Great Machine Learning, Data Science, R, DataViz Cheat Sheets
An interesting blog post
The aim of this blog post is to highlight some of the key features of the KNIME Deeplearning4J (DL4J) integration, and help newcomers to either Deep Learning or KNIME to be able to take their first steps with Deep Learning in KNIME Analytics Platform.
I've become a great fan of Jupyter Notebooks as a way of modelling cheminformatics data, and I've published some of the notebooks here.
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.
In the predicting AMES activity notebook I also looked at the use of pickle to store the predictive model and then access it using a Jupyter notebook without the need to rebuild the model. Whilst a notebook is a nice way to access the predictive model it might also be useful to be able to access it from other applications or from the command line.
In this tutorial we look at providing command line access to the model and then incorporating it into a Vortex script.
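As a rough idea of what command-line access to a pickled model can look like, here is a minimal sketch using only the standard library. It is not the script from the tutorial: the option names `--model` and `--input` are hypothetical, and it assumes the input is a headerless CSV file with an identifier in the first column followed by precomputed numeric features, rather than structures that still need descriptors calculated.

```python
import argparse
import csv
import pickle
import sys
from typing import List, Optional, Sequence, Tuple


def build_parser() -> argparse.ArgumentParser:
    """Define the (hypothetical) command-line options."""
    parser = argparse.ArgumentParser(description="Score rows with a pickled model.")
    parser.add_argument("--model", required=True, help="path to the pickled model")
    parser.add_argument("--input", required=True, help="CSV file: id, then numeric features")
    return parser


def read_rows(path: str) -> Tuple[List[str], List[List[float]]]:
    """Return (ids, feature rows) from a headerless CSV file."""
    ids: List[str] = []
    rows: List[List[float]] = []
    with open(path, newline="") as handle:
        for record in csv.reader(handle):
            if not record:
                continue  # skip blank lines
            ids.append(record[0])
            rows.append([float(value) for value in record[1:]])
    return ids, rows


def main(argv: Optional[Sequence[str]] = None) -> None:
    args_list = sys.argv[1:] if argv is None else list(argv)
    parser = build_parser()
    if not args_list:
        parser.print_help()  # no arguments: show usage instead of failing
        return
    args = parser.parse_args(args_list)
    with open(args.model, "rb") as handle:
        model = pickle.load(handle)  # only load pickle files you trust
    ids, rows = read_rows(args.input)
    if not rows:
        return
    # Any object with a scikit-learn style predict() method will work here
    predictions = model.predict(rows)
    writer = csv.writer(sys.stdout)
    for identifier, prediction in zip(ids, predictions):
        writer.writerow([identifier, prediction])


if __name__ == "__main__":
    main()
```

Saved as, say, `predict.py` (a hypothetical filename), it would be run as `python predict.py --model rf_model.pkl --input features.csv`. Writing the results to standard output as CSV keeps them easy to capture from another application.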
A very useful paper https://arxiv.org/abs/1708.05070
Here we contribute a thorough analysis of 13 state-of-the-art, commonly used machine learning algorithms on a set of 165 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers. We present a number of statistical and visual comparisons of algorithm performance and quantify the effect of model selection and algorithm tuning for each algorithm and dataset. The analysis culminates in the recommendation of five algorithms with hyperparameters that maximize classifier performance across the tested problems, as well as general guidelines for applying machine learning to supervised classification problems.
Good to see my preferred method, Random Forest, close to the top of the ranking based on performance over 165 datasets.
The rankings show the strength of ensemble-based tree algorithms in generating accurate models: The first, second, and fourth-ranked algorithms belong to this class of algorithms.
All 13 ML algorithms were used as implemented in scikit-learn, a popular ML library implemented in Python.
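For anyone wanting to try this kind of comparison on their own data, a minimal sketch with scikit-learn might look like the following. It is not the paper's protocol: the three classifiers, the synthetic dataset, the five-fold cross-validation and the balanced-accuracy score are all my own choices for illustration, and no hyperparameter tuning is done.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def compare_classifiers(X, y, cv=5):
    """Return the mean cross-validated balanced accuracy for each classifier."""
    # An arbitrary selection of classifiers with mostly default settings
    candidates = {
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
        "gradient_boosting": GradientBoostingClassifier(random_state=0),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    scores = {}
    for name, estimator in candidates.items():
        fold_scores = cross_val_score(
            estimator, X, y, cv=cv, scoring="balanced_accuracy"
        )
        scores[name] = float(fold_scores.mean())
    return scores


if __name__ == "__main__":
    # Synthetic data standing in for a real classification problem
    X, y = make_classification(
        n_samples=300, n_features=20, n_informative=5, random_state=0
    )
    results = compare_classifiers(X, y)
    for name, score in sorted(results.items(), key=lambda item: -item[1]):
        print(f"{name:20s} {score:.3f}")
```

A ranking from a single dataset says very little, which is why a comparison across many datasets is so useful.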
An interesting post by Matthew Mayo, KDnuggets.
Here is a quick collection of such books to start your fair weather study off on the right foot. The list begins with a base of statistics, moves on to machine learning foundations, progresses to a few bigger picture titles, has a quick look at an advanced topic or 2, and ends off with something that brings it all together. A mix of classic and contemporary titles, hopefully you find something new (to you) and of interest here.
I've been experimenting with the use of Jupyter Notebooks (aka IPython Notebooks) both as an electronic lab notebook and as a means of sharing computational models. The aim is to see how easy it would be to share a model together with its training data and an explanation of how the model was built and how it can be used for novel molecules.
The Ames test is a widely employed method that uses bacteria to test whether a given chemical can cause mutations in the DNA of the test organism. More formally, it is a biological assay to assess the mutagenic potential of chemical compounds. PNAS. 70 (8): 2281–5. doi
In this first notebook a random forest model to predict AMES activity is described….