Yet another invaluable post on cheminformatics and machine learning Python package for Ensemble learning #Chemoinformatics #Scikit learn.
Ensemble learning sometime outperform than single model. So it is useful for try to use the method. Fortunately now we can use ensemble learning very easily by using a python package named ‘mlens‘
Install using PIP
pip install mlens
ML-Ensemble combines a Scikit-learn high-level API with a low-level computational graph framework to build memory efficient, maximally parallelized ensemble networks in as few lines of codes as possible.
This looks really interesting
TabNine is an autocompleter that helps you write code faster by adding a deep learning model which significantly improves suggestion quality. You can see videos at the link above.
There has been a lot of hype about deep learning in the past few years. Neural networks are state-of-the-art in many academic domains, and they have been deployed in production for tasks such as autonomous driving, speech synthesis, and adding dog ears to human faces. Yet developer tools have been slow to benefit from these advances
Deep TabNine is trained on around 2 million files from GitHub. During training, its goal is to predict each token given the tokens that come before it. To achieve this goal, it learns complex behaviour, such as type inference in dynamically typed languages.
An interesting idea, my only concern is the quality of code in the training set.
The 2nd RSC-BMCS / RSC-CICAG Artificial Intelligence in Chemistry meetings is filling up fast, however there are still 6 bursaries unallocated. The closing date for applications is 15 July. The bursaries are available up to a value of £250, to support registration, travel and accommodation costs for PhD and post-doctoral applicants studying at European academic institutions.
You can find details here https://www.maggichurchouseevents.co.uk/bmcs/AI-2019.htm.
Twitter hashtag - #AIChem19
I posted a poll on twitter
Looking at abstracts for the AI in Chemistry Meeting … many mine published data. The quality of the public data is obviously critical for good models. Is this something the AI community should be concerned about or get involved with to improve the quality of the literature?
The results are now in and interestingly despite nearly 2.5K impressions only 28 people voted. Of those that voted the overwhelming majority feel that AI scientists should help to improve the quality of the literature.
The comments associated with the tweet are interesting, certainly many machine learning models are robust enough to accommodate some poor data but I think there is a deeper concern.
Elisabeth Bik has regularly flagged questionable publications, unfortunately these are not always detected before their influence has been propagated through the literature.
For a very detailed example look at 5-HTTLPR: A POINTED REVIEW looking at an unusual version of the serotonin transporter gene 5-HTTLPR.
I've heard of many examples of scientists being unable to reproduce literature findings, usually little happens, however Amgen were able to reproduce only 6 out of 53 'landmark' studies and they published their findings.
How many times do scientists assume failure to reproduce published findings is their error?
There have been several studies looking at the possible causes of the failure to reproduce work, in 2011, an evaluation of 246 antibodies used in epigenetic studies found that one-quarter failed tests for specificity, meaning that they often bound to more than one target. Four antibodies were perfectly specific — but to the wrong target Reproducibility crisis: Blame it on the antibodies.
See also "The antibody horror show: an introductory guide for the perplexed" DOI
Colourful as this may appear, the outcomes for the community are uniformly grim, including badly damaged scientific careers, wasted public funding, and contaminated literature.
If you are mining literature data to predict novel drug targets then Caveat emptor.
I was just sent details of a Special Issue "Machine Learning with Python for the journal Information.
We live in this day and age where quintillions of bytes of data are generated and collected every day. Around the globe, researchers and companies are leveraging these vast amounts of data in countless application areas, ranging from drug discovery to improving transportation with self-driving cars.As we all know, Python evolved into the lingua franca of machine learning and artificial intelligence research over the last couple of years. What makes Python particularly attractive for us researchers is that it gives us access to a cohesive set of tools for scientific computing and is easy to teach and learn. Also, as a language that bridges many different technologies and different fields, Python fosters interdisciplinary collaboration. And besides making us more productive in our research, sharing tools we develop in Python has the potential to reach a wide audience and benefit the broader research community.
This special issue is now open for submission.
A Twitter poll can we trust published data and should AI community be involved?
Looking at abstracts for https://www.maggichurchouseevents.co.uk/bmcs/AI-2019.htm … many mine published data. The quality of the public data is obviously critical for good models.
In June 2018 the First RSC-BMCS / RSC-CICAG Artificial Intelligence in Chemistry meeting was held in London. This proved to enormously popular, there were more oral abstracts and poster submissions than we had space for and was so over-subscribed we could have filled a venue double the size.
Planning for the second meeting is now in full swing, and it will be held in Cambridge 2-3 September 2019.
Event : 2nd RSC-BMCS / RSC-CICAG Artificial Intelligence in Chemistry
Dates : Monday-Tuesday, 2nd to 3rd September 2019
Place : Fitzwilliam College, Cambridge, UK
Websites : Event website, and RSC website.
Applications for both oral and poster presentations are welcomed. Posters will be displayed throughout the day and applicants are asked if they wished to provide a two-minute flash oral presentation when submitting their abstract. The closing dates for submissions are:
- 31st March for oral and
- 5th July for poster
Full details can be found on the Event website,
I'm constantly impressed by the expansion of Jupyter it is rapidly becoming the first-choice platform for interactive computing.
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
Swift for TensorFlow is a new way to develop machine learning models. It gives you the power of TensorFlow directly integrated into the Swift programming language. With Swift, you can write the following imperative code, and Swift automatically turns it into a single TensorFlow Graph and runs it with the full performance of TensorFlow Sessions on CPU, GPU and TPU.
Requires MacOS 10.13.5 or later, with Xcode 10.0 beta or later
Comparison of different algorithms is an under researched area, this publication looks like a useful starting point.
De novo design seeks to generate molecules with required property profiles by virtual design-make-test cycles. With the emergence of deep learning and neural generative models in many application areas, models for molecular design based on neural networks appeared recently and show promising results. However, the new models have not been profiled on consistent tasks, and comparative studies to well-established algorithms have only seldom been performed. To standardize the assessment of both classical and neural models for de novo molecular design, we propose an evaluation framework, GuacaMol, based on a suite of standardized benchmarks. The benchmark tasks encompass measuring the fidelity of the models to reproduce the property distribution of the training sets, the ability to generate novel molecules, the exploration and exploitation of chemical space, and a variety of single and multi-objective optimization tasks. The benchmarking framework is available as an open-source Python package.
Source code : https://github.com/BenevolentAI/guacamol.
The easiest way to install guacamol is with pip:
pip install git+https://github.com/BenevolentAI/guacamol.git#egg=guacamol --process-dependency-links
guacamol requires the RDKit library (version 2018.09.1.0 or newer).
A recent paper "The Catch-22 of Predicting hERG Blockade Using Publicly Accessible Bioactivity Data" DOI described a classification model for HERG activity. I was delighted to see that all the datasets used in the study, including the training and external datasets, and the models generated using these datasets were provided as individual data files (CSV) and Python Jupyter notebooks, respectively, on GitHub https://github.com/AGPreissner/Publications).
The models were downloaded and the Random Forest Jupyter Notebooks (using RDKit) modified to save the generated model using pickle to store the predictive model, and then another Jupyter notebook was created to access the model without the need to rebuild the model each time. This notebook was exported as a python script to allow command line access, and Vortex scripts created that allow the user to run the model within Vortex and import the results and view the most significant features.
All models and scripts are available for download.
Just came across this really invaluable resource.
- Deep Learning Cheat Sheet (using Python Libraries)
- PySpark Cheat Sheet: Spark in Python
- Data Science in Python: Pandas Cheat Sheet
- Cheat Sheet: Python Basics For Data Science
- A Cheat Sheet on Probability
- Cheat Sheet: Data Visualization with R
- New Machine Learning Cheat Sheet by Emily Barry
- Matplotlib Cheat Sheet
- One-page R: a survival guide to data science with R
- Cheat Sheet: Data Visualization in Python
- Stata Cheat Sheet
- Common Probability Distributions: The Data Scientist’s Crib Sheet
- Data Science Cheat Sheet
- 24 Data Science, R, Python, Excel, and Machine Learning Cheat Sheets
- 14 Great Machine Learning, Data Science, R , DataViz Cheat Sheets
An interesting blog post
The aim of this blog post is to highlight some of the key features of the KNIME Deeplearning4J (DL4J) integration, and help newcomers to either Deep Learning or KNIME to be able to take their first steps with Deep Learning in KNIME Analytics Platform.
I've become a great fan of Jupyter Notebooks as a way of modelling cheminformatics data, and I've published some of the notebooks here.
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.
In the predicting AMES activity notebook I also looked at the use of pickle to store the predictive model and then access it using a Jupyter notebook without the need to rebuild the model. Whilst a notebook is a nice way to access the predictive model it might also be useful to be able to access it from other applications or from the command line.
In this tutorial we look at providing command line access to the model and then incorporating it into a Vortex script.
A very useful paper https://arxiv.org/abs/1708.05070
Here we contribute a thorough analysis of 13 state-of-the-art, commonly used machine learning algorithms on a set of 165 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers. We present a number of statistical and visual comparisons of algorithm performance and quantify the effect of model selection and algorithm tuning for each algorithm and dataset. The analysis culminates in the recommendation of five algorithms with hyperparameters that maximize classifier performance across the tested problems, as well as general guidelines for applying machine learning to supervised classification problems.
Good to see my preferred method Random Forest close to the top of the ranking based on performance over 165 datasets.
The rankings show the strength of ensemble-based tree algorithms in generating accurate models: The first, second, and fourth-ranked algorithms belong to this class of algorithms.
All 13 ML algorithms were used as implemented in scikit-learn, a popular ML library implemented in Python.
An interesting post By Matthew Mayo, KDnuggets.
Here is a quick collection of such books to start your fair weather study off on the right foot. The list begins with a base of statistics, moves on to machine learning foundations, progresses to a few bigger picture titles, has a quick look at an advanced topic or 2, and ends off with something that brings it all together. A mix of classic and contemporary titles, hopefully you find something new (to you) and of interest here.
I've been experimenting with the use of Jupyter Notebooks (aka iPython Notebooks) as an electronic lab notebook but also a means to share computational models. The aim would be to see how easy it would be to share a model together with the associated training data together with an explanation of how the model was built and how it can be used for novel molecules.
The Ames test is a widely employed method that uses bacteria to test whether a given chemical can cause mutations in the DNA of the test organism. More formally, it is a biological assay to assess the mutagenic potential of chemical compounds. PNAS. 70 (8): 2281–5. doi
In this first notebook a random forest model to predict AMES activity is described….