Macs in Chemistry

Insanely Great Science

Data-driven Advice for Applying Machine Learning to Bioinformatics Problems


A very useful paper

Here we contribute a thorough analysis of 13 state-of-the-art, commonly used machine learning algorithms on a set of 165 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers. We present a number of statistical and visual comparisons of algorithm performance and quantify the effect of model selection and algorithm tuning for each algorithm and dataset. The analysis culminates in the recommendation of five algorithms with hyperparameters that maximize classifier performance across the tested problems, as well as general guidelines for applying machine learning to supervised classification problems.

Good to see my preferred method, Random Forest, close to the top of the ranking, which is based on performance over all 165 datasets.

The rankings show the strength of ensemble-based tree algorithms in generating accurate models: The first, second, and fourth-ranked algorithms belong to this class of algorithms.

All 13 ML algorithms were used as implemented in scikit-learn, a popular ML library implemented in Python.
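As a rough illustration of what ranking algorithms over many datasets involves, here is a minimal Python sketch that ranks each algorithm within each dataset and then averages the ranks. This is not the paper's own code or exact procedure. The algorithm labels, dataset names and scores are invented placeholders, and the helper names `rank_scores` and `mean_ranks` are hypothetical.

```python
from statistics import mean


def rank_scores(scores):
    """Rank algorithms on one dataset: 1 = highest score, ties share the average rank."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover every algorithm tied with the one at position i.
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # average of the tied 1-based positions
        for name, _ in ordered[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks


def mean_ranks(results):
    """results: {dataset: {algorithm: score}} -> {algorithm: mean rank over datasets}."""
    per_dataset = [rank_scores(scores) for scores in results.values()]
    algorithms = per_dataset[0].keys()
    return {a: mean(r[a] for r in per_dataset) for a in algorithms}


# Invented placeholder scores (NOT taken from the paper), e.g. cross-validated accuracy.
results = {
    "dataset_a": {"RandomForest": 0.91, "GradientBoosting": 0.93, "LogisticRegression": 0.85},
    "dataset_b": {"RandomForest": 0.88, "GradientBoosting": 0.88, "LogisticRegression": 0.80},
    "dataset_c": {"RandomForest": 0.79, "GradientBoosting": 0.75, "LogisticRegression": 0.77},
}

for name, rank in sorted(mean_ranks(results).items(), key=lambda kv: kv[1]):
    print(f"{name}: {rank:.2f}")
```

A lower mean rank means the algorithm tended to place nearer the top across datasets. Averaging ranks rather than raw scores stops a few easy or hard datasets from dominating the comparison.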


PAINS Vortex script


One of the great features of the latest version of Vortex (> build 29622) is the ability to script multiple substructure searches using SMARTS. This is useful on many occasions, for example when you want to flag molecules that contain reactive functional groups, toxicophores, or PAINS functional groups that have been shown to interfere with high-throughput screens. Vortex tutorial 24 described how to do this multi-substructure searching.

There have now been a couple of new publications describing the identification of false positives in high-throughput screening campaigns in which the binding of glutathione S-transferase (GST) to glutathione (GSH) is used for detection of GST-tagged proteins.

  • Identification of Small-Molecule Frequent Hitters of Glutathione S-Transferase–Glutathione Interaction DOI
  • Identification of Small-Molecule Frequent Hitters from AlphaScreen High-Throughput Screens DOI

There have also been some suggestions as to how some of the motifs might be interfering with the assay, as shown below.


I've now added the additional structural motif definitions, taking the total to 550 SMARTS definitions. Some of these motifs may not be an issue when using alternative screening technologies, but it is well worth double-checking any molecules flagged by this script before committing significant resources to follow-up.

This comment in Nature is perhaps worth noting:

Academic researchers, drawn into drug discovery without appropriate guidance, are doing muddled science. When biologists identify a protein that contributes to disease, they hunt for chemical compounds that bind to the protein and affect its activity. A typical assay screens many thousands of chemicals. ‘Hits’ become tools for studying the disease, as well as starting points in the hunt for treatments. These molecules — pan-assay interference compounds, or PAINS — have defined structures, covering several classes of compound. But biologists and inexperienced chemists rarely recognize them. Instead, such compounds are reported as having promising activity against a wide variety of proteins. Time and research money are consequently wasted in attempts to optimize the activity of these compounds. Chemists make multiple analogues of apparent hits hoping to improve the ‘fit’ between protein and compound. Meanwhile, true hits with real potential are neglected.

I've updated the tutorial and the scripts for download.


A workflow for docking/virtual screening part 2


In the previous workflow I described docking a set of ligands with known activity into a target protein. In this workflow we will search for novel ligands using a set of ligands from the ZINC dataset. Once the ligands are docked, the workflow moves on to finding vendors and selecting subsets for purchase.



3D printing large models


Whilst I've seen lots of examples of small printed models, this is the first time I've seen models suitable for use as teaching aids in a lecture theatre. An excellent idea.

Three-Dimensional Printing of a Scalable Molecular Model and Orbital Kit for Organic Chemistry Teaching and Learning DOI

Three-dimensional (3D) chemical models are a well-established learning tool used to enhance the understanding of chemical structures by converting two-dimensional paper or screen outputs into realistic three-dimensional objects. While commercial atom model kits are readily available, there is a surprising lack of large molecular and orbital models that could be used in large spaces. As part of a program investigating the utility of 3D printing in teaching, a modular size-adjustable molecular model and orbital kit was developed and produced using 3D printing and was used to enhance the teaching of stereochemistry, isomerism, hybridization, and orbitals.

Now added to the 3D-printing page.


CICAG Summer newsletter


The 2017 Summer Newsletter is now available for download.

This includes reports from the scientific meetings supported and details of potential future meetings, together with news items that might be of interest to members of the RSC CICAG interest group.

The Chemical Information and Computer Applications Group (CICAG) is one of the RSC’s many member-led Interest Groups.

The aims of the group are to:

  • support users of chemical information, data and computer applications and advance excellence in the chemical sciences;
  • inform RSC members and others of the latest developments in these rapidly evolving areas;
  • promote the wider recognition of excellence in chemical information and computer applications at this level.

If you are an RSC member who is interested in joining the group, contact the membership team.