Macs in Chemistry

Insanely Great Science

A collection of Vortex scripts to aid cluster analysis


Exploring and sorting large datasets of molecules often involves the use of clustering techniques to group together molecules with similar properties. Clustering can be used to flag outliers or highlight particular patterns, functional groups, or scaffolds. Whilst there are many clustering algorithms, it is often a challenge to sort through and analyse the results.

With millions of structures it is not really practical to simply scroll through the table, but here is a selection of scripts that might help with further analysis.


Read full article here.


Vortex running on ARM-based Mac


Sounds like performance is not an issue.


Small molecules approved by FDA in 2019


After I posted Small molecules approved by FDA in 2019, a number of people contacted me asking for the dataset and then asked how it was created. So I thought I'd put together a brief description of the process.


A Vortex script to calculate the Blood-Brain Barrier (BBB) SCORE


A recent publication, "The Blood–Brain Barrier (BBB) Score" (DOI), described a scoring function to determine the likelihood of a molecule being brain penetrant.

Since I'm often asked about improving CNS penetration it seemed useful to implement the algorithm in a Vortex script.

There are more details and a download link here.

The data for over 1000 examples are provided in the supplementary information, including both CNS penetrant and non-penetrant compounds. The plot below compares the data from the supplementary information (SuppInf_BBBscore) with the data calculated by this implementation in the Vortex script. Whilst overall there is good agreement, there appear to be a few outliers. On closer investigation many of the differences appear to be due to differences in the calculated TPSA. Since both implementations use the same ChemAxon software, it is possible that an updated version (I used version 19.8.0) has resulted in the differences.



Determining the Amino Acids in a collection of peptides


I've recently become interested in comparing the amino-acid composition of peptides, to allow comparison of cyclic versus linear peptides, or brain penetrant versus non-penetrant. I had a look around but could not find any tools that did this; in particular I wanted to include any non-proteinogenic amino acids.

This tutorial provides a means to analyse many thousands of peptides using Vortex.


Counting Identical structures in two datasets


Sometimes I have two datasets and I just want to know the overlap of identical structures. This Vortex script counts the number of identical structures by comparing InChIKeys. It then displays a matrix showing how many unique molecules are in each dataset and how many molecules are in both datasets.
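The comparison described above can be sketched in plain Python using sets. This is a minimal illustration rather than the Vortex script itself: the function name `overlap_matrix` and the placeholder keys are hypothetical, and it assumes the InChIKeys have already been generated for each dataset.

```python
def overlap_matrix(keys_a, keys_b):
    """Count unique InChIKeys found only in A, only in B, and in both."""
    set_a = set(keys_a)  # duplicates within a dataset collapse here
    set_b = set(keys_b)
    return {
        "only_a": len(set_a - set_b),
        "only_b": len(set_b - set_a),
        "both": len(set_a & set_b),
    }


# Placeholder strings standing in for real InChIKeys (hypothetical data).
dataset_a = ["KEY-ONE", "KEY-TWO", "KEY-THREE", "KEY-ONE"]
dataset_b = ["KEY-TWO", "KEY-THREE", "KEY-FOUR"]

counts = overlap_matrix(dataset_a, dataset_b)
print(counts)
```

Using sets means a structure that appears several times in one dataset is only counted once, which matches the idea of counting unique molecules.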



Rescoring Docking using RF-Score-VS


A little while back I described a docking workflow including a rescoring script for Vortex, so I thought it might be useful to include this on a separate page.

Recently, machine-learning scoring functions trained on protein-ligand complexes have shown significant promise. One example is RF-Score-VS, which was trained on 15 426 active and 893 897 inactive molecules docked to a set of 102 targets (DOI).

Our results show RF-Score-VS can substantially improve virtual screening performance: RF-Score-VS top 1% provides 55.6% hit rate, whereas that of Vina only 16.2% (for smaller percent the difference is even more encouraging: RF-Score-VS top 0.1% achieves 88.6% hit rate for 27.5% using Vina). In addition, RF-Score-VS provides much better prediction of measured binding affinity than Vina (Pearson correlation of 0.56 and −0.18, respectively). Lastly, we test RF-Score-VS on an independent test set from the DEKOIS benchmark and observed comparable results.

Binaries for RF-Score-VS are available


The full details of the Vortex script are here.


Making a Random Selection


Sometimes it is the simplest scripts that prove to be the most useful; the most downloaded AppleScript on the site is the one that simply prints the text on the clipboard.

I regularly need to select a specified number of molecules in a random fashion and this script does just that. Import an SDF file containing structures into Vortex and run the script to make a random selection.
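Outside Vortex, the same idea can be sketched with the Python standard library. This is only an illustration of random selection without replacement; the function name `random_selection` and the optional `seed` parameter are assumptions, not part of the Vortex script.

```python
import random


def random_selection(records, n, seed=None):
    """Return n records chosen at random without replacement."""
    records = list(records)
    if n > len(records):
        raise ValueError("cannot select more records than are available")
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return rng.sample(records, n)


# Hypothetical molecule identifiers standing in for rows of a table.
molecule_ids = ["mol_%03d" % i for i in range(1, 101)]
picked = random_selection(molecule_ids, 10, seed=42)
print(picked)
```

Passing a seed is optional; it is handy when the same selection needs to be regenerated later.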


Full details here….


Accessing a Jupyter Notebook hERG model from Vortex


A recent paper, "The Catch-22 of Predicting hERG Blockade Using Publicly Accessible Bioactivity Data" (DOI), described a classification model for hERG activity. I was delighted to see that all the datasets used in the study, including the training and external datasets, and the models generated using these datasets were provided as individual data files (CSV) and Python Jupyter notebooks, respectively, on GitHub.

The models were downloaded and the Random Forest Jupyter notebooks (using RDKit) were modified to save the generated predictive model using pickle. Another Jupyter notebook was then created to load the saved model without the need to rebuild it each time. This notebook was exported as a Python script to allow command line access, and Vortex scripts were created that allow the user to run the model within Vortex, import the results, and view the most significant features.
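The save-once, load-many pattern described above might look something like the sketch below. It is not the code from the notebooks: the helper names `save_model` and `load_model` are hypothetical, and a plain dictionary stands in for the trained Random Forest so the example runs without any extra packages.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialise a trained model to disk so it need not be rebuilt."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Reload a previously saved model. Only unpickle files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in object; in practice this would be the fitted classifier.
stand_in_model = {"name": "stand-in model", "n_trees": 100}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    save_model(stand_in_model, model_path)
    restored = load_model(model_path)

print(restored)
```

One caveat worth keeping in mind is that a pickled model generally needs to be loaded with library versions compatible with those used to create it.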

All models and scripts are available for download.

Full details are here…



Implementing AB-MPS scoring


Whilst the rule of 5 (Ro5) has provided a useful way to describe small molecule drug space, it is also clear that a significant number of molecular classes exist beyond the rule of 5 boundaries (bRo5). In a review of the AbbVie compound collection (DOI), the authors identified key findings that might explain the success (or failure) of bRo5 projects. From an analysis of a variety of calculated physicochemical properties they devised a simple multiparametric scoring function (AB-MPS) that correlated preclinical PK results with cLogD, number of aromatic rings (NAR), and number of rotatable bonds (NRB).

AB-MPS = Abs(cLogD-3) + NAR + NRB
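As a worked illustration, the formula translates directly into a few lines of Python. The function name `ab_mps` and the example values are hypothetical; in practice the three inputs would come from whatever property calculator is being used.

```python
def ab_mps(clogd, n_aromatic_rings, n_rotatable_bonds):
    """AB-MPS = Abs(cLogD - 3) + NAR + NRB."""
    return abs(clogd - 3) + n_aromatic_rings + n_rotatable_bonds


# Hypothetical compound: cLogD 4.5, 3 aromatic rings, 8 rotatable bonds.
score = ab_mps(4.5, 3, 8)
print(score)  # 1.5 + 3 + 8 = 12.5
```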

Now implemented as a Vortex script.


Updated Literature search script


I've updated the Vortex script to run text based queries of PubMed.

If you regularly use the E-utilities API you might want to read this.

After May 1, 2018, NCBI will limit your access to the E-utilities unless you have one of these keys. Obtaining an API key is quick, and simple, and will allow you to access NCBI data faster. If you don’t have an API key, E-utilities will still work, but you may be limited to fewer requests than allowed with an API key.

After May 1, 2018, any computer (IP address) that submits more than 3 E-utility requests per second will receive an error message. This limit applies to any combination of requests to EInfo, ESearch, ESummary, EFetch, ELink, EPost, ESpell, and EGquery.

If you write software or scripts that access the E-utilities API then the users will need to get their own API key. Calls will have this format

I've updated this script to reflect this change, and I've highlighted where you need to add your API key in the script. I've also tried to ensure that any query string is encoded to make it URL safe, and I've extended the search range up to 2018.
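A rough sketch of building a URL-safe ESearch query with an API key is shown below. It is not taken from the Vortex script: the function name `build_esearch_url`, the default `retmax`, and the `YOUR_API_KEY` placeholder are assumptions for illustration. The code only constructs the URL and makes no network request.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, api_key, db="pubmed", retmax=20):
    """Build an ESearch URL with the query string safely encoded."""
    params = {
        "db": db,
        "term": term,        # urlencode escapes spaces and special characters
        "retmax": retmax,
        "api_key": api_key,  # replace the placeholder with your own key
    }
    return ESEARCH_URL + "?" + urlencode(params)


url = build_esearch_url("blood-brain barrier AND kinase", "YOUR_API_KEY")
print(url)
```

Letting `urlencode` handle the escaping avoids hand-building query strings, which is where spaces and brackets in search terms tend to cause problems.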



Flagging Potential Kinase Inhibitors


Most kinase inhibitors bind in the region of the ATP binding site, using the hydrogen bonding interactions of the hinge region shown in the schematic below. We can use the knowledge of these hinge binding motifs to flag potential kinase inhibitors.




Vortex update

Dotmatics have announced the impending release of the latest update to Vortex.

The focus appears to be on the enhancement of the Vortex bioinformatics tools reviewed previously.


MayaChemTools


MayaChemTools is a fabulous collection of Perl and Python scripts, modules, and classes to support a variety of day-to-day computational discovery needs.

The core set of command line Perl scripts available in the current release of MayaChemTools has no external dependencies and provides functionality for the following tasks:

  • Manipulation and analysis of data in SD, CSV/TSV, sequence/alignments, and PDB files
  • Listing information about data in SD, CSV/TSV, Sequence/Alignments, PDB, and fingerprints files
  • Calculation of a key set of physicochemical properties, such as molecular weight, hydrogen bond donors and acceptors, logP, and topological polar surface area
  • Generation of 2D fingerprints corresponding to atom neighborhoods, atom types, E-state indices, extended connectivity, MACCS keys, path lengths, topological atom pairs, topological atom triplets, topological atom torsions, topological pharmacophore atom pairs, and topological pharmacophore atom triplets
  • Generation of 2D fingerprints with atom types corresponding to atomic invariants, DREIDING, E-state, functional class, MMFF94, SLogP, SYBYL, TPSA and UFF
  • Similarity searching and calculation of similarity matrices using available 2D fingerprints
  • Listing properties of elements in the periodic table, amino acids, and nucleic acids
  • Exporting data from relational database tables into text files

The command line Python scripts based on RDKit provide functionality for the following tasks:

  • Calculation of molecular descriptors
  • Comparison of 3D molecules based on RMSD and shape
  • Conversion between different molecular file formats
  • Enumeration of compound libraries and stereoisomers
  • Filtering molecules using SMARTS, PAINS, and names of functional groups
  • Generation of graph and atomic molecular frameworks
  • Generation of images for molecules
  • Performing structure minimization and conformation generation based on distance geometry and forcefields
  • Picking and clustering molecules based on 2D fingerprints and various clustering methodologies
  • Removal of duplicate molecules

These invaluable scripts can be used in other applications; I've written a Vortex script that uses them.


Scripting PubMed searches


PubMed comprises more than 24 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites. NCBI also provides a number of programming tools that allow access to the information; the E-utilities are a set of server-side programs that provide a stable interface into the Entrez query and database system.

To access these data, a piece of software first posts an E-utility URL to NCBI, then retrieves the results of this posting, after which it processes the data as required. The software can thus use any computer language that can send a URL to the E-utilities server and interpret the XML response; examples of such languages are Perl, Python, Java, and C++.

A while back I wrote a Vortex script that helps with this sort of search if you have multiple terms you want to search. I've updated this script to incorporate the changes requiring API keys to allow multiple requests to the E-utilities API, and I've highlighted where you need to add your own API key in the script. I've also tried to ensure that any query string is encoded to make it URL safe.
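The "interpret the XML response" step mentioned above can be sketched with the standard library XML parser. The sample response is hand-written for illustration (the IDs are made up) and only includes the `Count` and `IdList` elements that an ESearch result contains; the function name `parse_esearch` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like an ESearch result; the IDs are not real.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <RetMax>2</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>00000001</Id>
    <Id>00000002</Id>
  </IdList>
</eSearchResult>"""


def parse_esearch(xml_text):
    """Return the total hit count and the list of IDs from an ESearch result."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count", default="0"))
    ids = [node.text for node in root.findall("./IdList/Id")]
    return count, ids


hit_count, id_list = parse_esearch(SAMPLE_RESPONSE)
print(hit_count, id_list)
```

When looping over many search terms, a short pause between requests helps keep within the request limits NCBI describes.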

The update is detailed more fully here….



Downloading from the RCSB Protein Data Bank using Python


The RCSB Protein Data Bank is an absolutely invaluable resource that provides archive-information about the 3D shapes of proteins, nucleic acids, and complex assemblies that helps scientists understand all aspects of biomedicine and agriculture, from protein synthesis to health and disease. Currently the PDB contains over 134,000 data files containing structural information on 42547 distinct protein sequences of which 37600 are human sequences. They also provide a series of tools to search, view and analyse the data.

Downloading an individual PDB file is pretty trivial and can be done from the web page as shown in the image below. They also provide a Download Tool launched as a stand-alone application using the Java Web Start protocol. The tool is downloaded locally and must then be opened. I've found this a little temperamental and had issues with Java versions and security settings.

Since I've been making extensive use of the web services to interact with RCSB I decided to explore the use of Python to download multiple files. I started off creating a Jupyter notebook using the web services provided by RCSB.

I've also used variations on this code to create a python script and a Vortex script.
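A minimal sketch of downloading several files in one go is shown below. It is not the code from the notebook: it assumes the files are fetched from the RCSB file download address `https://files.rcsb.org/download/`, and the function names `pdb_url` and `download_pdb_files` and the output folder name are hypothetical. The download call is left commented out so that nothing is fetched unless you choose to run it.

```python
from pathlib import Path
from urllib.request import urlretrieve

DOWNLOAD_BASE = "https://files.rcsb.org/download"  # assumed download address


def pdb_url(pdb_id):
    """Build the download URL for a single PDB entry."""
    return "%s/%s.pdb" % (DOWNLOAD_BASE, pdb_id.strip().upper())


def download_pdb_files(pdb_ids, out_dir="pdb_files"):
    """Download each entry into out_dir and return the saved paths."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for pdb_id in pdb_ids:
        destination = target / ("%s.pdb" % pdb_id.strip().upper())
        urlretrieve(pdb_url(pdb_id), str(destination))
        saved.append(destination)
    return saved


wanted = ["4hhb", "1crn"]
urls = [pdb_url(pdb_id) for pdb_id in wanted]
print(urls)
# download_pdb_files(wanted)  # uncomment to fetch the files
```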

Full details are here …


Interacting with the RCSB Protein Data Bank


The RCSB Protein Data Bank is an absolutely invaluable resource that provides archive-information about the 3D shapes of proteins, nucleic acids, and complex assemblies that helps scientists understand all aspects of biomedicine and agriculture, from protein synthesis to health and disease. Currently the PDB contains over 134,000 data files containing structural information on 42547 distinct protein sequences of which 37600 are human sequences. They also provide a series of tools to search, view and analyse the data.

The latest addition to the Hints and Tutorials page is a couple of Vortex scripts for interacting with the RCSB Protein Data Bank. Specifically, they search for PDB structures associated with a list of UniProt codes, and then search for associated information. Read more here…


Predicting sites of metabolism Vortex script


It is really useful to have two site-of-metabolism tools available that use contrasting methodologies. FAME 2 uses a curated dataset of experimentally determined metabolism data to build a machine learning model using simple descriptors. In contrast, SMARTCyp uses precomputed activation energies from density functional theory (DFT) calculations of model compounds.

I previously wrote a script displaying the results of a SMARTCyp calculation in a webview. The first part of the script imports the smartcyp.jar; however, with each update I was finding issues, so I thought it might be better to simply treat SMARTCyp as a command line application and use subprocess to access it.
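The general pattern of treating a jar file as a command line tool can be sketched as follows. This is an illustration in standard Python 3, not the Vortex script: the helper names, the file names, and the argument order (`java -jar <jar> <input file>`) are assumptions, so check the tool's own documentation for the exact options. The demonstration call runs the Python interpreter itself so that the example works without Java installed.

```python
import subprocess
import sys


def build_jar_command(jar_path, input_file, extra_args=()):
    """Assemble an argument list for running a jar on an input file."""
    return ["java", "-jar", jar_path] + list(extra_args) + [input_file]


def run_command(command):
    """Run a command, raise if it fails, and return its standard output."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical file names; the argument order is an assumption.
command = build_jar_command("smartcyp.jar", "molecules.sdf")
print(command)

# Stand-in command so the sketch can be run anywhere.
output = run_command([sys.executable, "-c", "print('ok')"])
print(output.strip())
```

Passing the command as a list rather than a single string avoids problems with spaces in file paths.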

Using a similar script we can also access FAME 2.

More details here.



Accessing Jupyter Notebook model from Vortex

Chemical Drawing Programs – The Comparison of Accelrys (Symyx) Draw, ChemDraw, DrawIt, ACD/ChemSketch, ChemDoodle and Chemistry 4-D Draw

There is also a comparison of six chemical drawing packages here.