Scripting Vortex 24
In Scripting Vortex 22 I described how to script multiple substructure searches using SMARTS. This sort of feature is useful on many occasions, for example if you want to flag molecules that contain reactive functional groups, toxicophores, or PAINS functional groups that have been shown to interfere with a variety of screens. Alternatively, if you have a drug discovery project with multiple chemotypes, you might want to tag particular groups of compounds as belonging to a named series to aid analysis.
Whilst the original script was certainly very useful, it was rather slow on large workspaces: calculating all the PAINS filters for the 1.4M structures in ChEMBL took around 1 hour. Over the last couple of months I've been working with Dan Ormsby of Dotmatics to improve the speed of searching. The latest builds (greater than 37278) of Vortex contain a new multi-core cheminformatics engine, enabled in the preferences panel, that greatly improves search speeds. I always think one of the best ways to evaluate new features is to really stress test them!
The results below illustrate my findings with several freely available large datasets of molecules.
PAINS script speed tests
Jonathan B. Baell and Georgina A. Holloway published a very interesting paper on their analysis of frequent hitters from screening assays (DOI). More recently, Walters et al. (DOI) proposed an additional set of PAINS based on observations from a sulfhydryl-scavenging high-throughput screen. These structural motif definitions were combined into a single script that now contains a total of 487 SMARTS definitions. It should be noted that the filters cannot be truly comprehensive, since they can only test the chemical space encompassed by the original compound collection. As shown by the Walters paper, additional PAINS will certainly be identified as novel chemical space is explored. The table below shows the time taken for the modified script to annotate several datasets of varying sizes; times are in hours:mins:seconds format.
UPDATE. There have now been a couple of new publications describing the identification of false positives in high-throughput screening campaigns in which the binding of glutathione S-transferase (GST) to glutathione (GSH) is used for detection of GST-tagged proteins.
There have also been some suggestions as to how some of the motifs might be interfering with the assay.
Identification of Small-Molecule Frequent Hitters of Glutathione S-Transferase–Glutathione Interaction [DOI]
Identification of Small-Molecule Frequent Hitters from AlphaScreen High-Throughput Screens [DOI]
I've now added the additional structural motif definitions, taking the total to 550 SMARTS definitions. It is perhaps worth mentioning that some of these motifs may not be an issue when using alternative screening technologies, but it may be very worthwhile to double check any molecules flagged by this script before committing significant resources to follow-up.
IncCalc SMILES refers to the total time taken, including calculating and indexing SMILES; PreCalc SMILES refers to the search using pre-calculated SMILES. NT is Not Tested.
|Database||Number of Molecules||IncCalc SMILES||PreCalc SMILES|
It is worth highlighting the speed of this pattern matching: there are 487 SMARTS queries in the filter, which means the process is running at 2.36M matches per second!
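A matches-per-second figure of this kind is presumably the number of molecules multiplied by the number of SMARTS queries, divided by the elapsed time in seconds. Here is a minimal sketch of that arithmetic. The helper names and the example run (1,000,000 molecules searched in 5 minutes) are hypothetical and are not taken from the table.

```python
def hms_to_seconds(hms):
    """Convert an 'hours:mins:seconds' string into a number of seconds."""
    hours, mins, secs = (int(part) for part in hms.split(":"))
    return hours * 3600 + mins * 60 + secs


def matches_per_second(n_molecules, n_patterns, elapsed_hms):
    """Every pattern is tested against every molecule, so the number of
    comparisons is n_molecules * n_patterns."""
    return n_molecules * n_patterns / float(hms_to_seconds(elapsed_hms))


# Hypothetical run, for illustration only (not a measured result).
rate = matches_per_second(1000000, 487, "0:05:00")
print("%.2fM matches per second" % (rate / 1e6))  # prints "1.62M matches per second"
```

Plugging in the molecule count and the PreCalc SMILES time for any row of the table gives the equivalent figure for that dataset.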
To achieve this search speed, indexed SMILES are used. This means that even if you open a file in SDF format, a column containing the SMILES will need to be created. The first part of the script creates a column containing a SMILES string representation of each molecule, but for larger datasets it may actually be faster to calculate the SMILES first using the Tools menu. For the ZINC dataset of just over 22M molecules, generating the SMILES from the Tools menu took just over 30 mins, compared to approaching an hour using the script.
Another point worth noting for these large files is that even if you open a compressed SDF file (filename.sdf.gz), the file is first uncompressed and stored in a temp folder. Vortex then just keeps the byte-wise seek positions into the uncompressed file and fetches MOL files on demand from disk. Disk space can thus appear to disappear! Importing SMILES is probably a better option in the longer term.
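The general idea of keeping byte offsets and fetching records on demand can be sketched in a few lines of Python. This only illustrates the technique and is not how Vortex implements it. The function names index_sdf and fetch_record, and the two placeholder records (which are not valid MOL blocks), are invented for the example.

```python
import os
import tempfile


def index_sdf(path):
    """Return the byte offset at which each record in an SDF file starts.
    Records are terminated by a line containing only '$$$$'."""
    offsets = []
    with open(path, "rb") as handle:
        start = 0
        while True:
            line = handle.readline()
            if not line:
                break
            if line.strip() == b"$$$$":
                offsets.append(start)
                start = handle.tell()
    return offsets


def fetch_record(path, offsets, i):
    """Seek straight to record i and read only that record from disk."""
    with open(path, "rb") as handle:
        handle.seek(offsets[i])
        end = offsets[i + 1] if i + 1 < len(offsets) else os.path.getsize(path)
        return handle.read(end - offsets[i]).decode("ascii")


# Tiny placeholder file, for illustration only.
record_a = "mol_A\n  placeholder\n\nM  END\n$$$$\n"
record_b = "mol_B\n  placeholder\n\nM  END\n$$$$\n"
with tempfile.NamedTemporaryFile(suffix=".sdf", delete=False) as tmp:
    tmp.write((record_a + record_b).encode("ascii"))
    demo_path = tmp.name

demo_offsets = index_sdf(demo_path)
second = fetch_record(demo_path, demo_offsets, 1)
print(second.splitlines()[0])  # prints "mol_B"
os.remove(demo_path)
```

Seeking like this needs a plain file on disk, since a gzip stream cannot be jumped into at an arbitrary byte, which is consistent with the uncompressed temp copy described above.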
University of Dundee NTD Screening Library Filters
ChemMedChem, 3(3): 435-444. Compounds containing unwanted functionalities were removed as it is not desirable to waste resources removing such functionalities in the hit list. These included potentially mutagenic groups such as nitro groups; groups likely to have unfavourable pharmacokinetic properties such as sulphates and phosphates; and reactive groups such as 2-halopyridines or thiols.
This script contains 101 SMARTS definitions.
|Database||Number of Molecules||IncCalc SMILES||PreCalc SMILES|