Macs in Chemistry

Insanely great science

 

A Review of CheS-Mapper

CheS-Mapper (Chemical Space Mapper) is an open source 3D viewer for chemical datasets of small molecules. A recent publication in the Journal of Cheminformatics describes the application (DOI: 10.1186/1758-2946-4-7), and more information is available on the wiki page. Whilst there are many applications for the visual analysis of data, very few provide the tools needed to handle chemical structures. CheS-Mapper is a Java application that runs under Mac OS X (I only tested Lion). It is based on the Java libraries Jmol, CDK and WEKA, makes use of OpenBabel and R, and provides an interesting means to explore chemical data sets. The application (32 MB) can be downloaded here; it requires that OpenBabel and R are installed independently.

I ran into out-of-memory errors when using data sets of more than 2000 molecules, but you can get around that by allocating more memory. Rather than double-clicking on the application to open it, launch it from a Terminal command that sets the maximum Java heap size with -Xmx, as shown below.

java -Xmx2056m -jar /Users/username/Downloads/ches-mapper-complete.jar

The other advantage of doing this is you get a nice log of the processes displayed in the Terminal window as shown below.

Loading dataset file> Loading dataset: caco2.sdf
read dataset file '/Users/username/Downloads/caco2.sdf' with cdk done (100 compounds found)
Loading dataset file> finished
loaded 50 cdk descriptors
babel > /usr/local/bin/babel -L descriptors
Chemical space mapping> Compute 3d compound structures
Chemical space mapping> Compute features
Chemical space mapping> Computing feature 1/7 : Hydrogen Bond Acceptors
writing cdk props to: /Users/username/.ches-mapper/Users/username/Downloads/caco2.52a4a3775e4a7d807aaa37b45db1e72e.Hydrogen+Bond+Acceptors
Chemical space mapping> Computing feature 2/7 : Hydrogen Bond Donors
writing cdk props to: /Users/username/.ches-mapper/Users/username/Downloads/caco2.52a4a3775e4a7d807aaa37b45db1e72e.Hydrogen+Bond+Donors
Chemical space mapping> Computing feature 3/7 : Molecular Weight
writing cdk props to: /Users/username/.ches-mapper/Users/username/Downloads/caco2.52a4a3775e4a7d807aaa37b45db1e72e.Molecular+Weight
Chemical space mapping> Computing feature 4/7 : XLogP
writing cdk props to: /Users/username/.ches-mapper/Users/username/Downloads/caco2.52a4a3775e4a7d807aaa37b45db1e72e.XLogP
Chemical space mapping> Computing feature 5/7 : Topological Polar Surface Area
writing cdk props to: /Users/username/.ches-mapper/Users/username/Downloads/caco2.52a4a3775e4a7d807aaa37b45db1e72e.Topological+Polar+Surface+Area
Chemical space mapping> Computing feature 6/7 : OpenBabel FP4
babel > /usr/local/bin/babel -V
computing structural fragment CDK 10 true
ob-fingerprints > /usr/local/bin/babel  -isdf  /Users/username/Downloads/caco2.sdf  -ofpt  -xf  FP4  -xs
100 molecules converted
2074 audit log messages 
Chemical space mapping> Computing feature 7/7 : OpenBabel Linear Fragments (FP2)
computing structural fragment CDK 10 true
ob-fingerprints > /usr/local/bin/babel  -isdf  /Users/username/Downloads/caco2.sdf  -ofpt  -xf  FP2  -xs
100 molecules converted
1880 audit log messages 
Chemical space mapping> Num features computed: 270
compute smiles..  ..done, store: /Users/username/.ches-mapper/Users/username/Downloads/caco2.52a4a3775e4a7d807aaa37b45db1e72e.smiles
Chemical space mapping> Compute clusters
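The log is regular enough to summarise with a short script, which is handy when comparing runs. The following is only a minimal sketch: it assumes the "Computing feature i/n : name" line format seen in the log above, and the function name `feature_steps` is my own invention, not part of CheS-Mapper.

```python
import re

# Matches lines such as:
#   Chemical space mapping> Computing feature 3/7 : Molecular Weight
# (format assumed from the Terminal log shown above)
FEATURE_LINE = re.compile(r"Computing feature (\d+)/(\d+) : (.+)$")


def feature_steps(log_text):
    """Return a list of (index, total, name) for each feature step in the log."""
    steps = []
    for line in log_text.splitlines():
        match = FEATURE_LINE.search(line.strip())
        if match:
            index, total, name = match.groups()
            steps.append((int(index), int(total), name.strip()))
    return steps


if __name__ == "__main__":
    sample = "\n".join([
        "Chemical space mapping> Computing feature 1/7 : Hydrogen Bond Acceptors",
        "loaded 50 cdk descriptors",
        "Chemical space mapping> Computing feature 2/7 : Hydrogen Bond Donors",
    ])
    for index, total, name in feature_steps(sample):
        print(f"{index}/{total}: {name}")
```

To use it on a real run, redirect the Terminal output to a file and pass the file contents to `feature_steps`.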

Importing data

There are a couple of datasets that you can download and use; alternatively you can import your own dataset. I used my own dataset of 1625 compounds with known hERG activity, together with 17 data fields for each compound, in SDF file format; the file was imported in 10 seconds (MacBook Pro). Looking at the documentation, a variety of file formats are supported. The documentation does not mention SMILES, but apparently support has been added and the documentation has not yet been updated. The workflow is shown in the image below. The first step is to create 3D structures (if not available in the file); this can be done with either the Chemistry Development Kit (CDK) (using MM2 or MMFF94) or OpenBabel (using MMFF94) structure generators. I found this was rather slow, and in subsequent runs I created 3D structures first using MOE. Checking with Activity Monitor, it seems that CheS-Mapper is only able to use a single core at a time for the 3D structure generation. If the 3D structures have been derived from a docking study or aligned using an application like FieldAlign, it is probably better to retain the original coordinates rather than generate a new 3D structure.

workflow
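Since 3D generation is the slow step, it can be worth checking beforehand whether an SD file already carries 3D coordinates. The sketch below is a rough heuristic rather than anything CheS-Mapper does itself: it assumes V2000 molblocks with the standard fixed-width atom lines (x, y and z in three 10-character columns), does not handle V3000, and simply treats a record as 3D if any atom has a non-zero z coordinate. A genuinely planar molecule lying in the xy plane would be reported as 2D. The function names are hypothetical.

```python
def split_sdf(sdf_text):
    """Split SDF text into records on the '$$$$' delimiter lines."""
    records, current = [], []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


def has_3d_coordinates(molblock, tolerance=1e-4):
    """Return True if any atom in a V2000 molblock has a non-zero z value.

    Assumes three header lines, then the counts line, then the atom block.
    """
    lines = molblock.splitlines()
    if len(lines) < 4:
        return False
    atom_count = int(lines[3][0:3])  # first 3 columns of the counts line
    for atom_line in lines[4:4 + atom_count]:
        z = float(atom_line[20:30])  # z occupies columns 21-30
        if abs(z) > tolerance:
            return True
    return False
```

Calling `has_3d_coordinates` on each record returned by `split_sdf` gives a quick count of how many structures would still need 3D generation.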

Selecting Features and Descriptors

The next step is to select the features present in the original file. The extract features dialog allows the user to see the distribution of each property in the data set, displayed as a histogram. The same dialog allows the user to select descriptors that can be calculated using either OpenBabel or CDK. These include relatively simple features like molecular weight or the number of rotatable bonds, more sophisticated chemical descriptors like LogP, van der Waals volume or TPSA, and structural fragment descriptors such as MACCS, linear path-based fingerprints or ToxTree rules describing various toxicities.

chesmapper1
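The property histograms in the dialog can be reproduced outside the application if you want to sanity-check a data field before importing. This is a minimal sketch of ordinary equal-width binning; I do not know how CheS-Mapper chooses its bins, so the default of 10 bins and the function name `histogram` are assumptions of mine.

```python
def histogram(values, bins=10):
    """Count values in `bins` equal-width bins between min and max.

    Returns a list of (bin_start, bin_end, count) tuples.
    """
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: a single bin holds everything.
        return [(lo, hi, len(values))]
    width = (hi - lo) / bins
    counts = [0] * bins
    for value in values:
        # The maximum value would fall just past the last bin, so clamp it.
        index = min(int((value - lo) / width), bins - 1)
        counts[index] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(bins)]


if __name__ == "__main__":
    # Made-up molecular weights, for illustration only.
    weights = [180.2, 250.3, 310.4, 320.1, 455.6, 470.0]
    for start, end, count in histogram(weights, bins=3):
        print(f"{start:7.1f} to {end:7.1f}: {'#' * count}")
```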

Clustering

The next option is clustering, which has a number of benefits. Firstly, it provides an early indication that there are groups of molecules within the dataset that have similar properties. It also aids the visualisation, in that it provides an easy way to limit the number of molecules displayed at any time. By default simple k-means clustering is employed, but the advanced options allow the user to select alternative clustering algorithms. The clustering methods include k-means, Cascade k-means, Hierarchical clustering, Expectation maximisation, Cobweb, and Farthest first; some versions of the algorithms depend on the presence of R.
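For readers unfamiliar with k-means, the idea is simple enough to show in a few lines. This is a plain textbook sketch (Lloyd's algorithm) on lists of feature values, not the WEKA or R implementation that CheS-Mapper actually calls; the function name, the random initialisation and the fixed seed are my own assumptions.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points (lists of floats) into k groups with Lloyd's algorithm.

    Returns (labels, centroids), where labels[i] is the cluster of points[i].
    """
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    # Two made-up groups of compounds described by two features each.
    data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
            [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
    labels, centroids = kmeans(data, k=2)
    print(labels)
    print(centroids)
```

In practice the features would need scaling first, since a descriptor like molecular weight would otherwise dominate the distance calculation.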

The final options are 3D embedding using principal component analysis (PCA) and alignment of the molecules using the Maximum Common Subgraph (MCS). First, the MCS of each cluster is computed. This is computationally intensive and will take quite a long time for large clusters (the runtime is O(n²)). Second, the compounds of each cluster are aligned according to their MCS, so their orientation in 3D space is adjusted such that the common substructure is superimposed; this uses the CDK. If all the structures in the data set share a common structural feature (e.g. all substituted indoles), it is possible to provide a SMARTS string onto which all the structures will be aligned.
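To give a feel for what the PCA embedding does, here is a small sketch that projects a feature table onto its first three principal components, which become the x, y and z positions of the compounds. It is an illustration only: it finds the components by power iteration with deflation in pure Python, which is my choice for a self-contained example and not how CheS-Mapper computes them, and the function names are hypothetical.

```python
import math
import random


def mat_vec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def pca_embed(rows, dims=3, iterations=200, seed=0):
    """Project rows onto their first `dims` principal components.

    Returns one list of `dims` coordinates per row. Trailing coordinates
    are 0.0 when the data has fewer than `dims` directions of variance.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centred = [[x - m for x, m in zip(row, means)] for row in rows]
    # Sample covariance matrix of the centred features (d x d).
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    total_variance = sum(cov[i][i] for i in range(d))
    rng = random.Random(seed)
    components = []
    while len(components) < min(dims, d):
        vec = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        has_variance = True
        for _ in range(iterations):  # fixed number of power-iteration steps
            nxt = mat_vec(cov, vec)
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm <= 1e-9 * total_variance:
                has_variance = False  # nothing left to explain
                break
            vec = [x / norm for x in nxt]
        if not has_variance:
            break
        eigenvalue = sum(v * w for v, w in zip(vec, mat_vec(cov, vec)))
        components.append(vec)
        # Deflation: remove the component just found from the covariance.
        cov = [[cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    padding = [0.0] * (dims - len(components))
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components] + padding
            for row in centred]
```

Each compound is passed in as a list of numeric feature values, and the three numbers returned for it are its position in the embedded space. The sign of each axis is arbitrary, as it is with any PCA.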

With it all set up, the application then generates 3D structures, extracts and/or calculates descriptors, and performs the 3D overlay. The wiki provides details of the algorithm runtimes. As I mentioned earlier, generation of the 3D structures is very slow, particularly using OpenBabel, so it is well worth thinking about generating 3D structures beforehand. Generation of the descriptors and fingerprints is generally pretty fast, but the clustering can be slow.

Visualisation

Once the calculations are complete a Jmol-powered display opens that allows the user to explore the data set. You can select individual clusters and the other clusters will fade from view. You can also colour code based on a feature; the features can be those already present in the imported file or those calculated in the initial phase. A histogram is also displayed showing the distribution of properties and the number of compounds in both the full data set and the selected cluster. The display updates, rotates and zooms really smoothly, even with several thousand structures. The structures can also be labelled with the currently displayed feature. For categorical features such as active/inactive, the molecules are colour coded according to which category they are in.

CheSMapper2

For larger datasets, whilst the clustering allows some simplification of the display, I still found it rather crowded, particularly with high molecular weight structures, and I think an option to switch off the structures and only display points would be useful. It would also be useful to be able to select a bar in the histogram and highlight the corresponding structures, or to manually select a group of compounds.

I really liked the way in which the user is guided through the workflow. A nice addition might be the option to store workflows so that they can be reused as a program generates more compounds.

There is a very nice brief video tutorial available.

Several other applications are available for the analysis of chemical data sets; these include Vortex, Stardrop and Instant JChem.

There is a list of software reviews here

Updated 10 May 2012