This tutorial was kindly provided by Greg, a Macinchem reader.
OVERVIEW AND APPLICATION OF KNIME AND CDKDescUI.jar
KNIME [1], the Konstanz Information Miner, is a visual platform for graphically building and editing workflows and data analysis pipelines from predefined components called nodes. KNIME is developed by Prof. Michael Berthold at the University of Konstanz, Germany. It can be downloaded for free from www.knime.org. KNIME is written in Java and built on the Eclipse integrated development environment. Versions are available for Windows, Linux and now the Macintosh. A distinctive aspect of KNIME is its emphasis on chemistry and chemoinformatics through the incorporation of the Chemistry Development Kit (CDK) project. The standard data analysis and file manipulation tools that are distributed with KNIME include:
• I/O nodes to read and write data from files and databases
• Data manipulation nodes to manage data in tables, such as joining, filtering and partitioning
• Plotting and chart tools
• Statistics and data mining tools, like clustering and machine learning
This overview/application uses KNIME with another new tool that could be used in drug discovery, CDKDescUI.jar [2], which calculates molecular descriptors for use in predictive model building.
As an introduction to KNIME, let's take a look at an example where we have some molecules with ACTIVE/INACTIVE class biological screening data. Suppose we want to build an in silico predictive model to perform “Virtual Screening”, i.e. to predict activity for a new set of potential compounds to assay. I chose a literature dataset from a paper by Jorissen and Gilson [3] titled “Virtual Screening of Molecular Databases Using a Support Vector Machine”. The data files are available online at: http://www.cheminformatics.org/datasets/index.shtml. You can download the structures and try this yourself. The structures are in tar/gzipped SDF format. Uncompress the structure file using ‘StuffIt Expander’ for the Mac. You will obtain several files.
I sampled 500 structures from “compoundsODD.sdf” as a model training set and 160 from “compoundsEVEN.sdf” as a test set. From each I labeled the COX2 compounds as “ACTIVE” and the rest as “INACTIVE”. I created a plain text file with two columns: the first is the compound name and the second is the activity class, an “A” for ACTIVE and an “I” for INACTIVE. This file is used later on to train the model we will build. NOTE: The files are in Windows/DOS format, so each line ends with a ‘carriage-return’ (seen as ^M), and some of the molecules have padding spaces at the end of each line. Both caused problems with the processing, so I used the text editor ‘Vim’ to clean them up.
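If you would rather script the cleanup than do it in Vim, the following is a minimal Python sketch of the same idea: convert the DOS line endings and strip the trailing padding. The function name and the sample fragment are my own illustration, not part of the dataset or of any tool mentioned here.

```python
def clean_sdf_text(text: str) -> str:
    """Normalize DOS line endings and strip trailing padding from each line."""
    # Convert CRLF (and any stray CR) to LF so that no ^M characters remain.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Strip trailing spaces/tabs only; leading whitespace is left untouched
    # because it is significant in the fixed-width parts of an SD file.
    lines = [line.rstrip(" \t") for line in text.split("\n")]
    return "\n".join(lines)


# Hypothetical fragment with DOS line endings and padded lines.
raw = "mol_001   \r\n  CDK\r\n\r\n$$$$  \r\n"
cleaned = clean_sdf_text(raw)
print(repr(cleaned))
```

For a real file, read it in binary-safe fashion (for example with open(path, newline="") so Python does not translate the line endings first), pass the text through the function, and write the result to a new file rather than overwriting the original.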
CALCULATING DESCRIPTORS
To calculate descriptors for each molecule in the test and training sets, run the CDKDescUI.jar [2] program, browse to the input SDF file, then designate a new text file to capture the output descriptors. This program can calculate over 490 descriptors for each molecule; for this example, however, I chose a subset of my favorites to calculate, as shown below. The test set was calculated with the same set of descriptors. I now have two files with over 200 columns of descriptors for each compound. See example results below.
DATA EXPLORATION IN KNIME
A simple workflow to examine the distribution of the descriptors can easily be set up and run interactively. It involves just two nodes, the “File Reader” and the “Histogram (interactive)” nodes. The “File Reader” is found under IO in the Node Repository and the “Histogram (interactive)” node is under Data Views.
For example, running this simple workflow on the previously calculated training descriptor file and examining the ‘ALOGP’ column yields a distribution histogram for all the ALOGP values. One can explore any number of columns in the same way.
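To make clear what the histogram node is computing, here is a rough Python sketch of equal-width binning for one descriptor column. It assumes a comma-delimited descriptor file with a header row; the “Title” column name and the ALOGP values are made up for illustration, and KNIME's own binning options may differ.

```python
import csv
import io


def column_values(table_text, column, delimiter=","):
    """Return the numeric values of one named column, skipping unparseable cells."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter=delimiter)
    values = []
    for row in reader:
        try:
            values.append(float(row[column]))
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric entry (e.g. "NA")
    return values


def histogram(values, n_bins):
    """Count values in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width if all values are equal
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return lo, width, counts


# Hypothetical descriptor table; real files have many more columns.
sample = "Title,ALOGP\nmol_001,0.0\nmol_002,1.0\nmol_003,1.5\nmol_004,NA\nmol_005,4.0\n"
alogp = column_values(sample, "ALOGP")
lo, width, counts = histogram(alogp, 4)
for i, n in enumerate(counts):
    print(f"{lo + i * width:5.1f} to {lo + (i + 1) * width:5.1f}: {'#' * n}")
```

The unparseable “NA” entry is skipped rather than treated as zero, which is usually what you want when looking at a distribution.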
MODEL BUILDING IN KNIME
The workflow below builds a Random Forest model from the descriptors and ACTIVE/INACTIVE class labels, which are input as separate files. A “Joiner” node puts the two together and the Weka “RandomForest” (RF) node builds the model. (I am assuming the reader is familiar with the RF model building steps and will explore the options available to build a validated model.) Another node from the Weka set, “Predictor”, accepts the model builder output as input, so that additional compounds with descriptors can be predicted ACTIVE or INACTIVE. Finally, to check how we did, the class labels for the test set are input and “joined” to feed into an “Interactive Table” viewer and a “Scorer” node, which show the results in tabular form and create a confusion matrix for the test set predictions. Results can be written to ASCII files, PDFs, HTML reports and Excel files. Once you get the feel of how KNIME works, there are many, many nodes available to explore data. I recommend the QuickStart Tutorial for beginners on the KNIME web site as a place to start.
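The joining step is simple to picture outside KNIME. The sketch below matches descriptor rows to class labels by compound name and keeps only compounds present in both tables (an inner join, which is my assumption about how the node is configured). The names and values are hypothetical.

```python
def inner_join(descriptors, labels):
    """Attach a class label to each descriptor row, matching on compound name."""
    joined = {}
    for name, row in descriptors.items():
        if name in labels:  # compounds without a label are dropped
            joined[name] = {**row, "class": labels[name]}
    return joined


# Hypothetical inputs: descriptor rows and A/I labels keyed by compound name.
descriptors = {
    "mol_001": {"ALOGP": 1.2},
    "mol_002": {"ALOGP": 3.4},
    "mol_003": {"ALOGP": 0.7},
}
labels = {"mol_001": "A", "mol_002": "I", "mol_004": "I"}

training_table = inner_join(descriptors, labels)
for name, row in training_table.items():
    print(name, row)
```

Here mol_003 has no label and mol_004 has no descriptors, so neither reaches the model builder. This is also why stray padding in the compound names matters: a name with trailing spaces will silently fail to match.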
Example test set predictions are shown from the “Interactive Table” and the Confusion Matrix from the “Scorer”. No attempt was made to optimize the RF modeling parameters in this example.
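For readers who have not met a confusion matrix before, this small Python sketch shows how one is tallied from actual and predicted class labels. The labels below are invented for illustration and are not the predictions from this workflow; the function names are mine, not KNIME's.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes=("A", "I")):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


def accuracy(matrix):
    """Fraction of predictions on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Made-up labels: "A" = ACTIVE, "I" = INACTIVE.
actual = ["A", "A", "I", "I", "I", "A"]
predicted = ["A", "I", "I", "I", "A", "A"]

matrix = confusion_matrix(actual, predicted)
print("        pred A  pred I")
for cls, row in zip(("A", "I"), matrix):
    print(f"actual {cls} {row[0]:6d} {row[1]:7d}")
print(f"accuracy = {accuracy(matrix):.2f}")
```

With a dataset like this one, where actives are a small minority, overall accuracy can look good even when most actives are missed, so it is worth reading the individual cells of the matrix rather than the single summary number.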
Once the model parameters are tuned and validated with appropriate test sets, the model can be used to “virtually screen” as many compounds as can be run through the descriptor calculator and subsequently predicted with the RF model. The results I show above are not validated and are included for illustration only. I hope this brief overview gives a good idea of how KNIME workflows can be set up to do Virtual Screening. Download your own copy and take a look. KNIME is a powerful data mining and modeling tool.
Greg has kindly donated a zipped workflow that can be downloaded here: RF_model.zip
1. KNIME, Konstanz Information Miner, Univ. Konstanz, http://www.knime.org/
2. CDKDescUI.jar: Java program to calculate molecular descriptors, found at: http://rguha.net/code/java/cdkdesc.html
3. Robert N. Jorissen and Michael K. Gilson. J. Chem. Inf. Model., 2005, 45 (3), 549-561.