Macs in Chemistry

Insanely great science

 

Analysis of Categories

I often need to tag individual molecules within a dataset with a specific property, perhaps the results of clustering algorithms, PAINS filtering, or liver toxicity filters. Alternatively, if you have a drug discovery project with multiple chemotypes, you might want to tag particular groups of compounds as belonging to a named series to aid analysis.

[Image: categories]

A question that might then arise is “How many molecules belong to each category?” Whilst you can see the numbers in the sidebar, there is no easy way to export the results.

[Image: pains]

This script evolved after discussions with Dan and Matt. It allows you to create a new workspace containing the category information.

The first part of the script lets the user select the categorical column; we then retrieve the column and its name. The script treats index 0 in the combo box as "no column chosen", hence the input_idx - 1.

col = vtable.getColumn(input_idx - 1)
colName =  vtable.getColumnName(input_idx - 1)

We then use a defaultdict. It works like a normal dictionary, but it is initialized with a function (the “default factory”) that takes no arguments and provides the default value for a nonexistent key. Here the factory is int, which returns 0, so every new category starts at zero. Without defaultdict, we would need to check whether a category had already been seen before adding 1.

answer = defaultdict(int)
for r in range(0, int(rows)):
    k = col.getValueAsString(r)
    answer[k] += 1
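
To make the difference concrete, here is a minimal sketch in plain Python, run outside Vortex, that counts the same labels with an ordinary dictionary and with a defaultdict. The list `labels` and both function names are made up for illustration; in the script itself the values come from `col.getValueAsString(r)`.

```python
from collections import defaultdict

# Hypothetical category labels standing in for the values of a Vortex column
labels = ["PAINS", "clean", "PAINS", "liver tox", "clean", "PAINS"]


def count_with_dict(values):
    """Count occurrences with a plain dict (illustrative helper)."""
    counts = {}
    for value in values:
        # A plain dict raises KeyError on a missing key, so check first
        if value not in counts:
            counts[value] = 0
        counts[value] += 1
    return counts


def count_with_defaultdict(values):
    """Count occurrences with defaultdict(int) (illustrative helper)."""
    counts = defaultdict(int)
    for value in values:
        # A missing key is created with int() == 0 before the increment
        counts[value] += 1
    return counts


plain = count_with_dict(labels)
auto = count_with_defaultdict(labels)
```

Both functions produce the same counts; the defaultdict version simply drops the membership test.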

Finally, we sort by decreasing count and create a new workspace with two columns: the first contains the categories and the second the number of occurrences of each category, as shown in the examples below.

[Images: painsclusters, livertoxoutput, clusteroutput]
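
The sorting step can be tried on its own in plain Python. The sketch below uses a small made-up dictionary, `counts`, in place of the defaultdict the script builds; the first expression is the one used in the script, and the second is an alternative spelling that should give the same ordering.

```python
# Hypothetical counts standing in for the defaultdict built by the script
counts = {"clean": 2, "PAINS": 3, "liver tox": 1}

# As in the script: negating the count makes the ascending sort
# return the most populated category first
rows = sorted([[k, v] for k, v in counts.items()], key=lambda x: -x[1])

# Alternative: sort on the count itself and ask for descending order
rows_alt = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

for name, count in rows:
    print("%s\t%d" % (name, count))
```

Each element of `rows` is a `[category, count]` pair, which is the row layout the script passes on to `arrayToWorkspace`.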

I’ve tested the script on a data set of 161,000 molecules, and it took less than 1 second to complete for a variety of category types.

The Vortex Script

#Analysis of categorical information
#Authored by Chris Swain (http://www.macinchem.org)
#All rights reserved.

import jarray
import binascii
import string
import os
from collections import defaultdict

input_label     = swing.JLabel("Cluster Column (for input)")
input_cb    = workspace.getColumnComboBox()

panel = swing.JPanel()

layout.fill(panel, input_label, 0, 0)
layout.fill(panel, input_cb,    1, 0)

ret = vortex.showInDialog(panel, "Choose Cluster column")

if ret == vortex.OK:
    input_idx = input_cb.getSelectedIndex()

    if input_idx == 0:
        vortex.alert("you must choose a column")
    else:

        col = vtable.getColumn(input_idx - 1)
        colName =  vtable.getColumnName(input_idx - 1)

        # defaultdict has key-value pairs with a default 0 making it easier to increment by one
        answer = defaultdict(int)

        # For each row in table, increment the count value for the cluster key by 1
        rows = vtable.getRealRowCount()
        for r in range(0, int(rows)):
            k = col.getValueAsString(r)

            answer[k] += 1

        # This sorts by count decreasing    
        answer = sorted([[k, v] for k, v in answer.items()], key=lambda x: -x[1])

        # Output to new table
        arrayToWorkspace(answer,[colName, 'COUNT'], 'Cluster Analysis Output')

The script can be downloaded from https://macinchem.org/reviews/vortex_scripts/GenericClusterAnalysis.vpy.zip

Page Updated 28 September 2016