Counting identical structures in two datasets
Sometimes I have two datasets and I just want to know the overlap of identical structures. This script counts the number of identical structures by comparing InChIKeys. We start by reading two files into separate workspaces.
The next part of the script generates the InChIKey for each molecule in both workspaces. We then check for duplicates, first within each table and then between the two tables. A new workspace is then generated with the results as shown below.
The figure in the top left of the matrix (1137) is the number of unique structures in "PublishedFragments"; the figure in the bottom right (1500) is the number of unique structures in "DiverseFragmentLibrary". For "PublishedFragments" this is actually fewer than the number of rows in the workspace, because that file contains a number of duplicate structures. The figure of 120, which appears in the other two cells, is the number of identical structures shared between the two datasets.
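The same bookkeeping can be tried outside Vortex. What follows is only a minimal sketch in plain Python, assuming the InChIKeys are already available as two lists of strings. The helper names `group_rows_by_key` and `overlap_matrix` and the placeholder keys are hypothetical and are not part of the Vortex script.

```python
from collections import defaultdict


def group_rows_by_key(keys):
    """Map each key to the list of row indices on which it occurs."""
    groups = defaultdict(list)
    for row, key in enumerate(keys):
        groups[key].append(row)
    return groups


def overlap_matrix(keys_a, keys_b):
    """Return a 2x2 matrix: unique counts on the diagonal, shared count elsewhere."""
    groups_a = group_rows_by_key(keys_a)
    groups_b = group_rows_by_key(keys_b)
    # Keys present in both tables are the identical structures
    shared = [k for k in groups_a if k in groups_b]
    return [[len(groups_a), len(shared)],
            [len(shared), len(groups_b)]]


# Hypothetical placeholder keys standing in for real InChIKeys
table_a = ["KEY-A", "KEY-B", "KEY-A", "KEY-C"]
table_b = ["KEY-C", "KEY-D"]
matrix = overlap_matrix(table_a, table_b)
```

In this toy example the first list has four rows but only three unique keys, which mirrors why the count for "PublishedFragments" is smaller than the number of rows in its workspace.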
The Vortex Script
# A script to count identical structures in two workspaces

# Vortex imports
import com.dotmatics.vortex.util.Util as Util
import com.dotmatics.vortex.mol2img.jni.genImage as genImage
import com.dotmatics.vortex.mol2img.Mol2Img as mol2Img
import jarray
import binascii
import string
import os
import sys
from collections import defaultdict

mycount = vortex.getWorkspaceCount()  # number of workspaces
col_names = ["Database"]  # column headers for the results workspace
results = []  # 2D array, one row per dataset

# First workspace: generate InChIKeys and record the rows for each key
inchikeys1 = defaultdict(list)
vws = vortex.getWorkspace(0)
vtable1 = vws.getTable()
myname1 = vws.getName()
# vortex.alert(myname1)
col_names.append(myname1)
InChIKeyColumn = vtable1.findColumnWithName("InChIKey", 1)
rows = vtable1.getRealRowCount()
for r in range(0, int(rows)):
    try:
        mol = vtable1.molFileManager.getMolFileAtRow(r)
        inChIKey = vortex.getMolProperty(mol, 'InChIKey')
        InChIKeyColumn.setValueFromString(r, inChIKey)
        inchikeys1[inChIKey].append(r)
    except:
        pass  # skip rows where no InChIKey could be generated
vtable1.fireTableStructureChanged()

# Second workspace (the last one opened): same procedure
inchikeys2 = defaultdict(list)
vws = vortex.getWorkspace(mycount - 1)
vtable2 = vws.getTable()
myname2 = vws.getName()
col_names.append(myname2)
InChIKeyColumn = vtable2.findColumnWithName("InChIKey", 1)
rows = vtable2.getRealRowCount()
for r in range(0, int(rows)):
    try:
        mol = vtable2.molFileManager.getMolFileAtRow(r)
        inChIKey = vortex.getMolProperty(mol, 'InChIKey')
        InChIKeyColumn.setValueFromString(r, inChIKey)
        inchikeys2[inChIKey].append(r)
    except:
        pass  # skip rows where no InChIKey could be generated
vtable2.fireTableStructureChanged()

# Unique structures in each table (repeated InChIKeys collapse into one key)
len_unique1 = len(inchikeys1)
len_unique2 = len(inchikeys2)

# Looking for duplicates between the tables
dups = [k for k in inchikeys1 if k in inchikeys2]
len_dups = len(dups)
vortex.alert(str(len_dups))

row1 = [myname1, len_unique1, len_dups]
row2 = [myname2, len_dups, len_unique2]
results.append(row1)
results.append(row2)

# Create a new workspace with the results
arrayToWorkspace(results, col_names, 'Identical')
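Because the script stores the row numbers for every InChIKey, the same dictionaries could also be used to find out which rows are involved, not just how many. This is only a sketch in plain Python, assuming dictionaries shaped like `inchikeys1` and `inchikeys2` in the script. The helper names `duplicates_within` and `shared_rows` and the placeholder data are hypothetical.

```python
def duplicates_within(groups):
    """Return the keys that occur on more than one row of a single table."""
    return sorted(k for k in groups if len(groups[k]) > 1)


def shared_rows(groups_a, groups_b):
    """Return {key: (rows in table A, rows in table B)} for keys found in both."""
    return dict((k, (groups_a[k], groups_b[k]))
                for k in groups_a if k in groups_b)


# Hypothetical placeholder data shaped like inchikeys1 and inchikeys2
inchikeys1 = {"KEY-A": [0, 2], "KEY-B": [1]}
inchikeys2 = {"KEY-B": [0, 3], "KEY-C": [1]}

repeated_in_first = duplicates_within(inchikeys1)
in_both = shared_rows(inchikeys1, inchikeys2)
```

Here `repeated_in_first` holds the keys that appear more than once in the first table, and `in_both` maps each shared key to its row numbers in both tables.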
The script can be downloaded here http://macinchem.org/reviews/vortex_scripts/IdenticalStructures.vpy.zip
Last updated 7 March 2018