Flagging Duplicates Version 2
When working with multiple data sets of molecules, particularly when combining them from several sources, one of the most common tasks is the removal of duplicates. Done manually, this is a time-consuming and error-prone process, and this script should make it a much easier task.
The script uses InChIKeys to identify potential duplicate structures. The 27-character standard InChIKey is a hashed version of the full standard InChI and was designed to allow easy identification of chemical compounds. By contrast, standard InChI strings for large molecules may contain more than 1000 characters. For more details on the InChIKey see J Cheminform. 2013; 5: 7 DOI.
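To make the 27-character layout concrete, here is a minimal sketch in plain Python that splits a standard InChIKey into its blocks. It assumes the usual layout of 14 letters, a hyphen, 10 letters (8 hash letters, a standard/non-standard flag and a version letter), a hyphen and a final protonation letter. The function name describe_inchikey and the dictionary field names are made up for illustration and are not part of Vortex.

```python
import re

# Assumed layout: 14 letters, "-", 10 letters, "-", 1 letter (27 characters in total).
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def describe_inchikey(key):
    """Split an InChIKey into its blocks, or return None if the layout is wrong.

    Hypothetical helper for illustration only.
    """
    if INCHIKEY_PATTERN.match(key) is None:
        return None
    first, second, protonation = key.split("-")
    return {
        "connectivity_hash": first,        # hash of the molecular skeleton
        "other_layers_hash": second[:8],   # hash of the remaining InChI layers
        "is_standard": second[8] == "S",   # "S" marks a standard InChIKey
        "version": second[9],              # InChI version letter
        "protonation": protonation,        # "N" means neutral
    }


ethanol = describe_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA-N")
```

Because the key has a fixed length however large the molecule is, comparing two keys is a cheap string comparison.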
NOTE! There is a bug in some versions of Vortex such that the InChIKey is not generated; this was fixed in version 42289 and later.
The first part of the script creates two new columns, one for the InChIKey and the other for the duplicate flag. It then calculates the InChIKey for each row and populates the table.
In version 1 of the script, duplicate searching was achieved by taking each individual InChIKey and searching through the table, flagging those that are duplicates. Whilst this works, it is rather slow for large datasets. In an effort to improve performance, Matt rewrote part of the script.
In the original script, duplicates were identified by iterating over each molecule and comparing it pairwise to every other molecule. This has a time complexity of O(N^2), which means performance increasingly becomes an issue for larger collections of molecules. One way to improve this is to make use of a hash table type data structure, which allows us to store and look up values with O(1) complexity on average. A simple way to use this in our duplicates script is to build a Python dictionary that maps InChIKeys to row numbers as we generate the InChIKeys. Then we only have to iterate once over this dictionary of InChIKeys, and flag rows as duplicates where there is more than one row associated with a given InChIKey.
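The dictionary approach described above can be sketched outside Vortex in a few lines of plain Python. This is only an illustration: flag_duplicates is a hypothetical helper that takes a list of InChIKey strings (one per row) instead of a Vortex table, and returns a list of flags instead of writing to a column.

```python
from collections import defaultdict


def flag_duplicates(keys):
    """Return one flag per row: "Duplicate" if the key occurs on more than one row."""
    # Map each key to the list of row numbers where it appears.
    rows_by_key = defaultdict(list)
    for row, key in enumerate(keys):
        rows_by_key[key].append(row)

    # One pass over the dictionary: any key with more than one row is a duplicate.
    flags = [""] * len(keys)
    for key, rows in rows_by_key.items():
        if len(rows) > 1:
            for row in rows:
                flags[row] = "Duplicate"
    return flags


example_keys = [
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",  # ethanol
    "RYYVLZVUVIJVGH-UHFFFAOYSA-N",  # caffeine
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",  # ethanol again
]
flags = flag_duplicates(example_keys)
```

Each key is inserted once and each row is visited at most twice, so the work grows linearly with the number of rows rather than with its square.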
The Vortex Script
# A script to flag duplicate structures V2
#
# Vortex imports
import com.dotmatics.vortex.util.Util as Util
import com.dotmatics.vortex.mol2img.jni.genImage as genImage
import com.dotmatics.vortex.mol2img.Mol2Img as mol2Img
import jarray
import binascii
import string
import os
import sys
from collections import defaultdict

inchikeys = defaultdict(list)

InChIKeyColumn = vtable.findColumnWithName("InChIKey",1)
DupColumn = vtable.findColumnWithName("DupFlag",1)

rows = vtable.getRealRowCount()

for r in range(0, int(rows)):
    try:
        mol = vtable.molFileManager.getMolFileAtRow(r)
        inChIKey = vortex.getMolProperty(mol, 'InChIKey')
        InChIKeyColumn.setValueFromString(r, inChIKey)
        inchikeys[inChIKey].append(r)
    except:
        pass

vtable.fireTableStructureChanged()

for inchikey, rows in inchikeys.items():
    if len(rows) > 1:
        for r in rows:
            DupColumn.setValueFromString(r, "Duplicate")

vtable.fireTableStructureChanged()
In order to test the performance I took around 150,000 random structures from ChEMBL and then duplicated 0.01% to give a test set of 160,146 molecules. On this test set the original version of the script took 95 mins, whereas version 2 took less than 3 mins! This increase in performance means that it is now practical to use the script on much larger datasets.
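For readers who want to try a similar comparison without Vortex, the following is a rough sketch in plain Python. It is not the benchmark described above: the keys are random 14-letter strings standing in for InChIKeys, the test set is much smaller, and the function names (flag_pairwise, flag_with_dict, make_test_keys) are invented for this example. Timings will vary from machine to machine.

```python
import random
import string
from collections import defaultdict
from timeit import default_timer


def flag_pairwise(keys):
    """Version 1 style: compare every key with every other key, O(N^2)."""
    flags = [""] * len(keys)
    for i in range(len(keys)):
        for j in range(len(keys)):
            if i != j and keys[i] == keys[j]:
                flags[i] = "Duplicate"
                break
    return flags


def flag_with_dict(keys):
    """Version 2 style: group row numbers by key in a dictionary, O(N)."""
    rows_by_key = defaultdict(list)
    for row, key in enumerate(keys):
        rows_by_key[key].append(row)
    flags = [""] * len(keys)
    for rows in rows_by_key.values():
        if len(rows) > 1:
            for row in rows:
                flags[row] = "Duplicate"
    return flags


def make_test_keys(n_unique, n_copies, seed=0):
    """Build n_unique random stand-in keys, then append n_copies duplicated ones."""
    rng = random.Random(seed)
    unique = set()
    while len(unique) < n_unique:
        unique.add("".join(rng.choice(string.ascii_uppercase) for _ in range(14)))
    keys = sorted(unique)
    keys += rng.sample(keys, n_copies)  # each sampled key now appears twice
    rng.shuffle(keys)
    return keys


keys = make_test_keys(1000, 50)

start = default_timer()
slow_flags = flag_pairwise(keys)
pairwise_seconds = default_timer() - start

start = default_timer()
fast_flags = flag_with_dict(keys)
dict_seconds = default_timer() - start

print("pairwise: %.4f s, dictionary: %.4f s" % (pairwise_seconds, dict_seconds))
```

Both functions should produce identical flags. Only the time taken differs, and the gap widens as the number of keys grows.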
The script can be downloaded from here
Page Updated 3 August 2015