Macs in Chemistry

Insanely great science


Flagging Duplicates

When working with multiple data sets of molecules, particularly when combining them from several sources, one of the most common tasks is removing duplicates. Done manually, this is time-consuming and error-prone; this script should make it a much easier task.

The script uses InChIKeys to identify potential duplicate structures. The 27-character standard InChIKey is a hashed version of the full standard InChI and was designed to allow easy searching for chemical compounds, whereas standard InChI strings for large molecules may contain more than 1000 characters. For more details on the InChIKey, see J Cheminform. 2013; 5: 7.
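
Because the InChIKey has a fixed layout, it is easy to check that a value really looks like a key before comparing it. The sketch below is plain Python and does not use the Vortex API. It assumes the usual layout of 14 letters, a hyphen, 10 letters, a hyphen and a final letter; the names INCHIKEY_PATTERN and is_inchikey are made up for this illustration, and the example key is the one commonly listed for ethanol.

```python
import re

# Assumed layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter (27 characters in total)
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def is_inchikey(text):
    """Return True if text has the 27-character InChIKey layout."""
    if not text:
        return False  # None or empty string, e.g. a key that was never generated
    return bool(INCHIKEY_PATTERN.match(text.strip()))


example_key = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"  # key commonly listed for ethanol
print(len(example_key), is_inchikey(example_key))
```

A check like this only tests the shape of the string; it cannot tell whether the key belongs to the structure in that row.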

NOTE! There is a bug in some versions of Vortex that means the InChIKey is not generated; this was fixed in version 42289 and later.

The first part of the script creates two new columns, one for the InChIKey and the other for the duplicate flag, then calculates the InChIKey for each structure and populates the table.

The duplicate search takes each InChIKey in turn and searches through the table, flagging the row if the same key is found in more than one row. Every member of a set of duplicates is flagged, including the first occurrence.
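
The search described above compares every key with every other key, so the work grows with the square of the number of rows. For larger tables, one possible alternative is to count how often each key occurs in a single pass and then flag the keys seen more than once. The sketch below is plain Python operating on a list of strings, not on a Vortex table, and the function name flag_duplicates is made up for this illustration. It also skips empty keys, on the assumption that a missing key should not be treated as a match for another missing key.

```python
def flag_duplicates(keys, label="Duplicate"):
    """Return one flag per key: label if the key occurs more than once, otherwise ''."""
    counts = {}
    for key in keys:
        if key:  # ignore None and empty strings (assumed to mean "no key generated")
            counts[key] = counts.get(key, 0) + 1
    # Second pass: a key is a duplicate if it was counted more than once
    return [label if key and counts[key] > 1 else "" for key in keys]


keys = ["AAA", "BBB", "AAA", "", None, ""]
print(flag_duplicates(keys))
```

In a Vortex script the list of keys would be read from the InChIKey column and the flags written back to the DupFlag column, row by row.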

The Vortex Script

# A script to flag duplicate structures

# Vortex imports
import com.dotmatics.vortex.util.Util as Util
import com.dotmatics.vortex.mol2img.jni.genImage as genImage
import com.dotmatics.vortex.mol2img.Mol2Img as mol2Img
import jarray
import binascii
import string
import os
import sys

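# Find the InChIKey and DupFlag columns (the second argument, 1, asks for the column to be created if it is missing)
InChIKeyColumn = vtable.findColumnWithName("InChIKey",1)
DupColumn = vtable.findColumnWithName("DupFlag",1)

# Calculate the InChIKey for each structure and store it in the table
rows = vtable.getRealRowCount()
for r in range(0, int(rows)):
    mol = vtable.molFileManager.getMolFileAtRow(r)
    inChIKey = vortex.getMolProperty(mol, 'InChIKey')
    InChIKeyColumn.setValueFromString(r, inChIKey)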


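# Compare each InChIKey with every key in the table and flag rows whose key occurs more than once
rows = vtable.getRealRowCount()
for r in range(0, int(rows)):
    SearchInchi = InChIKeyColumn.getValue(r)
    n = 0  # number of rows sharing this InChIKey, reset for each row
    for t in range(0, int(rows)):
        indInchi = InChIKeyColumn.getValue(t)
        if SearchInchi == indInchi:
            n = n+1
    # n includes the row itself, so n > 1 means at least one other row matches
    if n > 1:
        DupFlagVal = "Duplicate"
        DupColumn.setValueFromString(r, DupFlagVal)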


The results


The script can be downloaded from here

Page Updated 16 July 2015