Macs in Chemistry

Insanely Great Science

Finding Duplicate Structures


It is always interesting to note which scripts attract the most attention. Often it is the scripts that help with relatively simple tasks: among the AppleScripts, the most popular is the one that simply prints the clipboard.

Recently I wrote a script to remove duplicate structures from within Vortex.

When working with multiple data sets of molecules, particularly when combining them from multiple sources, one of the most common tasks is the removal of duplicates. This can be a time-consuming and error-prone process if carried out manually, and this script should hopefully make it a much easier task.

The script seems to have attracted interest, but I got a comment that it "works fine but is slow for larger data sets". So I've been looking at improving performance.

To test the performance I took around 150,000 random structures from ChEMBL and then duplicated 0.01% to give a test set of 160,146 molecules. On this test set the original version of the script took 95 mins, while version 2 took less than 3 mins! This increase in performance means that it is now practical to use the script on much larger datasets.
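
For readers curious about how duplicate removal can scale to a set of this size, here is a minimal sketch in Python. It is not the Vortex script, and it says nothing about what actually changed between versions. It assumes each molecule already carries a canonical identifier string (for example a canonical SMILES or an InChIKey), and the names `remove_duplicates` and `molecules` are made up for illustration. Remembering the identifiers already seen in a set means each molecule is checked once, rather than being compared against every other molecule.

```python
def remove_duplicates(records, key=lambda record: record):
    """Return records with later duplicates dropped, keeping first occurrences.

    `key` extracts the value used to decide whether two records are the same
    structure. It is assumed to be a canonical identifier string.
    """
    seen = set()      # identifiers encountered so far
    unique = []       # records kept, in their original order
    for record in records:
        identifier = key(record)
        if identifier in seen:
            continue  # a structure with this identifier was already kept
        seen.add(identifier)
        unique.append(record)
    return unique


# Hypothetical data: (name, canonical SMILES) pairs, with one repeated structure.
molecules = [
    ("mol-1", "CCO"),
    ("mol-2", "c1ccccc1"),
    ("mol-3", "CCO"),
]

unique = remove_duplicates(molecules, key=lambda m: m[1])
print(len(molecules) - len(unique), "duplicate(s) removed")
```

The sketch only works if identical structures really do produce identical identifier strings, so the choice of canonical identifier matters more than the loop itself.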

You can read full details and download it here.

There are many more Hints, scripts and tutorials here.
