I previously mentioned a comparison of various tools for clustering large datasets. I've now updated the Vortex script to allow the user to select the centroid of each cluster. I tried it on a clustered dataset of 4.3 million structures and the script took only a few seconds to run.
The page on clustering is here, and the Vortex script can be downloaded from http://macinchem.org/reviews/vortex_scripts/ChoseCentreFromClusters.vpy.zip.
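To give a feel for what picking a cluster centre involves, here is a minimal sketch in plain Python. It is not the code in the Vortex script and does not use the Vortex API. It assumes each structure already has a cluster identifier and a fingerprint stored as a set of on-bits, and it picks the member with the highest summed Tanimoto similarity to the rest of its cluster (strictly a medoid). The names `tanimoto`, `choose_centres`, `cluster_ids` and `fingerprints` are made up for illustration, and the data is a toy example.

```python
from collections import defaultdict


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / float(len(a | b))


def choose_centres(cluster_ids, fingerprints):
    """Return {cluster_id: row index of the member most similar to the rest}."""
    # Group row indices by their cluster identifier.
    members = defaultdict(list)
    for row, cid in enumerate(cluster_ids):
        members[cid].append(row)

    centres = {}
    for cid, rows in members.items():
        best_row, best_score = rows[0], -1.0
        for i in rows:
            # Summed similarity of row i to every other member of the cluster.
            score = sum(
                tanimoto(fingerprints[i], fingerprints[j]) for j in rows if j != i
            )
            # Strict comparison: on a tie the earliest row is kept.
            if score > best_score:
                best_row, best_score = i, score
        centres[cid] = best_row
    return centres


# Toy data (hypothetical): five structures in two clusters.
cluster_ids = [1, 1, 1, 2, 2]
fingerprints = [
    {1, 2, 3},
    {1, 2, 3, 4},
    {1, 2, 3, 4, 5, 6},
    {7, 8},
    {7, 8, 9},
]
centres = choose_centres(cluster_ids, fingerprints)
print(centres)
```

The pairwise comparison is quadratic in the size of each cluster, so for very large clusters a cheaper criterion would probably be needed.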
Clustering is an invaluable cheminformatics technique for subdividing a typically large compound collection into small groups of similar compounds. One of the advantages is that, once the collection is clustered, you can store the cluster identifiers and refer to them later; this is particularly valuable when dealing with very large datasets. Clustering is often used in the analysis of high-throughput screening results, or in the analysis of virtual screening or docking studies.
On this page I've explored multiple options for clustering, from Open Source toolkits to sophisticated desktop applications.