Macs in Chemistry

Insanely Great Science

Chemfp

 

I just received this message, which I thought readers might find interesting:

chemfp 1.5 is now available from http://dalkescientific.com/releases/chemfp-1.5.tar.gz and from PyPI (the Python package index) through "pip install chemfp".

The software is available in source code form under the MIT license. For more information see the home page at http://chemfp.com/ or the documentation page at https://chemfp.readthedocs.io/en/chemfp-1.5/ .

Chemfp is a set of command-line tools and a Python library for working with cheminformatics fingerprints. It can use OEChem/OEGraphSim, RDKit, or Open Babel to create fingerprints in the FPS format, and it implements a high-speed Tanimoto search.
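For readers unfamiliar with the terms, here is a minimal pure-Python sketch of what a Tanimoto comparison between two FPS-style records involves. It does not use chemfp's own API. The helper names `parse_fps_line` and `tanimoto` are hypothetical, the record layout (hex-encoded fingerprint, a tab, then an identifier) is the basic FPS data-line form, and scoring two empty fingerprints as 0.0 is an assumption made for this sketch.

```python
def parse_fps_line(line):
    """Split one FPS-style data line into (identifier, fingerprint bytes).

    Assumes the basic layout: hex-encoded fingerprint, a tab, then the id.
    """
    hex_fp, identifier = line.rstrip("\n").split("\t")[:2]
    return identifier, bytes.fromhex(hex_fp)


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity: bits in common divided by bits set in either."""
    a = int.from_bytes(fp_a, "little")
    b = int.from_bytes(fp_b, "little")
    union = bin(a | b).count("1")
    if union == 0:
        # Assumption for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return bin(a & b).count("1") / union


# Two made-up 8-bit records; real fingerprints are much longer.
id_a, fp_a = parse_fps_line("0f\tmol1\n")
id_b, fp_b = parse_fps_line("03\tmol2\n")
score = tanimoto(fp_a, fp_b)
```

A tool such as chemfp performs the same arithmetic with heavily optimized popcount code, which is where the speed comes from.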

As far as I can tell, chemfp 1.5 is the fastest free/open source fingerprint search system for the CPU. (Some proprietary/commercial toolkits are faster, including the commercial version of chemfp, and GPU-based search is usually faster than the CPU.)

The main changes for this release are:

  • 10% faster performance for k-nearest search
  • fixed a bug in symmetric k-nearest neighbor when multiple fingerprints have no bits set
  • improved the use of chemfp as a baseline benchmark for similarity search tools
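To illustrate what a k-nearest search does, and why fingerprints with no bits set need care, here is a rough brute-force sketch. It is not chemfp's implementation or API. The function names and the toy 4-bit fingerprints are invented for illustration, and treating the 0/0 case as a score of 0.0 is an assumption of this sketch.

```python
import heapq


def popcount(x):
    """Number of bits set in a non-negative integer."""
    return bin(x).count("1")


def tanimoto_int(a, b):
    """Tanimoto similarity of two fingerprints stored as integers."""
    union = popcount(a | b)
    # Empty fingerprints would otherwise divide by zero; score them 0.0
    # (an assumption made for this sketch).
    return popcount(a & b) / union if union else 0.0


def knearest(query, targets, k=3, threshold=0.0):
    """Return up to k (score, id) hits at or above threshold, best first."""
    scored = ((tanimoto_int(query, fp), ident) for ident, fp in targets)
    hits = [(s, ident) for s, ident in scored if s >= threshold]
    return heapq.nlargest(k, hits, key=lambda hit: hit[0])


# Toy 4-bit fingerprints, including one with no bits set.
targets = [("a", 0b1111), ("b", 0b0011), ("c", 0b0000), ("d", 0b1100)]
top2 = knearest(0b0011, targets, k=2)
```

This linear scan compares the query against every target. Fast search tools get their performance from optimized bit counting and from skipping targets that cannot reach the threshold.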

Similarity search performance benchmark

Concerning the last point, I have assembled a data set which can be used to benchmark similarity search performance for several different search types, fingerprint types, and scoring functions. This includes pre-computed fingerprints and expected search results, as well as timing numbers for several different versions of chemfp.

My hope is that it evolves into a standard benchmark that helps evaluate search tools, bearing in mind that performance is only one of many factors that go into selecting a tool.

The benchmark files are at https://bitbucket.org/dalke/chemfp_benchmark . Those files which fall under copyright are distributed under the MIT license.

Many thanks to ChEMBL, OpenEye, PubChem, Open Babel, RDKit, and Daniel Lemire for providing the data and resources for putting this benchmark together.

Best regards,

Andrew dalke@dalkescientific.com

