Chemistry document classifier
The latest issue of the Journal of Cheminformatics has an article entitled “A document classifier for medicinal chemistry publications trained on the ChEMBL corpus” (Journal of Cheminformatics 2014, 6:40, doi:).
The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches in order to assist in the triage process, both for individual scientists and also for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier, which is able to successfully distinguish between publications that are ‘ChEMBL-like’ (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology, makes the ChEMBL corpus a unique resource for text mining.
The models, workflows and tools are freely available for download at https://github.com/chembl/chemblliteratureclassifier
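To give a feel for what a ‘ChEMBL-like’ versus ‘other’ document classifier does, here is a minimal sketch in plain Python. It is not the authors' code or model: it assumes a toy multinomial Naive Bayes over word counts as a stand-in technique, and the class name `NaiveBayesClassifier`, the labels `chembl_like` and `other`, and the training snippets are all invented for illustration rather than taken from the ChEMBL corpus or the repository.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Toy multinomial Naive Bayes with add-one (Laplace) smoothing.

    Hypothetical stand-in for illustration; not the published model.
    """

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocabulary = set()      # every token seen during training

    def fit(self, documents, labels):
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.doc_counts[label] += 1
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, text):
        """Return log prior + summed log likelihoods for each label."""
        # Tokens never seen in training carry no information; skip them.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            denominator = sum(counts.values()) + vocab_size
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented training snippets, for illustration only (not from ChEMBL).
TRAIN_DOCS = [
    "Synthesis and SAR of kinase inhibitors with nanomolar IC50 values",
    "Potent small molecule antagonists: binding affinity and selectivity",
    "Lead optimization of protease inhibitors and in vitro potency",
    "Survey of bird migration patterns across coastal wetlands",
    "Galaxy cluster formation in cosmological simulations",
    "Economic impact of railway expansion on rural towns",
]
TRAIN_LABELS = ["chembl_like"] * 3 + ["other"] * 3

classifier = NaiveBayesClassifier().fit(TRAIN_DOCS, TRAIN_LABELS)
print(classifier.predict("Novel inhibitors with improved IC50 and selectivity"))
print(classifier.predict("Bird migration across rural wetlands"))
```

With only six made-up titles this says nothing about real-world accuracy; a practical classifier would be trained on a large set of curated abstracts, which is where a resource like the ChEMBL corpus comes in.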