There have been many estimates of the size of chemical space; an oft-quoted number is 10^60, large enough to be effectively infinite.
At the beginning of December 2020 the NIH held a workshop looking at ultra-large chemistry databases.
Program Dec. 1, 2020
10:45 Susan Gregurick Welcome
11:00 Yurii Moroz Making virtual REAL: an Approach to Access Billions of Make-on-Demand Compounds
11:30 Daniel Kuhn Searching for novel chemical hit matter in large chemical spaces
12:00 Uta Lessel Boehringer Ingelheim Comprehensive Library of Accessible Innovative Molecules (BICLAIM)
12:30 Zhijie Liu Build & Explore Virtual Libraries for Drug Discovery Projects
1:00 Christos Nicolaou Idea2Data: Expediting Drug Discovery through Proximal Library Exploitation
1:30 Jason Deng & John Shirley Introduction to DEL informatics and Virtual Spaces at WuXi AppTec
2:00 Jennifer Elward Exploring GSK Space: Practical Application of Large Scale Virtual Screening
2:30 Venkatesh Mysore Screening Billions of Compounds on the AtomNet Model: Approaches and Future Directions
Dec. 2, 2020
11:00 Rick Stevens A Large-scale (4.2 Billion Molecules, 60TB) Compound Feature db for Deep Learning in Virtual Drug Screening
11:30 Vladimir Poroikov Revealing Antiviral Hits Among Billion Molecules with Ligand and Target-based Approaches
12:00 Jean-Louis Reymond The GDB Databases and Their Use for Drug Discovery
12:30 Matthias Rarey Combinatorial Approaches for Searching Synthetically Accessible Chemical Space
1:00 John Irwin Virtual Screening of Ultra Large Chemistry Databases
1:30 Gergely Zahoranszky-Kohalmi Integrated Computational Platform for Chemistry Automation
2:00 Tudor Oprea The Art of Navigating in Chemical Bioactivity Space
2:30 Marc Nicklaus & Nadya Tarasova SAVI: Billions of Easily Synthesizable Compounds Generated with Expert-System Rules
3:00 Jim Brase ATOM – Scalable Deep Learning of Generative Models for Molecular Design Optimization
Dec. 3, 2020
11:00 Andrew Dalke & Brian Cole Compression of Chemfp Databases
11:30 Roger Sayle Advances in Searching Ultra-Large (100+ Billion Compound) Compound Chemical Databases: Arthor and SmallWorld
12:00 Christian Lemmen Efficient 3D Exploration of Multi-Billion Compound Spaces
12:30 Lutz Weber & Christoph Ruttkies SciWalker Next Generation - a Novel Comprehensive Semantic Chemistry Search Engine for Heterogeneous Documents and Databases
1:00 Wolf-Dietrich Ihlenfeldt Cloud Databases and Chemical Structure Searching
1:30 Evan Bolton Chemical Space is Infinite: How Can One Scale to Infinity While Still Being Usable/Useful
2:00 Ian Wetherbee & Stephen Boyer A Collaborative Database for Chemistry in Google BigQuery
2:30 Eugene Raush Chemical Substructure Search in Ultra Large Chemical Databases: Fast Virtual Screening with Rapid Isoster Discovery Engine (RIDE)
3:00 Mark McGann GigaDocking: Structure Based Virtual Screening of Billions of Molecules
The presentations are now available online: https://cactus.nci.nih.gov/presentations/NIHBigDB_2020-12/NIHBigDB.html
Whilst there are many commercial packages for creating structure-searchable chemical databases, there is little in the way of Open Source packages, in particular a solution that provides a web front end. There is the RDKit PostgreSQL cartridge; however, installing PostgreSQL and building the database is probably a step too far for those unfamiliar with the command line.
I recently came across ChemRPS. Whilst this uses the same RDKit PostgreSQL cartridge, it adds a search engine (API) and a preconfigured web server with register/search web pages, including the Ketcher structure editor from EPAM. The installation comes as a Docker image, which should make things much easier.
The system had not been tested on a Mac so I've detailed the instructions in this review…
OraRdkitCart is an Oracle data cartridge/extensible index to allow substructure and similarity searching using SQL queries on tables which contain indexed chemical structures.
It uses a Java RMI server and RDKit wrappers for chemical structure handling.
The cartridge has been tested on Oracle 12C and Oracle 18C. It would be expected to run on Oracle 19C, but has not yet been tested.
Full details on GitHub https://jones-gareth.github.io/OraRdKitCart/index.html
The Cambridge Crystallographic Data Centre (CCDC) has announced the first CSD data and software update of 2019.
The 2019 CSD Release contains 957,868 unique structures and 973,630 entries (CSD version 5.40) – an increase of more than 57,000 entries. We are currently on course to reach a million structures by summer 2019.
The update includes an exciting new polyhedra display option in our visualisation software Mercury.
Read more here….
Great work by NextMove: an open, machine-readable, freely reusable, annotated reaction data set is available for download here https://figshare.com/articles/ChemicalreactionsfromUSpatents1976-Sep2016/5104873
The reactions were extracted by text-mining from United States patents published between 1976 and September 2016. They are available as CML or reaction SMILES. Note that the reaction SMILES are derived from the CML.
For convenience the reaction SMILES includes tab delimited columns for: PatentNumber, ParagraphNum, Year, TextMinedYield, CalculatedYield
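As a rough illustration of how such a line might be read, here is a minimal Python sketch. It assumes the reaction SMILES is the first tab-delimited field, followed by the five columns in the order listed above; the sample line is invented for illustration and is not taken from the data set. The function and class names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class ReactionRecord(NamedTuple):
    """One line of the reaction SMILES file (assumed column order)."""
    reaction_smiles: str
    patent_number: str
    paragraph_num: str
    year: str
    text_mined_yield: str
    calculated_yield: str


def parse_reaction_line(line: str) -> ReactionRecord:
    """Split one tab-delimited line into its six fields, kept as strings."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-delimited fields, got {len(fields)}")
    return ReactionRecord(*fields)


def split_reaction(reaction_smiles: str) -> Tuple[List[str], List[str], List[str]]:
    """Split 'reactants>agents>products' into three lists of components."""
    # Ignore anything after the first space (assumed to be an optional extension).
    core = reaction_smiles.split(" ", 1)[0]
    reactants, agents, products = core.split(">")
    # Components within each part are separated by '.'.
    return tuple(part.split(".") if part else [] for part in (reactants, agents, products))


# Invented example line: esterification with placeholder patent metadata.
SAMPLE_LINE = "CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC.O\tUS00000000\t0001\t2000\t85%\t\n"

record = parse_reaction_line(SAMPLE_LINE)
reactants, agents, products = split_reaction(record.reaction_smiles)
```

Yield fields are left as strings here because they may be empty; converting them to numbers would be a sensible next step once the real file has been inspected.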
Now that we have a large initial data set it would be great if others could contribute using the same format.
There is a fabulous detailed review of this invaluable resource on the Depth-First blog http://depth-first.com/articles/2019/01/28/the-nextmove-patent-reaction-dataset/
There is a great blog article on ChEMBL-og, describing their work evaluating chemical structure based searching in MongoDB. MongoDB is a NoSQL database designed for scalability and performance that is attracting a lot of interest at the moment.
The article does a great job in explaining the logic behind improving the search performance.
They also provide an IPython notebook so you can try it yourself.
Following on from the release of ChEMBL 20 earlier in the year, we now see the release of the myChEMBL virtual machines, which add a CentOS-based image alongside the existing Ubuntu version. What might be of interest to Mac OS X users are the myChEMBL Docker images.
Docker is an open platform for building, shipping and running distributed applications. Docker containers wrap up a piece of software in a complete filesystem that contains everything it needs to run: code, runtime, system tools, system libraries – anything you can install on a server. This guarantees that it will always run the same, regardless of the environment it is running in. Installation under Linux is straightforward and instructions for Mac OS X are provided.
Installation on OS X is more complicated. This is because the standard OS X installation downloads and configures VirtualBox and runs a very lightweight 64-bit Linux with Docker installed. The problem is that this won't work for myChEMBL: the virtual machine has only 20GB of available disk space and the myChEMBL container is 23GB after decompressing. So in order to use it, you first have to resize the volume, which is explained here: https://docs.docker.com/articles/b2dvolumeresize/.
Once done the steps are very simple:
- Download the MyChEMBL image from the FTP.
- Uncompress.
- Load image into Docker.
- Run it.
After successful completion of the steps above, you can open your browser and go to http://127.0.0.1/ if you are running Docker locally or http://someotherhost/ if you are running Docker on some other host. You should then be able to see the myChEMBL launchpad page.
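The load and run steps can be sketched in Python as below. This is only an outline under stated assumptions: the archive and image names are hypothetical placeholders (check the myChEMBL release notes for the real ones), the uncompress step is left out because it depends on the archive format, and the container is assumed to serve the launchpad on port 80, as the URL above suggests. The function only builds the command lines, so nothing is executed until you choose to run them.

```python
from typing import List


def mychembl_docker_commands(tar_path: str, image_name: str, host_port: int = 80) -> List[List[str]]:
    """Build the 'docker load' and 'docker run' commands as argument lists.

    tar_path and image_name are placeholders supplied by the caller.
    """
    return [
        # Load the already-uncompressed image archive into the local Docker daemon.
        ["docker", "load", "--input", tar_path],
        # Start the container in the background, mapping the host port to
        # port 80 inside the container (assumed web server port).
        ["docker", "run", "--detach", "--publish", f"{host_port}:80", image_name],
    ]


# Hypothetical names, for illustration only.
commands = mychembl_docker_commands("mychembl_image.tar", "mychembl_image")
for cmd in commands:
    print(" ".join(cmd))
```

To actually execute the steps, each list could be passed to `subprocess.run(cmd, check=True)` once the real file and image names are known.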
The ChemSpider Website has been updated.
ChemSpider is a free chemical structure database providing fast text and structure search access to over 34 million structures from hundreds of data sources.
A few applications have been updated over the last week or so.
Wizard Pro for Mac has been updated to version 1.7.0. Highlights from the update include:
1-click Data Refresh: Suppose you've imported and cleaned your data, perhaps built a few models -- and then your data changes. Now, thanks to the new "Refresh" button in the toolbar, you can instantly update all of your analyses using fresh data from the original source. Customize how columns are matched up with a convenient popover, and feel free to move or rename the source data file on your computer -- Wizard will automatically keep an eye on it. Command-R to refresh the data, Command-E to configure the link.
Revamped menu system: Wizard has a new modular architecture that means you'll only see menus relevant to what you're doing -- that is, Raw Data, Pivot, Summary, Model, and Predict each have their own menu now. Most of the menus are more concise, so you can find what you're looking for faster.
There is a review of Wizard Pro here.
Plot2 is a scientific 2D plotting program designed for everyday plotting; it is easy to use, creates high-quality plots, and allows easy and powerful manipulations and calculations of data. The latest update fixes an export bug.
MyScript Calculator for iOS has been updated, fixing problems with the tutorial being played repeatedly and a drag-and-drop bug.
ChEMBL 20 has been released.
The updated database contains
- 1,715,135 compound records
- 1,463,270 compounds (of which 1,456,020 have mol files)
- 13,520,737 activities
- 1,148,942 assays
- 10,774 targets
- 59,610 source documents
A number of structural alerts have now been added; these include Pfizer LINT filters, Glaxo Wellcome Hard Filters, Bristol-Myers Squibb HTS Deck Filters, NIH MLSMR Excluded Functionality Filters, University of Dundee NTD Screening Library Filters and Pan Assay Interference Compounds (PAINS) Filters. The PAINS annotation was created using the Vortex script described here.
I was recently sent details of a new website, Chemplore. The aim is to provide a modern, interactive and easy way to visualize small molecules and macromolecules in the browser. It's built using many modern web technologies and tools including WebGL, SVG and Go.
It pulls data from a variety of sources including PubChem and PDB, and provides interactive 2D and 3D viewers plus a variety of chemical information.
It is currently in beta and the developers are looking for feedback.
I just got this email
iScienceSearch, the Internet search engine for chemists is now completely free! Please click http://isciencesearch.com/iss to start the application. There is nothing to download. This application will run in your browser.
I’ve previously reviewed iScienceSearch and it seems to have been updated considerably since then.
iScienceSearch is a meta search engine that searches over 100 different databases. The search engine is intelligent and will search using any synonyms or chemical structures of your search query to extend the search to data sources that might not include the original query text.
- Search the Internet by drawing a structure.
- Type a name or identifier and you get the structure.
- Find suppliers for lab and research chemicals.
- Search the AKosSamples database by substructure.
I suspect most scientists now find that they are storing data in SQL databases. I noticed that Impathic have released a series of tools to access a variety of SQL databases from your iOS device, so whether you are using MySQL, Oracle, Access etc. there is probably a dataglass app to give you access.
MongoDB (from "humongous") is an open-source document-oriented database.
Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster.
As you might expect, chemical searching is not something that is traditionally supported, but there have been a couple of blog articles describing initial efforts, and there is now a detailed step-by-step description available. The post describes an implementation of chemical similarity searching using MongoDB and RDKit fingerprints; it also has some initial comparisons with the more traditional SQL implementation using the RDKit PostgreSQL cartridge.
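To give a feel for what fingerprint similarity searching over documents involves, here is a minimal pure-Python sketch. It is not the code from the post and does not use MongoDB or RDKit: the documents are plain dictionaries, the bit positions are invented, and the field names (`smiles`, `bits`, `count`) are hypothetical. It computes the Tanimoto coefficient on sets of "on" bits and applies the commonly used bit-count bound to skip documents that cannot reach the threshold.

```python
from typing import Dict, Iterable, List, Tuple


def tanimoto(bits_a: Iterable[int], bits_b: Iterable[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0
    return len(a & b) / union


def similarity_search(query_bits: Iterable[int], documents: List[Dict],
                      threshold: float = 0.7) -> List[Tuple[str, float]]:
    """Return (smiles, score) pairs with score >= threshold, best first."""
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in (0, 1]")
    query = set(query_bits)
    q_count = len(query)
    hits = []
    for doc in documents:
        count = doc["count"]
        # Bit-count bound: a score >= threshold is only possible when
        # threshold * |query| <= |doc| <= |query| / threshold.
        if count < threshold * q_count or count > q_count / threshold:
            continue
        score = tanimoto(query, doc["bits"])
        if score >= threshold:
            hits.append((doc["smiles"], score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Invented documents: the bit positions are NOT real fingerprints.
DOCUMENTS = [
    {"smiles": "CCO", "bits": [1, 2, 3, 4], "count": 4},
    {"smiles": "CCN", "bits": [1, 2, 3, 9], "count": 4},
    {"smiles": "c1ccccc1", "bits": [7], "count": 1},
]

hits = similarity_search([1, 2, 3, 4], DOCUMENTS, threshold=0.5)
```

In a real database the bit-count bound is what makes the search fast: storing the count alongside the bits lets an index rule out most documents before any fingerprint comparison is done.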
I thought I would highlight a recent publication I read in Journal of Cheminformatics “Molecule database framework: a framework for creating database applications with chemical structure search capability” Journal of Cheminformatics 2013, 5:48 DOI.
From the abstract
Molecule Database Framework is written in Java and I created it by integrating existing free and open-source tools and frameworks. The core functionality includes:
- Chemical structure searches combined with property searches.
- Support for multi-component compounds (mixtures).
- Import and export of SD-files.
- Optional security (authorization).
For chemical structure searching Molecule Database Framework leverages the capabilities of the Bingo Cartridge for PostgreSQL and provides type-safe searching, caching, transactions and optional method level security. Molecule Database Framework supports multi-component chemical compounds (mixtures). Furthermore the design of entity classes and the reasoning behind it are explained. By means of a simple web application I describe how the framework could be used. I then benchmarked this example application to create some basic performance expectations for chemical structure searches and import and export of SD-files.
While not a drag and drop solution it provides a means to create your own personal chemically searchable database.
Molecule Database Framework is available for download on the project's web page on Bitbucket: https://bitbucket.org/kienerj/moleculedatabaseframework.