Exploring and sorting large datasets of molecules often involves the use of clustering techniques to group together molecules with similar properties. Clustering can be used to flag outliers or highlight particular patterns, functional groups, or scaffolds. Whilst there are many clustering algorithms, it is often a challenge to sort through and analyse the results.
With millions of structures it is not really practical to simply scroll through the table, but here is a selection of scripts that might help with further analysis.
I've recently become interested in comparing the amino-acid composition of peptides, for example cyclic versus linear peptides, or brain penetrant versus non-penetrant. I had a look around but could not find any tools that did this; in particular, I wanted to include any non-proteinogenic amino acids.
This tutorial provides a means to analyse many thousands of peptides using Vortex.
Whilst the rule of 5 (Ro5) has provided a useful way to describe small molecule drug space, it is also clear that there are a significant number of molecular classes that exist beyond the rule of 5 boundaries (bRo5). In a review of the AbbVie compound collection DOI the authors were able to identify key findings that might explain the success (or failure) of bRo5 projects. From an analysis of a variety of calculated physicochemical properties they devised a simple multiparametric scoring function (AB-MPS) that correlated preclinical PK results with cLogD, number of aromatic rings (NAR), and number of rotatable bonds (NRB).
AB-MPS = Abs(cLogD-3) + NAR + NRB
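As a quick illustration, the formula above can be written as a small Python function. This is only a sketch: the function and argument names are my own, and it assumes cLogD, the aromatic ring count, and the rotatable bond count have already been calculated elsewhere.

```python
def ab_mps(clogd, n_aromatic_rings, n_rotatable_bonds):
    """Return AB-MPS = Abs(cLogD - 3) + NAR + NRB.

    The argument names are illustrative; the descriptor values are
    assumed to have been calculated by whatever tool you already use.
    """
    return abs(clogd - 3) + n_aromatic_rings + n_rotatable_bonds


if __name__ == "__main__":
    # Hypothetical compound: cLogD 1.5, 3 aromatic rings, 4 rotatable bonds
    print(ab_mps(1.5, 3, 4))
```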
I've updated the Vortex script to run text based queries of PubMed.
If you regularly use the E-utilities API you might want to read this.
After May 1, 2018, NCBI will limit your access to the E-utilities unless you have one of these keys. Obtaining an API key is quick, and simple, and will allow you to access NCBI data faster. If you don’t have an API key, E-utilities will still work, but you may be limited to fewer requests than allowed with an API key.
After May 1, 2018, any computer (IP address) that submits more than 3 E-utility requests per second will receive an error message. This limit applies to any combination of requests to EInfo, ESearch, ESummary, EFetch, ELink, EPost, ESpell, and EGquery.
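One way to stay under a requests-per-second limit like this is to space calls out on the client side. The sketch below is plain Python 3 and is not taken from the Vortex script; the `Throttle` class is a hypothetical helper, and the default of 3 requests per second simply mirrors the limit quoted above.

```python
import time


class Throttle:
    """Space out calls so that no more than max_per_second are made.

    Hypothetical helper for illustration only. The clock and sleep
    functions can be swapped out, which makes the class easy to test.
    """

    def __init__(self, max_per_second=3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each E-utilities request keeps consecutive requests at least a third of a second apart.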
If you write software or scripts that access the E-utilities API then the users will need to get their own API key. Calls will have this format
I've updated this script to reflect this change, and I've highlighted where you need to add your API key in the script. I've also tried to ensure that any query string is encoded to make it URL safe, and I've extended the search range up to 2018.
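To illustrate the URL-encoding step, here is a minimal Python 3 sketch of building an ESearch request with the standard library. It is not the code from the Vortex script: the function name `build_esearch_url` and the key `MY_KEY` are placeholders, and the date-range handling is my assumption about how a year range might be passed.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(term, api_key, min_year=None, max_year=None, retmax=20):
    """Build a URL-safe PubMed ESearch request (illustrative helper)."""
    params = {
        "db": "pubmed",
        "term": term,        # free-text query, encoded by urlencode below
        "retmax": retmax,
        "api_key": api_key,  # each user supplies their own key
    }
    if min_year is not None and max_year is not None:
        # Restrict the search to a publication date range
        params.update({"datetype": "pdat", "mindate": min_year, "maxdate": max_year})
    # urlencode escapes spaces, ampersands, brackets and so on
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


if __name__ == "__main__":
    print(build_esearch_url("blood brain & barrier", "MY_KEY", 2000, 2018))
```

Letting `urlencode` do the escaping avoids hand-built query strings breaking when a search term contains spaces or an ampersand.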
The RCSB Protein Data Bank is an absolutely invaluable resource that provides archive-information about the 3D shapes of proteins, nucleic acids, and complex assemblies that helps scientists understand all aspects of biomedicine and agriculture, from protein synthesis to health and disease. Currently the PDB contains over 134,000 data files containing structural information on 42,547 distinct protein sequences, of which 37,600 are human sequences. They also provide a series of tools to search, view and analyse the data.
The latest addition to the Hints and Tutorials page is a couple of Vortex scripts for interacting with the RCSB Protein Data Bank. Specifically, they search for PDB structures associated with a list of UniProt codes, and then search for associated information. Read more here…