Macs in Chemistry

Insanely Great Science

How to access Open Targets with R


A blog post describing a package that implements an R client to extract data from the Target validation platform.

The Open Targets Platform is a comprehensive and robust data integration for access to and visualisation of potential drug targets associated with disease. It brings together multiple data types and aims to assist users to identify and prioritise targets for further investigation.

This is an alternative to the public REST API.


9000 packages on CRAN


The latest update to the CRAN R archive brings the total number of packages to 9004.


2016-08-22: 9000 packages
2016-02-29: 8000 packages
2015-08-12: 7000 packages
2014-10-29: 6000 packages
2013-11-08: 5000 packages
2012-08-23: 4000 packages
2011-05-12: 3000 packages
2009-10-04: 2000 packages
2007-04-12: 1000 packages
2004-10-01: 500 packages
2003-04-01: 250 packages
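
The pace of growth is easier to see when the gaps between milestones are worked out. The following is a minimal Python sketch that uses only the dates listed above; the variable names are illustrative.

```python
from datetime import date

# (date reached, package count), copied from the list above, oldest first
milestones = [
    (date(2003, 4, 1), 250),
    (date(2004, 10, 1), 500),
    (date(2007, 4, 12), 1000),
    (date(2009, 10, 4), 2000),
    (date(2011, 5, 12), 3000),
    (date(2012, 8, 23), 4000),
    (date(2013, 11, 8), 5000),
    (date(2014, 10, 29), 6000),
    (date(2015, 8, 12), 7000),
    (date(2016, 2, 29), 8000),
    (date(2016, 8, 22), 9000),
]

# For each milestone after the first: (count reached, days taken, packages added)
intervals = []
for (d0, n0), (d1, n1) in zip(milestones, milestones[1:]):
    intervals.append((n1, (d1 - d0).days, n1 - n0))

for count, days, added in intervals:
    print(f"{count:>5} packages: +{added} in {days} days ({added / days:.2f} per day)")
```

By this calculation the step from 8000 to 9000 packages took 175 days.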

There is a listing of data analysis tools for Mac OS X here.


RRegrs: an R package for computer-aided model selection with multiple regression models


I just thought I'd flag a paper in the Journal of Cheminformatics, "RRegrs: an R package for computer-aided model selection with multiple regression models" (DOI).

We propose an integrated framework for creating multiple regression models, called RRegrs. The tool offers the option of ten simple and complex regression methods combined with repeated 10-fold and leave-one-out cross-validation. Methods include Multiple Linear regression, Generalized Linear Model with Stepwise Feature Selection, Partial Least Squares regression, Lasso regression, and Support Vector Machines Recursive Feature Elimination. The new framework is an automated fully validated procedure which produces standardized reports to quickly oversee the impact of choices in modelling algorithms and assess the model and cross-validation results. The methodology was implemented as an open source R package, available at, by reusing and extending on the caret package.
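
For readers unfamiliar with the validation schemes mentioned in the abstract, here is a rough Python sketch of how repeated k-fold and leave-one-out splits can be generated. It is not taken from RRegrs or caret; the function name `repeated_kfold` and its parameters are hypothetical and chosen only for illustration.

```python
import random

def repeated_kfold(n_samples, k=10, repeats=1, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for rep in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # fresh random partition for every repeat
        # Deal the shuffled indices into k folds of near-equal size
        folds = [indices[i::k] for i in range(k)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield rep, f, train_idx, test_idx

# Repeated 10-fold cross-validation: 3 repeats x 10 folds
splits = list(repeated_kfold(n_samples=50, k=10, repeats=3))
print(len(splits))

# Leave-one-out is the special case k == n_samples
loo = list(repeated_kfold(n_samples=50, k=50))
print(len(loo))
```

Each sample appears in exactly one test fold per repeat, so every observation is predicted once by a model that never saw it during training.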


Poll on data analysis tools


The results of the 16th annual KDnuggets Software Poll on data analysis tools are in.

The top 10 tools by share of users were

R, 46.9% share ( 38.5% in 2014, 37% in 2013)
RapidMiner, 31.5% ( 44.2% in 2014, 39% in 2013)
SQL, 30.9% ( 25.3% in 2014, NA in 2013)
Python, 30.3% ( 19.5% in 2014, 13% in 2013)
Excel, 22.9% ( 25.8% in 2014, 28% in 2013)
KNIME, 20.0% ( 15.0% in 2014, 6% in 2013)
Hadoop, 18.4% ( 12.7% in 2014, 9% in 2013)
Tableau, 12.4% ( 9.1% in 2014, NA in 2013)
SAS, 11.3% (10.9% in 2014, 10.7% in 2013)
Spark, 11.3% ( 2.6% in 2014, NA in 2013)

The results very much reflect my own interactions. Whilst R has a significant installed user base and of course a vast repository of open source packages, Python seems to be gaining traction, in part because it seems to have become the lingua franca for scientific computing.
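
The year-on-year changes can be computed from the figures in the list above. A small Python sketch follows; the shares are copied from the list and the names are illustrative.

```python
# Share of users (%): (current poll, 2014), copied from the list above
shares = {
    "R": (46.9, 38.5),
    "RapidMiner": (31.5, 44.2),
    "SQL": (30.9, 25.3),
    "Python": (30.3, 19.5),
    "Excel": (22.9, 25.8),
    "KNIME": (20.0, 15.0),
    "Hadoop": (18.4, 12.7),
    "Tableau": (12.4, 9.1),
    "SAS": (11.3, 10.9),
    "Spark": (11.3, 2.6),
}

# Change in percentage points since 2014, rounded to one decimal place
changes = {tool: round(now - prev, 1) for tool, (now, prev) in shares.items()}

# Largest gain first
ranked = sorted(changes.items(), key=lambda item: item[1], reverse=True)
for tool, delta in ranked:
    print(f"{tool:<11} {delta:+.1f}")
```

On these numbers Python shows the largest gain in percentage points and RapidMiner the largest fall.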

I've always thought of KNIME and Tableau as excellent tools for implementing workflows but looking at recent iterations it is clear there is now greater emphasis on interactive analysis.

There is a listing of data analysis tools for Mac OS X here.


R Instructor


Whilst R is a very comprehensive statistical and data analysis package, it does have a very steep learning curve.

R Instructor is an iPhone, iPad and iPod Touch application that uses plain, non-technical language and over 30 videos to explain how to make and modify plots, manage data and conduct both parametric and non-parametric statistical tests.

R-Instructor for iOS

Now added to the mobile science site.


R version 3.1.1 released.


I just noticed that R was updated last month to version 3.1.1 (Sock it to Me).

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms.

I’ve also updated the list of Data Analysis packages for Mac OS X.




ChemmineOB provides an R interface to a subset of cheminformatics functionalities implemented by the OpenBabel C++ project. OpenBabel is an open source cheminformatics toolbox that includes utilities for structure format interconversions, descriptor calculations, compound similarity searching and more. ChemmineOB aims to make a subset of these utilities available from within R. For non-developers, ChemmineOB is primarily intended to be used from ChemmineR as an add-on package rather than used directly.

More details here


R update


I just noticed that there is an update to R on the CRAN website.

This binary distribution of R and the GUI supports 64-bit Intel based Macs on Mac OS X 10.6 (Snow Leopard) or higher. Since R 3.0.0 the binary is a single-arch build and contains only the x86_64 (64-bit Intel) architecture. PowerPC Macs and 32-bit Macs are only supported by building from sources or by older binary R versions. The default package type is "mac.binary" and the binary repository layout has changed accordingly.

There is a listing of data analysis packages for Mac OS X here.


fmcsR: Mismatch Tolerant Maximum Common Substructure Searching in R

I’m not a big user of R, a free software environment for statistical computing and graphics, but occasionally I notice cheminformatics modules being published. The latest issue of Bioinformatics (DOI) has a paper describing “fmcsR: Mismatch Tolerant Maximum Common Substructure Searching in R”.

The fmcsR package provides an R interface, with the time consuming steps of the FMCS algorithm implemented in C++. It includes utilities for pairwise compound comparisons, structure similarity searching, clustering and visualization of MCSs. In comparison to an existing MCS tool, fmcsR shows better time performance over a wide range of compound sizes. When mismatching of atoms or bonds is turned on, the compute times increase as expected, and the resulting FMCSs are often substantially larger than their strict MCS counterparts. Based on extensive virtual screening (VS) tests, the flexible matching feature enhances the enrichment of active structures at the top of MCS-based similarity search results. With respect to overall and early enrichment performance, FMCS outperforms most of the seven other VS methods considered in these tests.

fmcsR is freely available for all common operating systems from the Bioconductor site.


ChemmineR updated

ChemmineR, a cheminformatics package for analyzing drug-like small molecule data in R, was recently updated. Its latest version contains functions for efficient processing of large numbers of molecules, physicochemical/structural property predictions, structural similarity searching, classification and clustering of compound libraries with a wide spectrum of algorithms. In addition, it offers visualization functions for compound clustering results and chemical structures.

To install, start R and enter



R reaches version 3.0.0

R, the language and environment for statistical computing and graphics, has now reached version 3.0.0.

Whilst there is a list of new features and updates, those listed as most significant are shown below.

  • Packages need to be (re-)installed under this version (3.0.0) of R.
  • There is a subtle change in behaviour for numeric index values 2^31 and larger. These never used to be legitimate and so were treated as NA, sometimes with a warning. They are now legal for long vectors so there is no longer a warning, and x[2^31] <- y will now extend the vector on a 64-bit platform and give an error on a 32-bit one.
  • It is now possible for 64-bit builds to allocate amounts of memory limited only by the OS. It may be wise to use OS facilities (e.g. ulimit in a bash shell, limit in csh), to set limits on overall memory consumption of an R process, particularly in a multi-user environment. A number of packages need a limit of at least 4GB of virtual memory to load. 64-bit Windows builds of R are by default limited in memory usage to the amount of RAM installed: this limit can be changed by command-line option --max-mem-size or setting environment variable R_MAX_MEM_SIZE.
  • Negative numbers for colours are consistently an error: previously they were sometimes taken as transparent, sometimes mapped into the current palette and sometimes an error.
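
The significance of 2^31 in the second item is that it is one more than the largest value a signed 32-bit integer can hold. The Python sketch below illustrates that arithmetic, along with the memory such a vector would need. It does not call R, and the figure of 8 bytes per element assumes a vector of double-precision numbers.

```python
import struct

INT32_MAX = 2**31 - 1  # largest signed 32-bit integer

def fits_int32(n):
    """Return True if n can be stored as a signed 32-bit integer."""
    try:
        struct.pack("<i", n)
        return True
    except struct.error:
        return False

print(fits_int32(INT32_MAX))  # an index of 2^31 - 1 fits
print(fits_int32(2**31))      # an index of 2^31 does not

# Memory for a vector of 2^31 elements, assuming 8 bytes per element
bytes_needed = 2**31 * 8
gib_needed = bytes_needed / 2**30
print(gib_needed)
```

A vector of that length needs 16 GiB for its data alone, which is why the memory limits described in the third item matter in practice.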

There is a list of data analysis packages for Mac OS X here.


Wizard Updated

Wizard, the point-and-click statistical analysis application for Mac, has been updated.

The focus of this release is supporting several new import formats, including the oft-requested XLSX and Numbers document formats.

A major change in the product line is that reading and writing R files and generating R code has now "graduated" from the Pro version and is available in the Standard version. But Pro users shouldn't feel left out: this release adds support for importing binary SAS files and generating SAS code, both features available only in the Pro version.

New Features:

  • Import XLSX spreadsheets
  • Import Numbers documents

New Features (Pro Version):

  • Import SAS binary files (.sas7bdat)
  • Import plain-text data with SAS commands (.sas)
  • Generate SAS model estimation commands

New Features (Standard Version):

  • Import/export R files
  • Generate R commands

Bug fixes

  • Fix a crash when zero observations are included in the Model view
  • Fix a bug when importing multiple sheets in XLS documents
  • Fix a bug where Q-Q plots were not properly exported as PDF

There is a listing of data analysis tools for the Mac here.


KNIME 2.7 released

KNIME 2.7 has been released.

KNIME now runs on Java 7 for Windows and Linux systems (Mac stays on Java 6). The Eclipse 3.7 update increases stability on Mac and some Linux systems. BIRT 3.7 brings Open Office support among other new features.

JFreeChart nodes now have more setting options in the “General Plot Options” tab of their configuration window.

In R -> Local there are a number of new nodes for moving data between KNIME and R:

  1. “Table to R” reads a KNIME table into R and outputs the R workspace.
  2. “R to Table” takes an R workspace and outputs a KNIME table.
  3. “R +Data to R” takes an R workspace and optional data input and outputs an R workspace.
  4. “R to R-View” takes an R workspace and outputs a KNIME view.

There is a KNIME tutorial here


Chemical Fingerprints

chemfp is a free set of command-line tools, and the underlying Python software library, for generating cheminformatics fingerprint files and searching them based on Tanimoto similarity.
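
For anyone unfamiliar with the measure, the Tanimoto similarity of two bit fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below shows the idea in plain Python. It does not use the chemfp API; the fingerprints, function names and threshold are made up for illustration.

```python
def popcount(x):
    """Number of bits set in a non-negative integer."""
    return bin(x).count("1")

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two bit fingerprints stored as Python ints."""
    union = popcount(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return popcount(fp_a & fp_b) / union

def search(query, targets, threshold=0.7):
    """Return (name, score) pairs scoring at or above threshold, best first."""
    hits = [(name, tanimoto(query, fp)) for name, fp in targets.items()]
    hits = [hit for hit in hits if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Toy 8-bit fingerprints; real fingerprints are typically far longer
query = 0b11110000
targets = {"a": 0b11110000, "b": 0b11100000, "c": 0b00001111}

results = search(query, targets)
for name, score in results:
    print(name, score)
```

A dedicated tool earns its keep on large fingerprint files, where a simple loop like this one would be far too slow.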