# Python leads the 11 top Data Science, Machine Learning platforms

The results of the latest KDnuggets poll, now in its 20th year, are in. Python is clearly moving towards becoming the dominant platform, while the votes for R are slowly declining.

The blog post on KDnuggets gives far more detailed analysis and is well worth reading.

# How to access Open Targets with R

A blog post describing a package that implements an R client to extract data from the Target validation platform.

The Open Targets Platform is a comprehensive and robust data integration for access to and visualisation of potential drug targets associated with disease. It brings together multiple data types and aims to assist users to identify and prioritise targets for further investigation.

This is an alternative to the public REST API.

# 9000 packages on CRAN

The latest update to the CRAN R archive brings the total number of packages to 9004.

Milestones:

2016-08-22: 9000 packages

2016-02-29: 8000 packages

2015-08-12: 7000 packages

2014-10-29: 6000 packages

2013-11-08: 5000 packages

2012-08-23: 4000 packages

2011-05-12: 3000 packages

2009-10-04: 2000 packages

2007-04-12: 1000 packages

2004-10-01: 500 packages

2003-04-01: 250 packages
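To see how the pace has changed, the gaps between milestones can be worked out directly from the dates above. This is a minimal sketch using only the Python standard library; the function name `intervals` is my own choice for illustration.

```python
from datetime import date

# Milestone dates and package counts copied from the list above (oldest first).
milestones = [
    (date(2003, 4, 1), 250),
    (date(2004, 10, 1), 500),
    (date(2007, 4, 12), 1000),
    (date(2009, 10, 4), 2000),
    (date(2011, 5, 12), 3000),
    (date(2012, 8, 23), 4000),
    (date(2013, 11, 8), 5000),
    (date(2014, 10, 29), 6000),
    (date(2015, 8, 12), 7000),
    (date(2016, 2, 29), 8000),
    (date(2016, 8, 22), 9000),
]


def intervals(points):
    """Return (start_count, end_count, days) for each consecutive pair."""
    return [
        (c0, c1, (d1 - d0).days)
        for (d0, c0), (d1, c1) in zip(points, points[1:])
    ]


for start, end, days in intervals(milestones):
    rate = (end - start) / days  # average packages added per day
    print(f"{start:>5} -> {end:>5}: {days:>4} days ({rate:.1f} packages/day)")
```

By this count the step from 8000 to 9000 packages took 175 days, compared with 201 days for the step from 7000 to 8000.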

There is a listing of data analysis tools for Mac OS X here.

# RRegrs: an R package for computer-aided model selection with multiple regression models

I just thought I'd flag a paper in the Journal of Cheminformatics, "RRegrs: an R package for computer-aided model selection with multiple regression models" DOI.

We propose an integrated framework for creating multiple regression models, called RRegrs. The tool offers the option of ten simple and complex regression methods combined with repeated 10-fold and leave-one-out cross-validation. Methods include Multiple Linear regression, Generalized Linear Model with Stepwise Feature Selection, Partial Least Squares regression, Lasso regression, and Support Vector Machines Recursive Feature Elimination. The new framework is an automated fully validated procedure which produces standardized reports to quickly oversee the impact of choices in modelling algorithms and assess the model and cross-validation results. The methodology was implemented as an open source R package, available at https://www.github.com/enanomapper/RRegrs, by reusing and extending on the caret package.
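For anyone unfamiliar with the validation schemes mentioned in the abstract, here is a rough sketch of repeated k-fold and leave-one-out cross-validation. It is not RRegrs code and does not use caret: it is plain Python with a single-predictor least-squares fit on synthetic data, and all function names are made up for illustration.

```python
import random


def kfold_indices(n, k, rng):
    """Shuffle range(n) and split it into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def cv_rmse(xs, ys, folds):
    """RMSE over held-out predictions, refitting the model for each fold."""
    sq_err = 0.0
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(xs)) if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_err += sum((ys[i] - (a + b * xs[i])) ** 2 for i in test)
    return (sq_err / len(xs)) ** 0.5


def repeated_kfold_rmse(xs, ys, k=10, repeats=5, seed=0):
    """Average the k-fold RMSE over several different random splits."""
    rng = random.Random(seed)
    scores = [
        cv_rmse(xs, ys, kfold_indices(len(xs), k, rng)) for _ in range(repeats)
    ]
    return sum(scores) / repeats


def loo_rmse(xs, ys):
    """Leave-one-out: every observation is its own test fold."""
    return cv_rmse(xs, ys, [[i] for i in range(len(xs))])


# Synthetic data (an assumption for the demo): y = 2 + 0.5x plus noise.
noise = random.Random(42)
xs = [float(i) for i in range(50)]
ys = [2.0 + 0.5 * x + noise.gauss(0, 1) for x in xs]

print("repeated 10-fold RMSE:", round(repeated_kfold_rmse(xs, ys), 3))
print("leave-one-out RMSE:   ", round(loo_rmse(xs, ys), 3))
```

Repeating the k-fold split with different shuffles reduces the dependence of the error estimate on any one partition, which is presumably why the package combines both schemes.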

# Poll on data analysis tools

The results of the 16th annual KDnuggets Software Poll on data analysis tools are in.

The top 10 tools by share of users were:

R, 46.9% share (38.5% in 2014, 37% in 2013)

RapidMiner, 31.5% (44.2% in 2014, 39% in 2013)

SQL, 30.9% (25.3% in 2014, NA in 2013)

Python, 30.3% (19.5% in 2014, 13% in 2013)

Excel, 22.9% (25.8% in 2014, 28% in 2013)

KNIME, 20.0% (15.0% in 2014, 6% in 2013)

Hadoop, 18.4% (12.7% in 2014, 9% in 2013)

Tableau, 12.4% (9.1% in 2014, NA in 2013)

SAS, 11.3% (10.9% in 2014, 10.7% in 2013)

Spark, 11.3% (2.6% in 2014, NA in 2013)
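The year-on-year movement is easier to see as a change in percentage points. The short sketch below uses the figures from the list above. The function name `changes` is just illustrative, and the SAS figure is treated as 11.3%.

```python
# (tool, current share, 2014 share) in percent, copied from the list above.
shares = [
    ("R", 46.9, 38.5),
    ("RapidMiner", 31.5, 44.2),
    ("SQL", 30.9, 25.3),
    ("Python", 30.3, 19.5),
    ("Excel", 22.9, 25.8),
    ("KNIME", 20.0, 15.0),
    ("Hadoop", 18.4, 12.7),
    ("Tableau", 12.4, 9.1),
    ("SAS", 11.3, 10.9),
    ("Spark", 11.3, 2.6),
]


def changes(rows):
    """Return (tool, change in percentage points), largest gain first."""
    deltas = [(tool, round(now - prev, 1)) for tool, now, prev in rows]
    return sorted(deltas, key=lambda item: item[1], reverse=True)


for tool, delta in changes(shares):
    print(f"{tool:<11} {delta:+.1f}")
```

On these numbers Python shows the largest gain (10.8 points) and RapidMiner the largest fall (12.7 points).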

The results very much reflect my own interactions. Whilst R has a significant installed user base and, of course, a vast repository of open source packages, Python seems to be gaining traction, certainly in part because it seems to have become the lingua franca for scientific computing.

I've always thought of KNIME and Tableau as excellent tools for implementing workflows but looking at recent iterations it is clear there is now greater emphasis on interactive analysis.

There is a listing of data analysis tools for Mac OS X here.

# R Instructor

Whilst R is a very comprehensive statistical and data analysis package, it does have a very steep learning curve.

R Instructor is an iPhone, iPad and iPod Touch application that uses plain, non-technical language and over 30 videos to explain how to make and modify plots, manage data and conduct both parametric and non-parametric statistical tests.

Now added to the mobile science site.

# R version 3.1.1 released.

I just noticed that R was updated last month to version 3.1.1 (Sock it to Me).

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms.

I’ve also updated the list of Data Analysis packages for Mac OS X.

# ChemmineOB

ChemmineOB provides an R interface to a subset of cheminformatics functionalities implemented by the OpenBabel C++ project. OpenBabel is an open source cheminformatics toolbox that includes utilities for structure format interconversions, descriptor calculations, compound similarity searching and more. ChemmineOB aims to make a subset of these utilities available from within R. For non-developers, ChemmineOB is primarily intended to be used from ChemmineR as an add-on package rather than used directly.

# R update

I just noticed that there is an update to R on the CRAN website.

This binary distribution of R and the GUI supports 64-bit Intel based Macs on Mac OS X 10.6 (Snow Leopard) or higher. Since R 3.0.0 the binary is a single-arch build and contains only the x86_64 (64-bit Intel) architecture. PowerPC Macs and 32-bit Macs are only supported by building from sources or by older binary R versions. The default package type is "mac.binary" and the binary repository layout has changed accordingly.

There is a listing of data analysis packages for Mac OS X here.

# fmcsR: Mismatch Tolerant Maximum Common Substructure Searching in R

I’m not a big user of R, a free software environment for statistical computing and graphics, but occasionally I notice cheminformatics modules being published. The latest issue of Bioinformatics DOI has a paper describing “fmcsR: Mismatch Tolerant Maximum Common Substructure Searching in R”.

The fmcsR package provides an R interface, with the time consuming steps of the FMCS algorithm implemented in C++. It includes utilities for pairwise compound comparisons, structure similarity searching, clustering and visualization of MCSs. In comparison to an existing MCS tool, fmcsR shows better time performance over a wide range of compound sizes. When mismatching of atoms or bonds is turned on, the compute times increase as expected, and the resulting FMCSs are often substantially larger than their strict MCS counterparts. Based on extensive virtual screening (VS) tests, the flexible matching feature enhances the enrichment of active structures at the top of MCS-based similarity search results. With respect to overall and early enrichment performance, FMCS outperforms most of the seven other VS methods considered in these tests.
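To illustrate why a larger common substructure matters for similarity searching, here is a small sketch of the conventional way to turn an MCS size into a similarity score, using the Tanimoto and overlap coefficients on atom counts. This is not the fmcsR implementation. The function name and the atom counts are hypothetical, and finding the MCS itself (the hard part) is not attempted.

```python
def mcs_similarity(size_a, size_b, size_mcs):
    """Similarity scores from the sizes of two molecules and their MCS.

    size_a, size_b: atom counts of the two compounds
    size_mcs:       atom count of their (flexible or strict) MCS
    """
    if min(size_a, size_b) <= 0:
        raise ValueError("molecule sizes must be positive")
    if not 0 <= size_mcs <= min(size_a, size_b):
        raise ValueError("MCS cannot be larger than the smaller molecule")
    tanimoto = size_mcs / (size_a + size_b - size_mcs)
    overlap = size_mcs / min(size_a, size_b)
    return tanimoto, overlap


# Hypothetical example: molecules with 20 and 25 atoms.
strict = mcs_similarity(20, 25, 10)    # strict MCS of 10 atoms
flexible = mcs_similarity(20, 25, 15)  # mismatch-tolerant MCS of 15 atoms

print("strict:   Tanimoto %.2f, overlap %.2f" % strict)
print("flexible: Tanimoto %.2f, overlap %.2f" % flexible)
```

With these made-up numbers, allowing mismatches that grow the common substructure from 10 to 15 atoms lifts the Tanimoto score from about 0.29 to 0.50, which is the kind of shift that would change a compound's position in a ranked search.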

fmcsR is freely available for all common operating systems from the Bioconductor site http://www.bioconductor.org/packages/devel/bioc/html/fmcsR.html.