Macs in Chemistry

Insanely Great Science

data analysis

DataWarrior update


A new version of DataWarrior has been released

v05.05.00: April 2021

  • 3D-Structure alignment considering shape and pharmacophoric features (PheSA)
  • Google Patent search and results in DataWarrior (keyword, structure, date, ...)
  • Link to Spaya synthesis planning server
  • Searchable and navigable user manual
  • Macro to retrieve and visualize world-wide Corona virus spreading
  • Lots of new features, range filter animations, smarter labels, ...



Pro Fit supports Big Sur


pro Fit 7 is now at version 7.0.18, supporting dark mode, Catalina, and Big Sur.

pro Fit is a macOS application for data/function analysis, plotting, and curve fitting. It is used by scientists, engineers and students to analyze their measurements and the mathematical models they use to describe them.


  • Data windows for storing and analyzing data
  • Drawing windows for plots and other graphics
  • Function windows for user defined functions
  • Write your own functions and scripts using Python or Pascal
  • Numerous Curve Fitting Algorithms:
  • Levenberg-Marquardt, Robust, Multi-dimensional
  • High resolution, high quality drawings and graphs
  • Full PDF support for exporting figures
  • Big Sur and Retina Support

There is a comprehensive list of scientific applications under Big Sur here


Blueprint for Scientific Visualization and Cheminformatics Analysis for Small Molecule Project Data


Dotmatics have just announced the release of Blueprint, a web-based visualization and scientific analytics application designed to help scientists working on small molecule discovery projects.

Those at the last Dotmatics user group meeting saw an early demo of this platform, a web-based, interactive data visualisation and analysis system with chemical intelligence, and now it has been released.


  • Load datasets from Browser or from files (SD, SMILES)
  • Visualize structures in interactive tables, grids, matrices
  • Visualize data as scatter plots, bar charts, line charts or pie charts
  • Calculate molecular properties, ligand profiles, ligand efficiencies
  • Refine datasets by filtering on structure and/or properties
  • Perform R-group and matched molecular pairs analysis
  • Generate program metrics
  • Share workspaces, datasets and analyses
  • Send selections to Browser, Vortex
  • Export results to files

I'm sure we will hear more at the Dotmatics UGM.


Table Dripper


Getting data from web pages is one of the common tasks for anyone involved in data analysis. Table Dripper is a table data scraper that works with your web browser and can scrape tables and save them as CSV or HTML.



WebPlotDigitizer has been updated


I just noticed that WebPlotDigitizer has recently been updated.

It is often necessary to reverse engineer images of data visualizations to extract the underlying numerical data. WebPlotDigitizer is a semi-automated tool that makes this process extremely easy.

More details on the Data Analysis Tools Page.


Camelot site change


I've just been told the link to Camelot has changed, so I've updated the link on the Data Analysis Tools Page.

Camelot: PDF Table Extraction for Humans. Camelot is a Python library that makes it easy for anyone to extract tables from PDF files!

To install using Anaconda

conda install -c conda-forge camelot-py


After installing the dependencies, tk and ghostscript, you can simply use pip to install Camelot:

pip install camelot-py[cv]

Camelot only works with text-based PDFs and not scanned documents.


Alvascience software


Alvascience recently released three new software packages for QSAR and chemoinformatics:

  • alvaMolecule is free software (for academic use) to visualise, analyse, curate and standardize molecular datasets.
  • alvaModel is a software tool to create Quantitative Structure Activity/Property Relationship (QSAR/QSPR) models. The models developed using alvaModel can be easily deployed as 'alvaRunner projects'. Once a model has been deployed, it can be used by anyone via alvaRunner.
  • alvaRunner is a free software (for academic use) to apply QSAR/QSPR regression models, developed with alvaModel, on a set of molecules. It calculates the descriptors and fingerprints needed to apply the given QSAR/QSPR regression models and it does not need any other software to be used. You can find some alvaRunner projects here

There are also some introductory videos, also available on the YouTube channel:


Interactive plots in Jupyter Notebooks updated


I've been using Jupyter notebooks for a while for a wide variety of projects.

I've been looking at ways to produce interactive plots within a Jupyter notebook and, after trying a couple of options, I can now produce interactive data frames, in addition to 2D and 3D scatterplots including structures on tooltips.

Full review and the Jupyter notebook are here.



Interactive plots in Jupyter notebooks


I've been looking at ways to produce interactive plots within a Jupyter notebook and after trying a couple of options I used Plotly. This seems fairly straightforward to use and I can produce interactive data frames, in addition to 2D and 3D scatterplots.

More details are shown here together with the Jupyter notebook. It is very much a work in progress and suggestions are welcome. In particular, whilst I can get text to appear when hovering over a data point, I'd be interested in ideas of how to get the structure displayed when you mouse over a point.





Data clean up is often one of the most time-consuming parts of any form of data analysis and I thought I'd mention pyjanitor.

pyjanitor is a project that extends Pandas with a verb-based API, providing convenient data cleaning routines for repetitive tasks.

It can be installed using conda

conda install pyjanitor -c conda-forge

or using pip

pip install pyjanitor

There is extensive Documentation, including a section on cleaning chemistry data.

I've added it to the Open Source Python Data Science Libraries


Dealing with large data files


Spotted this on twitter

I've added xsv to the page of tips for handling very large data files.

xsv is a command line program for indexing, slicing, analyzing, splitting and joining CSV files.
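
If you want to do something similar from inside Python, the slicing idea can be sketched with the standard library. This is not xsv and makes no attempt to match its speed or options; the function name `slice_csv` and its arguments are hypothetical, chosen only for illustration. The point is that rows are streamed, so the whole file is never held in memory.

```python
import csv
from itertools import islice


def slice_csv(path, start, end):
    """Yield data rows start..end-1 (0-based, header excluded) as dictionaries."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)  # the first line is treated as the header
        # islice skips the rows before `start` and stops reading at `end`.
        yield from islice(reader, start, end)
```

Calling `list(slice_csv("big.csv", 1000, 1010))` would return ten rows keyed by column name, where `big.csv` stands in for your own file.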


StarDrop, version 6.6 released


Optibrium have just released StarDrop version 6.6; this update includes:

pKa prediction - A new model included in the ADME QSAR module. Existing ADME QSAR users can upgrade free of charge. Details of this were presented by Peter Hunt and the webinar can be accessed here.


SeeSAR™ modules - An extended suite of SeeSAR™ modules to support structure-based design:

  • View – Visualise protein-ligand interactions in 3D.
  • Affinity – Analyse your ligand’s affinity with visual atomic contributions and torsion angle heat maps.
  • Pose – Generate compound poses for virtual screening and interactive 3D design.


A complete guide to K-means clustering algorithm


A little while back I compared different Options for Clustering large datasets of Molecules.

Clustering is an invaluable cheminformatics technique for subdividing a typically large compound collection into small groups of similar compounds. One of the advantages is that once clustered you can store the cluster identifiers and then refer to them later; this is particularly valuable when dealing with very large datasets. This is often used in the analysis of high-throughput screening results, or the analysis of virtual screening or docking studies.

One popular (and quick) technique for clustering is to use K-means clustering. I just came across this very useful explanation of K-means clustering, well worth a read.

A complete guide to K-means clustering algorithm.
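
To give a feel for the algorithm before you follow the link, here is a minimal sketch of the standard K-means iteration (often called Lloyd's algorithm) in plain Python. It is not taken from the linked guide: the function name `kmeans`, the fixed seed and the toy 2D points are assumptions for illustration, and for real compound collections you would cluster fingerprint or descriptor vectors with an established library instead.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster equal-length numeric tuples into k groups (Lloyd's algorithm)."""
    rng = random.Random(seed)
    # Use k entries of the data, picked at random, as the starting centroids.
    centroids = rng.sample(points, k)
    labels = None

    for _ in range(iterations):
        # Assignment step: give each point the index of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # nothing moved, so the clustering has converged
        labels = new_labels

        # Update step: move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))

    return labels, centroids


# Toy 2D data standing in for descriptor vectors.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The returned labels are the cluster identifiers you can store alongside each compound. Note that the result depends on the random starting centroids, which is why the seed is fixed here.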


Python leads the 11 top Data Science, Machine Learning platforms


The results of the latest KDnuggets poll, which is in its 20th year, are in. Python is clearly moving to become the dominant platform with the votes for R slowly declining.


The blog post on KDnuggets gives far more detailed analysis and is well worth reading.


pro Fit 7.0.14 has just been released


pro Fit 7.0.14 has just been released and is available now. This is a maintenance update to QuantumSoft’s product for data and function analysis/plotting and nonlinear curve fitting.

This release improves AppleScript performance and fixes several other bugs. This is a recommended update for all users of pro Fit 7.0.


pro Fit 7 is a Mac OS application for data/function analysis, plotting, and curve fitting. It is used by scientists, engineers and students to analyse their measurements and the mathematical models they use to describe them. Users can define any mathematical function and use it to model their data, finding the function parameters that best describe it. A vast number of tools allow the mathematical and statistical analysis and processing of functions and data sets, and the software is also used to produce aesthetically pleasing graphical representations for books, articles, and any other reports involving plots of data and functions.

There is a listing of Data Analysis Tools for Mac OSX here.


Data Extractor


Data Extractor has been updated to version 1.7.1 with a number of internal improvements.

Data Extractor allows you to extract data contained inside text documents and collect them in an internal organized table with fields and records. It can parse all the text files you specify and analyze them, understanding from text tags what to extract and where to put it.

Data Extractor requires Mac OSX 10.10 or later.

There are more Data Analysis tools here.


Data curation workflow


One of the most time-consuming parts of any data analysis is curating the input data prior to any model building. This Knime workflow is fully documented and described and as such is an invaluable starting point.

A semi-automated procedure is made available to support scientists in data preparation for modelling purposes. The procedure addresses:

  • Automatic chemical data retrieval (i.e., SMILES) from different, orthogonal web based databases, by using two different identifiers, i.e. chemical name and CAS registration number. Records were scored based on the coherence of information retrieved from different web sources.
  • Data curation procedure performed on top-scored records. The procedure includes removal of inorganic and organometallic compounds and mixtures, neutralization of salts, removal of duplicates, checking of tautomeric forms.
  • Standardization of chemical structures yielding to ready-to-use data for the development of QSARs.


Comparison of bioactivity predictions


Small molecules can potentially bind to a variety of biomolecular targets and whilst counter-screening against a wide variety of targets is feasible, it can be rather expensive and is probably only realistic when a compound has been identified as of particular interest. For this reason there is considerable interest in building computational models to predict potential interactions. With the advent of large data sets of well annotated biological activity such as ChEMBL and BindingDB this has become possible.

ChEMBL 24 contains 15,207,914 activity data on 12,091 targets and 2,275,906 compounds; BindingDB contains 1,454,892 binding data for 7,082 protein targets and 652,068 small molecules.

These predictions may aid understanding of molecular mechanisms underlying the molecule's bioactivity and predicting potential side effects or cross-reactivity.

Whilst there are a number of sites that can be used to predict bioactivity data, I'm going to compare one site, Polypharmacology Browser 2 (PPB2), with two tools that can be downloaded to run the predictions locally. One is a set of Jupyter notebook models built on ChEMBL data by the ChEMBL group, and the other is a more recent random forest model, PIDGIN. If you are using proprietary molecules it is unwise to use the online tools.

Read the article here


TS Calc The mathematical equations tool


TS Calc is a document-based application and its documents can be realized and used as calculation models for specific mathematical technical problems. It is a completely different approach to solving math problems compared with the usual one using spreadsheets.



Optimizing colormaps with consideration for color vision deficiency to enable accurate interpretation of scientific data


Around 4% of the population suffer from colour blindness in one form or another, with red/green colour blindness being the most common. Sadly, in many plots, graphs and presentations little effort is made to make things easier for people with colour blindness.

Color blindness, also known as color vision deficiency (CVD), is the decreased ability to see color or differences in color. Simple tasks such as selecting ripe fruit, choosing clothing, and reading traffic lights can be more challenging. Color blindness may also make some educational activities more difficult.

A recent publication seeks to address this need, Optimizing colormaps with consideration for color vision deficiency to enable accurate interpretation of scientific data DOI

While there have been some attempts to make aesthetically pleasing or subjectively tolerable colormaps for those with CVD, our goal was to make optimized colormaps for the most accurate perception of scientific data by as many viewers as possible. We developed a Python module, cmaputil, to create CVD-optimized colormaps, which imports colormaps and modifies them to be perceptually uniform in CVD-safe colorspace while linearizing and maximizing the brightness range. The module is made available to the science community to enable others to easily create their own CVD-optimized colormaps.



19th annual KDnuggets Software Poll


The results of the 19th annual KDnuggets Software Poll are now in. Continuing the trend of the last few years, Python has expanded its user base and is now up to 66%. Since a couple of the other options are also Python based this could be an underestimate.

There is more detailed analysis on the website. Interestingly Python seems to be the only programming language that is increasing in use.


Workshop on Computational Tools for Drug Discovery


In many companies/institutions/universities new arrivals are presented with a variety of desktop tools with little or no advice on how to use them other than "pick it up as you go along". This workshop is intended to provide expert tutorials to get you started and show what can be achieved with the software.

The tutorials will be given by a series of outstanding experts: Christian Lemmen (BioSolveIT), Akos Tarcsay (ChemAxon), Giovanna Tedesco (Cresset), Dan Ormsby (Dotmatics), Greg Landrum (Knime) and Matt Segall (Optibrium). You will be able to install the software packages on your own laptops, together with a license to allow you to use them for a limited period after the event.

Registration and full details are here.


Open Source Python Data Science Libraries


When I wrote the article entitled A few thoughts on scientific software, one of the responses I got was that people did not know about the existence of open-source chemistry toolkits, so I thought I'd publish a page that hopefully stops people reinventing the wheel. Here are a few open-source cheminformatics toolkits that I'm aware of.

As a follow up I thought I'd put together a list of useful Python libraries for data science.

As always, happy to hear comments or suggestions for additions.


Camelot, python tool for extracting PDF table data


Camelot is described as PDF Table Extraction for Humans; it is a Python library that makes it easy to extract tables from PDF files.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html
>>> tables[0].df # get a pandas DataFrame!

Camelot only works with text-based PDFs and not scanned documents. Camelot also comes with a command-line interface. It can be installed using conda

$ conda install -c camelot-dev camelot-py

I've added it to the Data Analysis tools page




LabMathX is a MacOSX program for scientific analysis, calculations and visualisation that includes support for older hardware.


  • LabMathX is Scriptable with AppleScript. Check the Dictionary with Apple's Script Editor.
  • LabMathX Supports Services and Can Be Accessed From the Services Menu in Other Services-Aware Applications.
  • LabMathX and Its Plug-ins Are Written in Objective C Under Cocoa.

Data Creator Updated


One of the things that I’m occasionally asked for is a test data set that can be used to evaluate an application. Whilst I keep a couple of data sets that I can use, perhaps Data Creator will provide a more comprehensive solution. Data Creator is an application that has been designed to fill this important niche; it can be used to build very large data sets using field types defined by the user and then filled with random realistic content.

Data Creator can create sample tables (rows and columns) as you like and fill them with pseudo-random proper content (rows of content) with a single click. You can select which kind of fields (columns) you like (name of animals, colors, fruits, english surname, german names and so on with over 50 different kind of data) and have all the contents filled for how many rows you like in a click. It can export to Comma separated value, Tab separated values, html tables, even web pages ready to click or in any custom format you like.

The latest update brings a couple of bug fixes and:

  • New type 'Decimal Number in Range' to support many requested formats such as currency (example: $ 1.99)
  • Improved error detection of data formatting
  • Optimized for macOS 10.12 Sierra

There is a review of DataCreator here.


Scaling Python with Dask webinar


This looks to be an interesting webinar on Dask, on Wednesday, May 30th at 2:00 PM CDT.

Dask is a flexible parallel computing library for analytic computing.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.


Unix commands for helping deal with very large files


I'm regularly handling very large files containing millions of chemical structures and, whilst BBEdit is my usual tool for editing text files, in practice it becomes rather cumbersome for really large files (> 2 GB). For these cases I've compiled a useful list of UNIX commands that make life easier.

The page is part of the Hints and Tutorials section and can be viewed here.

Whilst I use them when dealing with large chemical structure files they are equally useful when dealing with any large text or data files.


A suggestion from a reader. Sometimes, rather than one large file, download sites provide the data as a large number of individual files. We can keep track of the number of files using this simple command.

MacPro:~ Chris$ ls | wc -l
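
If the count is needed inside a script, a rough Python equivalent can be sketched as below. The function name `count_entries` is hypothetical, and the sketch assumes you want to mimic plain `ls` by skipping names that start with a dot. It uses `os.scandir`, which iterates over the directory without building a full list of names first.

```python
import os


def count_entries(directory="."):
    """Count directory entries, skipping dot-files as plain `ls` does."""
    with os.scandir(directory) as entries:
        return sum(1 for entry in entries if not entry.name.startswith("."))
```

Calling `count_entries("downloads")` on a folder of downloaded files would return the number of visible entries, where `downloads` stands in for your own directory.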

If anyone has any additional suggestions please feel free to submit them.


Top 12 unix commands for data scientists.


A really useful post on KDnuggets.

With the beautiful intuitive interface it is sometimes easy to forget that Mac OS X has unix underpinnings and that the Terminal gives access to a whole set of invaluable tools.

This post is a short overview of a dozen Unix-like operating system command line tools which can be useful for data science tasks. The list does not include any general file management commands (pwd, ls, mkdir, rm, ...) or remote session management tools (rsh, ssh, ...), but is instead made up of utilities which would be useful from a data science perspective, generally those related to varying degrees of data inspection and processing. They are all included within a typical Unix-like operating system as well.

If you regularly have to deal with very large data files some of these commands will be invaluable, for example:

head outputs the first n lines of a file (10, by default) to standard output. The number of lines displayed can be set with the -n option.

head -n 5 myfile.txt
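
The same idea can be sketched in Python for use inside a notebook or script. The helper name `head` is chosen here only to mirror the command and is not part of any library. `itertools.islice` stops reading after the first n lines, so the rest of a very large file is never loaded.

```python
from itertools import islice


def head(path, n=10):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        # islice pulls at most n lines from the file iterator and then stops.
        return [line.rstrip("\n") for line in islice(handle, n)]
```

Passing `n=5` returns a list of the first five lines, matching the command-line example.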

Read more here.


Rodeo: A Python IDE for Data Scientists


Just added Rodeo, a Python IDE built for analysing data, to the page of data analysis tools.



Vortex update

Dotmatics have announced the impending release of the latest update to Vortex

The focus appears to be on the enhancement of the Vortex bioinformatics tools reviewed previously.


Data Analysis tools


I've just added the simple lightweight CSV editor Table Tool to the Data Analysis tools page.

The Data Analysis tools page contains a listing of over 100 applications, tools and libraries that can be used for data analysis under Mac OSX.


Vida updated


VIDA v4.4.0 has been released. This upgrade adds several new features and fixes many previous issues.

  • A new ribbon style that produces ribbons with a smoother appearance has been introduced into VIDA.
  • Improvements to the Builder/Sketcher, including:
      • closing the Sketcher window prompts for Save, Save as New, Discard, or Cancel
      • closing the Builder closes the Sketcher window
      • an additional “Save As New” option in the toolbar and Builder context menu
      • hitting Return now finishes adding typed-in molecules from the Sketcher
  • Significant improvements to the Extension Manager. In addition, extensions can be centrally deactivated.

VIDA is built on top of the OpenEye Toolkits v2017.Oct libraries to ensure that it and ancillary programs take full advantage of the state-of-the-art improvements in all underlying programming libraries. Support for macOS El Capitan (10.11), macOS Sierra (10.12), and macOS High Sierra (10.13) has been added.


Wizard Pro updated


Wizard Pro is a data analysis application with ease of use as a key design feature. It is designed to encourage the user to click and explore data.


There is a listing of Data Analysis tools for MacOSX here.


2017 The State of Data Science & Machine Learning


A new review looking at data science and machine learning.

This year, for the first time, we conducted an industry-wide survey to establish a comprehensive view of the state of data science and machine learning. We received over 16,000 responses and learned a ton about who is working with data, what’s happening at the cutting edge of machine learning across industries, and how new data scientists can best break into the field.


The full dataset is there for you to explore, and the interactive web page allows you to slice and dice on the fly. The above plot looks at the tools being used by scientists; Python dominates, and it is interesting to see the number using Jupyter notebooks. The methods used are shown in the plot below.



2017 Wolfram Technology Conference


An interesting blog entry on the recent 2017 Wolfram Technology Conference. This is a unique experience where researchers and professionals interacted directly with those who build each component of the Wolfram technology.

I particularly like this comment.

It was not uncommon for software engineers or physicists to glean new tricks and tools from a social scientist or English teacher—or vice versa—a testament to the diversity and wide range of cutting-edge uses Wolfram technologies provide.

The blog entry is well worth a read.

Delighted to see the ubiquitous presence of MacBook Pros!


DataWarrior Updated


I notice that DataWarrior has had a couple of updates recently.

DataWarrior combines dynamic graphical views and interactive row filtering with chemical intelligence. Scatter plots, box plots, bar charts and pie charts not only visualize numerical or category data, but also show trends of multiple scaffolds or compound substitution patterns.

The latest updates

v04.06.01: August 2017 Fixed plugin interface bug. Various small bug-fixes and improvements.
v04.06.00: July 2017 new plugin interface to easily develop database access extensions

DataWarrior can be downloaded here


Flot plots updated


I have updated the page showing the interactive plots using Flot and ChemDoodle Web Components

I have a regular need to share results from my work and historically this has been via paper reports that have more recently been replaced by electronic versions. Whilst useful, these reports lack interactivity; in particular, it is extremely useful to be able to easily link data points on a scatter plot with the corresponding chemical structure. So I’ve started using web-based reports to add extra functionality. Unfortunately this has often required the addition of applets or plugins that I can’t be sure the viewer will have available, so with the advent of HTML5 I’ve been exploring writing the reports using just HTML and JavaScript. One of the major challenges is to produce interactive plots instead of using static images, and I’ve been exploring the use of Flot to produce a plot with chemical structures produced using either a web-service like ChemSpider or a JavaScript library of web components developed by ChemDoodle.



Free machine learning and data science ebooks


An interesting post by Matthew Mayo, KDnuggets.

Here is a quick collection of such books to start your fair weather study off on the right foot. The list begins with a base of statistics, moves on to machine learning foundations, progresses to a few bigger picture titles, has a quick look at an advanced topic or 2, and ends off with something that brings it all together. A mix of classic and contemporary titles, hopefully you find something new (to you) and of interest here.


StarDrop 6.4


StarDrop 6.4 now links prepared 3D docking and alignment models with data visualisation, 2D SAR analyses and predictive models in a single interface.

Computational chemists can make their validated 3D models available to their colleagues via StarDrop’s Pose Generation Interface, which is compatible with software from major computational chemistry providers, including:

  • FlexX™ – BioSolveIT
  • Gold™ – Cambridge Crystallographic Data Centre
  • MOE™ – Chemical Computing Group
  • AutoDock Vina – The Scripps Research Institute
  • POSIT™ – OpenEye Scientific
  • …extendable to other third party applications.

The Pose Generation Interface communicates with a Pose Generation Server, on which computational chemists can easily publish their validated docking or 3D alignment models. These are made instantly available for StarDrop users to submit their compounds and the resulting poses, protein structures and scores are returned directly to StarDrop for visualisation and analysis.

The Pose Generation Server can be installed wherever you run your 3D modelling software, supporting Linux, Windows® and Mac®

There are more details in the poster presented at the Spring ACS 2017.


Data Extractor updated


The Data Analysis Tools page contains a list of applications for data analysis that run under Mac OSX; in addition I've also included some other useful tools. Included in the list is Data Extractor.

Data Extractor allows you to extract data in a sparse format contained inside various files and collect the data you need in an internal structured table. Collected data can be exported at any time in various formats (CSV, TSV, HTML, Custom). Data Extractor can parse thousands and thousands of files in a few seconds and collect the data inside. It uses simple smart instructions about how to recognize the data you need, how to extract them and where to put these data inside a structured table, ready to be exported.

Version 1.5 updates:

  • Additional force option: 'Prefix at Start of Line'
  • Extraction algorithm improved
  • Bug fix extracting data with start tag having a space as first character
  • Other minor bug fix
  • Optimized for macOS 10.12 Sierra


Scaffold Hunter update


Scaffold Hunter is a chemical data organization and analysis tool that has been continuously enhanced since the start of its development in 2007. The platform-independent open-source tool was first released in 2009 and provided an interactive visualisation of the so-called scaffold tree, which is a hierarchical classification scheme for molecules based on their common scaffolds. A recent publication describes extensions that significantly increase the applicability for a variety of tasks DOI.

When I first opened the application I did not find it particularly intuitive; fortunately there is an online tutorial and sample datasets available.




I've just been sent details of an app to aid generating regular expressions, Expressions. I use BBEdit for most of my regular expression searching but this looks like a brilliant way to build the query.



Vortex does Biology


I was at the Dotmatics UGM recently and they gave an insight into some of the future directions. One of the areas under consideration is the use of Vortex to support biological data analysis.

Vortex is a very high performance data analysis and plotting tool, capable of handling many millions of rows of data. It also has chemical intelligence built in, allowing structure-based searching, physicochemical properties calculation, clustering and matched pair analysis.

The support for biology is a new addition and I've written a brief review here.


Added to the growing list of software reviews.


Swift Algorithm Club


The Swift Algorithm Club is a new site that describes implementations of popular algorithms and data structures in Swift. As an added bonus, there are also detailed explanations of how they work. The list below gives an idea of what is available or under construction, and I’m sure they would be delighted to receive contributions.

The algorithms


  • Linear Search. Find an element in an array.
  • Binary Search. Quickly find elements in a sorted array.
  • Count Occurrences. Count how often a value appears in an array.
  • Select Minimum / Maximum. Find the minimum/maximum value in an array.
  • k-th Largest Element. Find the k-th largest element in an array, such as the median.
  • Selection Sampling. Randomly choose a bunch of items from a collection.
  • Union-Find. Keeps track of disjoint sets and lets you quickly merge them.

String Search

  • Brute-Force String Search. A naive method.
  • Boyer-Moore. A fast method to search for substrings. It skips ahead based on a look-up table, to avoid looking at every character in the text.
  • Knuth-Morris-Pratt
  • Rabin-Karp
  • Longest Common Subsequence. Find the longest sequence of characters that appear in the same order in both strings.


It's fun to see how sorting algorithms work, but in practice you'll almost never have to provide your own sorting routines. Swift's own sort() is more than up to the job. But if you're curious, read on...

Basic sorts:

  • Insertion Sort
  • Selection Sort
  • Shell Sort

Fast sorts:

  • Quicksort
  • Merge Sort
  • Heap Sort

Special-purpose sorts:

  • Counting Sort
  • Radix Sort
  • Topological Sort

Bad sorting algorithms (don't use these!):

  • Bubble Sort


  • Run-Length Encoding (RLE). Store repeated values as a single byte and a count.
  • Huffman Coding. Store more common elements using a smaller number of bits.
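
As a quick illustration of the run-length encoding idea, here is a minimal sketch in Python rather than Swift, so it is not the club's implementation. The function names `rle_encode` and `rle_decode` are hypothetical, and the encoded form is kept as a list of (byte value, run length) pairs for readability instead of a packed byte stream.

```python
from itertools import groupby


def rle_encode(data):
    """Collapse each run of identical bytes into a (byte value, run length) pair."""
    # Iterating over a bytes object yields integers, which groupby splits into runs.
    return [(value, len(list(run))) for value, run in groupby(data)]


def rle_decode(pairs):
    """Expand (byte value, run length) pairs back into the original bytes."""
    return b"".join(bytes([value]) * count for value, count in pairs)
```

Encoding only pays off when the data contains long runs; for data with few repeats the list of pairs is larger than the input.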


  • Shuffle. Randomly rearranges the contents of an array.
  • Comb Sort. An improvement upon the Bubble Sort algorithm.


  • Greatest Common Divisor (GCD). Special bonus: the least common multiple.
  • Permutations and Combinations. Get your combinatorics on!
  • Shunting Yard Algorithm. Convert infix expressions to postfix.
  • Statistics
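
The GCD entry and its least common multiple bonus can be sketched in a few lines. This is a Python sketch of Euclid's algorithm, not the Swift code from the site, and the function names `gcd` and `lcm` are chosen here for illustration.

```python
def gcd(a, b):
    """Greatest common divisor via Euclid's algorithm."""
    while b:
        # Replace (a, b) with (b, a mod b) until the remainder is zero.
        a, b = b, a % b
    return abs(a)


def lcm(a, b):
    """Least common multiple, derived from the GCD."""
    if a == 0 or b == 0:
        return 0
    return abs(a * b) // gcd(a, b)
```

In everyday Python code the standard library's `math.gcd` does the same job; the sketch is only meant to show how little there is to the algorithm.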

Machine learning

  • k-Means Clustering. Unsupervised classifier that partitions data into k clusters.
  • k-Nearest Neighbors
  • Linear Regression
  • Logistic Regression
  • Neural Networks
  • PageRank


Importing Open Source Malaria Data into DataWarrior


Thomas Sander has provided a version of DataWarrior that can directly import the Open Source Malaria Data.

The new version can be downloaded here. Once downloaded, you will need to temporarily adjust your security settings to open it the first time. This is because DataWarrior is not from the Mac App Store or an identified developer. Once it is open, make sure you reset your security settings.


Once installed and opened select the macro as shown below to retrieve the Open Source Malaria Data.


The import only takes a few seconds and pulls the data directly from the Open Source Malaria spreadsheet so it will contain the latest information.


There are now a variety of different options for accessing the Open Source Malaria data: you can use the Cheminfo spreadsheet, a Vortex script or even an IPython notebook.


MathStatica 2.72


mathStatica 2.72 is fully compatible with Mathematica 11

mathStatica 2.7 unleashes the power of your computer — automatically — featuring phenomenal speed and power for users with multi-processor machines.

mathStatica 2.7 Parallel Processing Engine — on Mathematica 11


Timings in seconds using Mathematica 11 running on an R2-D2 Mac Pro computer

There is a listing of data analysis tools for MacOSX here


9000 packages on CRAN


The latest update to the CRAN R archive brings the total number of packages to 9004.


2016-08-22: 9000 packages
2016-02-29: 8000 packages
2015-08-12: 7000 packages
2014-10-29: 6000 packages
2013-11-08: 5000 packages
2012-08-23: 4000 packages
2011-05-12: 3000 packages
2009-10-04: 2000 packages
2007-04-12: 1000 packages
2004-10-01: 500 packages
2003-04-01: 250 packages
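The acceleration is easier to see if you work out the gap between milestones. A small sketch using only the dates and counts listed above (the variable names are mine):

```python
from datetime import date

# Milestone dates and package counts copied from the list above, oldest first.
milestones = [
    (date(2003, 4, 1), 250),
    (date(2004, 10, 1), 500),
    (date(2007, 4, 12), 1000),
    (date(2009, 10, 4), 2000),
    (date(2011, 5, 12), 3000),
    (date(2012, 8, 23), 4000),
    (date(2013, 11, 8), 5000),
    (date(2014, 10, 29), 6000),
    (date(2015, 8, 12), 7000),
    (date(2016, 2, 29), 8000),
    (date(2016, 8, 22), 9000),
]

# Days elapsed between each pair of consecutive milestones.
intervals = [
    (n0, n1, (d1 - d0).days)
    for (d0, n0), (d1, n1) in zip(milestones, milestones[1:])
]

for n0, n1, days in intervals:
    rate = (n1 - n0) / days
    print(f"{n0:>5} -> {n1:>5}: {days:>4} days, {rate:.2f} packages/day")
```

The most recent thousand packages took 175 days, compared with 201 days for the thousand before.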

There is a listing of data analysis tools for Mac OSX here.


Mathematica 11 Is Now Available


Mathematica 11 has been released.

We are pleased to announce that Mathematica 11 has arrived, with over 500 new functions! Continuing on the path of aggressive innovation that Stephen Wolfram first embarked on 30 years ago, Version 11 embraces new areas of modern technology and introduces cutting-edge functionality to match. With Mathematica, you can now print 3D models and plots directly through either local or cloud-based 3D printers. Or instead, identify over 10,000 objects, and classify and extract features in your data with the customizable suite of enhanced machine learning tools. You can also construct, train and evaluate high-performance neural networks with both CPU and GPU support, enabling powerful deep learning in just a few lines of code. Integrated support for audio, from trimming and filters to synthesizing sounds and measuring audio, makes Mathematica 11 a flexible platform for digital audio processing and analysis.

You can read more about it in Stephen Wolfram’s blog post.


StarDrop 6.3 released

Optibrium have just announced the release of StarDrop 6.3. Perhaps the highlight of this release is the introduction of the new SeeSAR module.

The SeeSAR module developed in collaboration with BioSolveIT provides seamless access in StarDrop to 3D structures based on X-ray crystallography or predicted with any docking software. The intuitive link between this 3D information and StarDrop’s cheminformatics analyses and visualisations, based on 2-dimensional compound structure, gives new insights into structure-activity relationships (SAR) within your project chemistry and aids the design of improved compounds. It also supports collaboration between computational and synthetic chemists, helping to share the results of 3D modelling with all decision makers.


You can watch a video tutorial here




Molsoft have just announced an interesting new product, ICM-Scarab, a one-stop shop for capturing and analysing bioinformatics and chemoinformatics data. It provides an electronic notebook for storing experimental information integrated with query tools that allow the user to effortlessly search both internal and external SQL databases.

There is a webinar on Wed, Jun 29, 2016, 5:00 PM - 6:00 PM BST if you want to find out more.


17th annual KDnuggets Software Data Analysis Poll


The results of the annual data analysis poll are in and show some interesting trends, in particular the dramatic increase in Python use.

R remains the leading tool, with 49% share (up from 46.9% in 2015), but Python usage grew faster and it almost caught up to R with 45.8% share (up from 30.3%).

Actually looking down the list I notice there is also an entry for scikit-learn, which is Python based, and if you add that in Python is now the most commonly used data analysis tool.

There was a 10% drop in the use of KNIME, and a 36% drop in the use of TIBCO Spotfire, two products used in cheminformatics.

In terms of programming languages Python is by far the most extensively used.

Python 45.8% share (was 30.3%) 51% increase
Java 16.8% share (was 14.1%) 19% increase
Unix shell/awk/gawk 10.4% share (was 8.0%) 30% increase
C/C++ 7.3% share (was 9.4%) 23% decrease
Other programming languages 6.8% share (was 5.1%) 34.1% increase
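The increases quoted above are relative changes in share, not percentage-point differences. A minimal sketch of the arithmetic, using the rounded shares from the list (the function name is my own; results computed from rounded shares can differ slightly from the poll's published figures, which were presumably derived from unrounded data):

```python
def relative_change(new, old):
    """Percentage change of new relative to old."""
    return (new - old) / old * 100


# (2016 share, 2015 share) in percent, copied from the list above.
shares = {
    "Python": (45.8, 30.3),
    "Java": (16.8, 14.1),
    "Unix shell/awk/gawk": (10.4, 8.0),
    "C/C++": (7.3, 9.4),
    "Other programming languages": (6.8, 5.1),
}

for name, (new, old) in shares.items():
    print(f"{name}: {relative_change(new, old):+.1f}%")
```

So Python's move from 30.3% to 45.8% is a gain of 15.5 percentage points, which is roughly a 51% relative increase.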

In the Big Data area Hadoop (22.1%) and Spark (21.6%) dominate.

There is a listing of data analysis tools for MacOSX here.


Cytoscape Update


Cytoscape has been updated to version 3.4.0

Note: this update requires Java 8 and Mac OS X 10.9 or later.

Cytoscape is an open source software platform for visualizing molecular interaction networks and biological pathways and integrating these networks with annotations, gene expression profiles and other state data.


Data Extractor updated


Data Extractor has been updated to version 1.4.2

Data Extractor allows you to extract data held in a sparse format inside various files and collect the data you need in an internal structured table. Data Extractor can parse thousands and thousands of files in a few seconds and collect the data inside.

More details here…

There are more tools for data analysis here.




If you regularly have to manually edit files containing data in delimited text format then this application may be of interest.

DB-Text is a general purpose tool for editing delimited text files. It can automatically recognise the format used by analysing the content. It can accept data with mixed use of quotes and provides tools to copy selected rows to the clipboard in CSV (comma separated), TSV (tab separated) or HTML format with a simple click.


I've added it to the Data Analysis Tools page


PAST, free software for scientific data analysis


I was just sent a link to PAST, free software for scientific data analysis, with functions for data manipulation, plotting, univariate and multivariate statistics, ecological analysis, time series and spatial analysis, morphometrics and stratigraphy.

Current version (February 2016): 3.11 runs under Mac OSX 10.8 and later.

Hammer, Ø., Harper, D.A.T., Ryan, P.D. 2001. PAST: Paleontological statistics software package for education and data analysis. Palaeontologia Electronica 4(1): 9pp.

I've added it to the page of data analysis packages for Mac OSX.


Shinobicontrols iOS charting


I've just been sent a link to Shinobicontrols, an advanced charting kit for mobile devices. If you are developing an iOS app that requires plots or charts this may be a useful addition.

If you are looking for a graphing toolkit for both iOS and MacOS then it might be worth looking at the tools from VVI.


Tabula is awesome!


I recently needed to download the supplementary information provided with a publication, and my heart sank when I saw it was provided as a PDF file. My worst fears were justified when I tried to simply copy and paste SMILES strings together with 5 columns of data into a spreadsheet: no chance of it copying across in an ordered manner!

Then I tried Tabula, a tool for "liberating data tables locked inside PDF files". It worked perfectly: nearly 2000 rows of data spread over 11 pages converted to a csv file in a couple of mouse clicks. This is wonderful and should be part of any data scientist's toolkit.

It is included on the Data Analysis Tools page but really deserves a special mention.


RRegrs: an R package for computer-aided model selection with multiple regression models


I just thought I'd flag a paper in Journal of Cheminformatics, RRegrs: an R package for computer-aided model selection with multiple regression models DOI.

We propose an integrated framework for creating multiple regression models, called RRegrs. The tool offers the option of ten simple and complex regression methods combined with repeated 10-fold and leave-one-out cross-validation. Methods include Multiple Linear regression, Generalized Linear Model with Stepwise Feature Selection, Partial Least Squares regression, Lasso regression, and Support Vector Machines Recursive Feature Elimination. The new framework is an automated fully validated procedure which produces standardized reports to quickly oversee the impact of choices in modelling algorithms and assess the model and cross-validation results. The methodology was implemented as an open source R package, available at, by reusing and extending on the caret package.
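For anyone unfamiliar with the validation schemes mentioned in the abstract, here is a minimal sketch of how a k-fold split partitions the data. It is written in Python for illustration only and is not taken from RRegrs or caret, which are R packages; the function name `k_fold_indices` and its parameters are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for one shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
for train, test in splits:
    print(len(train), len(test))
```

Each sample is held out exactly once. Setting `k` equal to the number of samples gives leave-one-out cross-validation, and calling the function again with a different `seed` gives a repeat of the 10-fold procedure.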


Wizard Pro updated


Wizard Pro has been updated.

What's New in Version 1.7.18

  • Added support for Numbers 3.6 files
  • Exporting data from the Summary view now has a progress bar and cancel button
  • Exporting data now writes directly to disk instead of building the file in memory first
  • Many other performance improvements throughout the program
  • Added support for generating the R command for the Friedman test
  • Added support for generating SAS/SPSS commands for repeated-measures tests
  • Report the combined significance of constant coefficients for models with multiple sets of coefficients
  • Fix a crash after exporting ROC curves
  • Fix a bug when exporting DTA files with missing values
  • Fix a bug where formatting options weren't showing in the Pivot view
  • Fix a number of minor interface glitches
  • Improved support for importing SAS and SPSS command files
  • Improved support for SAS catalog files

There is a review of Wizard Pro here, and a listing of data analysis tools for Mac OSX here.


Data Creator Updated


Data Creator has been updated. This is an invaluable tool if you ever need to create a pseudo-random data-set.

What's New in Version 1.5

  • New creator type: USA cities, Italian cities, French cities, German Cities.
  • Bug fix: Changing format to export, the file extension was not changing automatically
  • Other minor bug fix.
  • Optimized for OS X 10.11 El Capitan

Data Creator can create sample tables (rows and columns) as you like and fill them with suitable pseudo-random content (rows of content) with a single click. You can select which kinds of fields (columns) you like (names of animals, colors, fruits, English surnames, German names and so on, with over 50 different kinds of data) and have the contents filled for as many rows as you like in a click.


Wizard Pro updated


The popular data analysis tool Wizard Pro for Mac has been updated. Wizard includes a full set of tools for doing professional research, yet its friendly interface makes statistics accessible to beginners. There is a review here.

New in 1.7.17:

• Show ellipses when data is truncated in the Raw Data view
• Fix a bug where Shapiro-Wilk and one-column Kolmogorov-Smirnov tests on highly repetitious data gave overly conservative results
• Improved support for exporting Stata .dta files
• Improved support for importing compressed SAS data files
• Improved support for importing SAS catalog files containing a large number of value labels

There is a comprehensive listing of data analysis tools for MacOSX here.


Computational chemistry guides & tools


The Medicines for Malaria Venture have an interesting page in which they are accumulating a list of computational tools and guides describing the use of the tools to address key issues within the drug discovery process.

Tools were chosen to address common needs expressed by medicinal and computational chemists working in the not-for-profit area. Recognising that this is a global effort, we have selected software packages on the basis of being free for all users.

The guides are either text descriptions or webcasts showing the tool in action. To date they include DataWarrior, KNIME, YASARA, ChEMBL and PK Tool.




csvkit is a suite of utilities written in Python for converting to and working with files in csv format. csvkit is designed to be used as a replacement for most of Python’s csv module; simply

import csvkit

It can also be called from the command line

in2csv data.json > data.csv

To install on a Mac you can use pip, a tool for installing and managing Python packages.

pip install csvkit

It is supported on OSX and Linux. It also works–but is tested less frequently–on Windows.
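To show the sort of conversion a JSON to CSV step performs, here is a rough sketch using only Python's standard library. It is not csvkit's implementation and does not call csvkit at all; the function name `json_records_to_csv` is hypothetical, and it assumes the input is a JSON array of flat objects.

```python
import csv
import io
import json


def json_records_to_csv(json_text):
    """Convert a JSON array of flat objects to CSV text."""
    records = json.loads(json_text)
    # Collect column names in first-seen order across all records.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)   # missing keys are written as empty cells
    return buffer.getvalue()


sample = '[{"id": 1, "smiles": "CCO"}, {"id": 2, "smiles": "c1ccccc1", "ic50": 0.5}]'
print(json_records_to_csv(sample))
```

For real data the command line tool is far more robust, for example with nested JSON.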


pro Fit 7 released


pro Fit has been updated to version 7. It is a Mac OS X application for data/function analysis, plotting, and curve fitting. This is a complete rebuilding of pro Fit from the ground up to complete the transition to Cocoa and the latest Mac OS X technologies.

The release notes give full details but a couple of notable features are

  • 64 bit architecture: pro Fit comes as a universal binary and runs in 64 bit mode by default. If you need to run pro Fit under 32 bit (e.g. because you want to link to a 32 bit plug-in or Python module), you can set it to run under 32 bit mode by choosing Get Info in the Finder.
  • Global search: It is now possible to search for a text string simultaneously in all text, data, and drawing windows.
  • Sandboxing: pro Fit 7 is a sandboxed application, supporting all standard security features offered by the OS. As a side effect, the location of the plug-in folder has changed. To locate it, choose "Open user's plug-in folder" from the "Customize" menu.
  • Scripting: pro Fit 7 changed the interfaces to some commands, which sometimes required that some of the programming interfaces be modified, too. In addition, we harmonized some naming conventions in our programming interfaces. Please see the "programming" read me file for more details.
  • The fitting engine now can (optionally) use long-double precision for enhanced accuracy.
  • pro Fit now supports high-resolution images on retina displays.

Sandboxing does of course mean a few changes.

pro Fit cannot access files outside its application container without explicit user permission. The application container is found under "~/Library/Containers/com.quansoft.profit". Therefore:

If you are running a script that accesses files outside pro Fit's containers, e.g. in your Documents folder, you must grant pro Fit explicit permission to do so: Choose Preferences from the pro Fit menu, and navigate to the tab Security. Then, add the desired directory to the list of accessible directories. The permission will be permanently stored, i.e. it will persist even if you restart pro Fit.

The "pro Fit plug-ins" folder, which contains scripts and plug-ins to be automatically loaded during start-up, is now placed in pro Fit's container (under ~/Library/Containers/com.quansoft.profit/Data/Library/Application Support/com.quansoft.profit/pro Fit plug-ins). To navigate to that folder, choose "Open User's Plug-in folder" from the Customize menu.

There is a list of data analysis tools for Mac OS X here.


Wizard Pro Updated


The popular data analysis and plotting application has been updated.

New in 1.7.9:

• Fix a crash when opening a document with a Quantile column
• Fix a crash when attempting to use a column as its own join key
• Fix an occasional crash after entering or leaving Full Screen mode
• Fix a few minor interface glitches
• Bug fix: the XLS and JSON exporters did not properly respect Data Filters
• Bug fix: Time-of-day columns derived from other columns were not properly displayed in the Raw Data view
• Bug fix: weighted log-linear models sometimes produced an error
• Bug fix: a model's constant term was not included when exporting the coefficient table as XLS
• Feature: Include prediction intervals when exporting prediction tables

There is a review of Wizard Pro here.


Tabula 1.0 released


If you have ever been in the situation where supporting information for a publication is provided in PDF format then you will appreciate Tabula. Tabula allows you to extract that data into a CSV or Microsoft Excel spreadsheet using a simple, easy-to-use interface.

I've added it to the page of data analysis tools.


Poll on data analysis tools


The results of the 16th annual KDnuggets Software Poll on data analysis tools are in.

The top 10 tools by share of users were

R, 46.9% share ( 38.5% in 2014, 37% in 2013)
RapidMiner, 31.5% ( 44.2% in 2014, 39% in 2013)
SQL, 30.9% ( 25.3% in 2014, NA in 2013)
Python, 30.3% ( 19.5% in 2014, 13% in 2013)
Excel, 22.9% ( 25.8% in 2014, 28% in 2013)
KNIME, 20.0% ( 15.0% in 2014, 6% in 2013)
Hadoop, 18.4% ( 12.7% in 2014, 9% in 2013)
Tableau, 12.4% ( 9.1% in 2014, NA in 2013)
SAS, 11.3% (10.9% in 2014, 10.7% in 2013)
Spark, 11.3% ( 2.6% in 2014, NA in 2013)

The results very much reflect my own interactions: whilst R has a significant installed user base and of course a vast repository of open source packages, Python seems to be gaining traction, certainly in part because Python seems to have become the lingua franca for scientific computing.

I've always thought of KNIME and Tableau as excellent tools for implementing workflows but looking at recent iterations it is clear there is now greater emphasis on interactive analysis.

There is a listing of data analysis tools for Mac OS X here.


CheS-Mapper Updated


CheS-Mapper has been updated. CheS-Mapper (Chemical Space Mapper) is a 3D-viewer for chemical datasets with small compounds. Whilst executable jar files can be downloaded from the website the source code is available on GitHub.

There is a review of an older version of CheS-Mapper here.




The latest issue of Journal of Cheminformatics has a paper that might be of interest to a variety of people involved in spectroscopy or data visualisation: SpeckTackle: JavaScript charts for spectroscopy.

We present SpeckTackle, a custom-tailored JavaScript charting library for spectroscopy in life sciences. SpeckTackle is cross-browser compatible and easy to integrate into existing resources, as we demonstrate for the MetaboLights database. Its default chart types cover common visualisation tasks following the de facto ‘look and feel’ standards for spectra visualisation.

SpeckTackle is an open-source JavaScript library to create custom-tailored charts for spectroscopy in life sciences. Implemented charts exist for mass spectrometry, one- and two-dimensional NMR, UV/VIS, IR, and general continuous data use cases such as chromatograms.

The authors kindly supply a demo web page demonstrating different chart types and functions of the SpeckTackle library. Example data is embedded in the web page (800 kb file size). Click on the buttons at the top of the page to see the data displayed. For the Chromatogram, Difference Chart and Spectral Match click the button then the Add Data button.

Highlighting a section of the spectra expands the view, and mouseover on the 2D NMR spectra provides a tooltip giving chemical shifts.

I've added this to the spectroscopy resources page.


DataWarrior Update


DataWarrior 4.1.1 is available for download. In addition to precompiled binaries, all Java source files and the script to build DataWarrior on Linux/MacOSX can be downloaded for free use under the GNU public license. DataWarrior is a free data visualization and analysis program with embedded chemical intelligence.


There is a review of DataWarrior here.


Wizard Pro 1.7.4


The popular data analysis tool Wizard Pro has been updated to version 1.7.4.

New in 1.7.4:

  • Fix a display issue in the Summary view when unchecking filters
  • Fix a crash in the Predict view when a model has either no outcome variables or no explanatory variables
  • Fix "Can't connect" error when attempting to connect to a database using a password containing special characters
  • Support for connecting to databases over IPv6
  • Support for schemas in PostgreSQL
  • Support for character types in PostgreSQL
  • Improved support for importing CSV files with improperly quoted values
  • Improved support for importing variable labels and frequency weights from SPSS files
  • Increased maximum length of exported SPSS variable labels from 120 characters to 256 characters

There is a review of Wizard Pro here, and there is a listing of data analysis tools for Mac OS X here.


Scripting Vortex 25


Whilst most of the Vortex scripts mentioned on this site to date involve chemical structures we should not forget that Vortex is an excellent general data analytics tool and the data set does not have to include any molecular structures. Recently I was asked about the number of publications associated with a particular potential therapeutic target and it struck me that Vortex might actually be an excellent tool to investigate this.

Read More.



A review of Wizard Pro


When I first started the list of data analysis packages for Mac OS X it was a fairly short list; over the years the list has grown and the diversity of packages has increased, from free packages like R to enterprise applications like IBM SPSS costing thousands. Some packages are enormously powerful but have a ferocious learning curve, whilst others are very easy to use but have only very limited capabilities. Wizard is an intuitive data analysis tool, designed from the ground up to be readily accessible but still retain the power of the sophisticated command line driven applications that only seem suitable for programmers. Wizard Pro allows the user to explore the data interactively without the need to learn a programming language. Read more here.


I should have added Wizard Pro runs under Yosemite and is on the list of Yosemite compatible applications, and has just been updated to version 1.6.7 (Feb 27th).


MOSAIC is a modular toolbox for analyzing data from single molecule experiments


The interactions of single molecules with nanopores are observed by measuring changes to the ionic current that occurs when the pore changes from an unoccupied (i.e., an open channel) to an occupied state. The electrical nature of the measurement allows us to model components of the physical system with equivalent electrical elements, and describe system behaviour collectively with the circuit response.

MOSAIC is a modular toolbox for analyzing data from single molecule experiments, primarily developed to analyze data from nanopore experiments. MOSAIC’s GUI greatly simplifies analyzing data from single-molecule nanopore experiments and provides easy access to most common algorithms and data types. MOSAIC can also be scripted using Python to run multiple analyses in batch mode. It can also be integrated into Mathematica, MATLAB or IGOR Pro workflows.

Balijepalli, A., Ettedgui, J., Cornio, A. T., Robertson, J. W. F., Cheung, K. P., Kasianowicz, J. J. & Vaz, C., ACS Nano 2014, 8, 1547–1553


Datamate Numeric Processor


Datamate Numeric Processor allows you to normalize, standardize, scale, and manage missing data and data outliers quickly and accurately.
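For readers unsure of the difference between these operations, here is a small Python sketch of min-max normalisation and z-score standardisation. It only illustrates the general techniques and says nothing about how Datamate implements them; the function names are hypothetical.

```python
from statistics import mean, stdev


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Convert values to z-scores using the sample standard deviation.

    Assumes at least two values that are not all identical.
    """
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


def drop_missing(values):
    """One simple way to handle missing data: discard None entries."""
    return [v for v in values if v is not None]


raw = [2, None, 4, 6]
clean = drop_missing(raw)
print(min_max_scale(clean))   # prints [0.0, 0.5, 1.0]
print(standardize(clean))     # prints [-1.0, 0.0, 1.0]
```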

There is a listing of data analysis tools for Mac OS X here.


Data Extractor has been updated


Data Extractor has been updated to version 1.4. Data Extractor allows you to extract data held in a sparse format inside various files and collect the data you need in an internal structured table. Collected data can be exported at any time in various formats (CSV, TSV, HTML, Custom). Data Extractor can parse thousands and thousands of files in a few seconds and collect the data inside. It uses simple smart instructions about how to recognize the data you need, how to extract it and where to put it inside a structured table, ready to be exported.

Update includes

  • Usage of prefix to identify a data with data on a newline (prefix with newline at end)
  • More resilient extraction algorithm
  • Faster algorithm, often 10x faster than the previous release
  • Improved multithreading capabilities
  • Fast adding of DataBase fields during 'Extraction Rules' editing and adding
  • Extraction of data based on position (example: 3rd element of a tab separated values row) at popular demand
  • Solved a bug causing crash during extractions under certain circumstances
  • Solved a bug with double newline at the end of files
  • Solved a bug under other specific text characteristic of files to extract
  • Other generic bug fixes
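The position-based extraction mentioned in the list (for example, the 3rd element of a tab separated row) amounts to something like the following Python sketch. This is only an illustration of the idea, not Data Extractor's own rule syntax, and the function name `field_at` is hypothetical.

```python
def field_at(line, position, delimiter="\t"):
    """Return the field at a 1-based position in a delimited row, or None."""
    fields = line.rstrip("\n").split(delimiter)
    if 1 <= position <= len(fields):
        return fields[position - 1]
    return None


row = "CHEMBL25\tCC(=O)Oc1ccccc1C(=O)O\t180.16\n"
print(field_at(row, 3))   # prints 180.16
```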

There are more data analysis tools here.


Alternative to OriginLab for Mac


A reader has contacted me asking for suggestions for alternatives to OriginLab that run on a Mac. Whilst you can run OriginLab under virtualisation there are many Data Analysis Packages that run under Mac OS X natively but I don't know enough about OriginLab to suggest which has similar capabilities. Any suggestions?


Data Creator updated


Data Creator has been updated to version 1.4.

Data Creator is an advanced data generator that can create a table filled with pseudo-random custom content in just a few clicks, absolutely invaluable when you need data to test a database or data analysis tool.

What's New in Version 1.4

  • New (faster) algorithm for records creation
  • Better handling of creation of a higher number of records in a single command
  • Improved more informative progress toolbar for longer operations
  • More commands to add, delete and set number of records
  • Improved menus
  • Bug fix regarding the fields table
  • Bug fix regarding the record table
  • Other bug fixes
  • Improved general stability using a more resilient code

There is a review of Data Creator here.


Plot2 a scientific 2D plotting program


I've just added Plot2 to the list of data analysis tools available for Mac OS X.

This project started in 1993 with SciPlot on NeXTStep and was updated at the end of November 2014. Plot2 is designed for everyday plotting: it is easy to use, creates high quality plots, and allows easy and powerful manipulations and calculations of data.


VIDA v4.3.0 released


OpenEye have announced the release of VIDA v4.3. This is a major update with many new features and enhancements, including improvements to depiction, 2D alignment, list manager manipulation, surface selection and display, default colouring schemes, both visual and list-driven atom subset selection, cluster viewing, colouring by SD property and extension management.

One feature I’m sure will be very popular is the new advanced depiction options, including atom property maps from the Grapheme TK, substructure highlighting, and 2D structure alignment, which are available for depiction in the 2D window and spreadsheet.


Support for Mac OS X 10.8 and 10.9 was added
Mac OS X 10.6 is no longer supported


ConTour: Data-Driven Exploration of Multi-Relational Datasets for Drug Discovery


Caleydo is an open source visual analysis framework targeted at biomolecular data. It has been described in a number of publications and I noticed that a recent project ConTour included chemical structures.

Large scale data analysis is nowadays a crucial part of drug discovery. Biologists and chemists need to quickly explore and evaluate potentially effective yet safe compounds based on many datasets that are in relationship with each other. However, there is a lack of tools that support them in these processes. To remedy this, we developed ConTour, an interactive visual analytics technique that enables the exploration of these complex, multi-relational datasets.

Christian Partl, Alexander Lex, Marc Streit, Hendrik Strobelt, Anne-Mai Wassermann, Hanspeter Pfister, Dieter Schmalstieg ConTour: Data-Driven Exploration of Multi-Relational Datasets for Drug Discovery IEEE Transactions on Visualization and Computer Graphics (VAST '14), to appear, 2014.

I’ve added Caleydo to the listing of data analysis tools.


Chemistry document classifier


The latest issue of J Cheminformatics has an article entitled “A document classifier for medicinal chemistry publications trained on the ChEMBL corpus”, Journal of Cheminformatics 2014, 6:40 doi:.

The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches in order to assist in the triage process, both for individual scientists and also for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier, which is able to successfully distinguish between publications that are ‘ChEMBL-like’ (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology make the ChEMBL corpus a unique resource for text mining.

The models, workflows and tools are freely available for download.


CheS-Mapper updated


CheS-Mapper has been updated to version 2.4.

New Features

  • Add Moss as new structural fragment mining algorithm
  • Show the number of distinct 3D positions (at the top right, alongside other dataset info)
  • Mapping warnings are now accessible within the viewer (Menu: Help > Show mapping warnings)
  • Add hint for multiselection of compounds via 'control'-key (is shown when zooming into compounds for the first 3 times)

More Changes

  • The viewer no longer zooms out when changing component size or spread
  • Add log conversion of feature values, by adding a new feature, instead of log-highlighting (gives better overview of log-distributed values, e.g. within the chart)
  • Multiple selected compounds are now highlighted within the chart for nominal features (was only possible for numerical features)

Fix

  • Fix error that showed structural fragment values as '1'/'0' instead of 'match'/'no-match'

CheS-Mapper (Chemical Space Mapper) is an open source 3D-viewer for chemical datasets of small molecules. A publication in the Journal of Cheminformatics describes an early version of the application (DOI: 10.1186/1758-2946-4-7), and there is a review here.


Sage mathematics software


Sage is a Python based free open-source mathematics software system licensed under the GPL. It builds on top of nearly 100 open-source packages, including NumPy, SciPy, matplotlib, SymPy, Maxima, GAP, FLINT and R, to provide a common unified interface, either as a notebook in a web browser or from the command line.

In addition to a local installation it is also possible to use SageMathCloud a free service with support from the University of Washington.

I’ve added Sage to the list of data analysis tools for Mac OS X.


CheS-Mapper 2.2 released


CheS-Mapper has been updated. CheS-Mapper (Chemical Space Mapper) is a 3D-viewer for chemical datasets with small compounds.

The tool can be used to analyze the relationship between the structure of chemical compounds, their physico-chemical properties, and biological or toxic effects. CheS-Mapper embeds a dataset into 3D space, such that compounds that have similar feature values are close to each other. It can compute a range of descriptors and supports clustering and 3D alignment.

There is a review of CheS-Mapper here.


R version 3.1.1 released.


I just noticed that R was updated last month to version 3.1.1 (Sock it to Me).

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms.

I’ve also updated the list of Data Analysis packages for Mac OSX.


Sentira Review


Sentira is a new chemical data visualisation tool from Optibrium. The focus is on easy-to-use data visualisation and as such it is probably targeted at the bench scientist rather than a specialist computational scientist. It supports a selection of plotting and SAR tools.

I’ve written a review of my first impressions.

There is also a list of data visualisation applications here.


DataWarrior review


DataWarrior is a data analysis tool that understands chemistry; it provides an efficient way to search, sort and analyse structure-activity data. DataWarrior was developed at Actelion, where it is highly integrated into the drug discovery platform. In 2014 it was decided to release DataWarrior to the public as a stand-alone tool without the integration layer. DataWarrior is a Java application and is thus cross-platform.

I’ve written a review on my initial impressions.


Aabel NG released


Gigawiz have just announced the release of an updated version of Aabel.

Aabel™ NG is the result of a 5-year special project and massive development effort, rewriting millions of lines of code to transform the Gigawiz flagship product into an icon of power and simplicity for professional users.

The update includes:

  • Modernization and optimization of the main processing code
  • A total redesign of the user interface, adding power with modeless simplicity
  • Complete Carbon (32-bit)-to-Cocoa (64-bit) transformation
  • Minimizing overhead between the user interface and the main processing code (a crucial step for sustainability and future development)
  • Replacement of high-level Cocoa classes with Aabel NG-specific code, a key alteration for any part of the application that requires high performance with large data sets
  • Addition of numerous new features

The transformation to 64-bit should enable much larger data sets to be handled, and there are a host of new statistical methods; however, I’m not familiar enough with the area to comment on how useful they are.

Interactive graphing means that you no longer have to manually update a plot to see how the changes look. The enhanced exploratory analytics could be really useful when working with large data sets.

Mac OS X Mavericks, an Intel-based Mac and a colour monitor with a minimum resolution of 1440 x 900 are required. There are also a couple of issues with backward compatibility.

Aabel v3/v2 are 32-bit applications; Aabel NG (v4) is 64-bit and hence cannot use Apple data structures and formats that have not been ported to 64-bit. In summary, the implications are as follows:

  • A feature of Aabel v3/v2, called database metaphor (i.e., a container that refers to a collection of worksheets), due to its dependency on Apple 32-bit subsystems, is not available in the 64-bit version.
  • While Aabel NG (v4) reads the worksheet files of Aabel v3/v2, it will NOT be able to read their viewer files, because the latter files have contents that require Apple 32-bit subsystems (which are unavailable in 64-bit).
  • In Aabel v3/v2, the System Alias Manager manages the connections between the graphs and their source worksheets. The Alias Manager has unresolved issues making the continuation of its use not sustainable. In Aabel NG (v4), the saved graphs will be hot-linked to their source worksheets along paths defined at the time their pipelines are created.

The price of the new version is $1250 (academic $750) and there is reduced pricing for upgrades.

I’m using Aabel 3 at the moment and I’m very happy with it. I’d be very interested to hear from anyone who upgrades.


DataDesk 7 released


I’m delighted to report that DataDesk 7 has been released. This update requires a Macintosh computer running Mac OS X 10.6, 10.7, 10.8, or 10.9, with an Intel processor and 2 GB RAM.

Data Desk brings fast, easy-to-use visual analysis to your desktop. It provides interactive graphical tools for exploring and understanding your data—for finding the patterns, relationships, and exceptions. While it implements many traditional statistics techniques suitable for data from planned experiments and sample surveys, Data Desk’s true strength is its powerful tools for data exploration. Its insightful graphic displays simplify intuitive investigation of your data.

DataDesk can handle very large datasets (up to 2 billion cases) and supports a wide variety of statistical techniques. It also supports linked plots, so that selections in one plot are highlighted in another and you get multiple views of your data. They have supported the Mac since the mid-1980s and I suspect it is now the oldest commercial application for the Mac. A series of continuous updates have kept DataDesk at the forefront of data analysis. The accompanying manual also serves as an invaluable resource for learning statistical analysis.

I’ve updated the list of data analysis tools available for the Mac.


Scripting Vortex 21, displaying web pages


Well, things can change quickly at times. In the last tutorial I wrote:

Vortex has a limited capacity to render HTML. It is, however, a very limited ability, so there is no support for JavaScript or CSS, but you can introduce a number of useful extra features.

If you download the latest daily build of Vortex from the Dotmatics Support site, there is a version that comes bundled with Java 8. If you download this version there are a host of new options for displaying plots. In particular you can now display web pages and follow links on pages, and there is support for JavaScript.

In Scripting Vortex 21 there is a demonstration of this feature and an example script that uses SMARTCyp to predict sites of metabolism.


There are many more scripts on the Hints and Tutorials Page.


Scripting Vortex 20


Vortex has a limited capacity to render HTML. It is, however, a very limited ability, so there is no support for JavaScript or CSS, but you can introduce a number of useful extra features.

In the latest tutorial you can find out how to use this to add images, plots and graphs to the molecular worksheet.

Scripting Vortex 20: Adding images to Vortex



Panoply netCDF, HDF and GRIB Data Viewer


I’ve just added Panoply to the list of data analysis applications. Panoply is an application from NASA that plots geo-gridded and other arrays from netCDF, HDF, GRIB, and other datasets. You can:

  • Slice and plot geo-gridded latitude-longitude, latitude-vertical, longitude-vertical, or time-latitude arrays from larger multidimensional variables.
  • Slice and plot "generic" 2D arrays from larger multidimensional variables.
  • Slice 1D arrays from larger multidimensional variables and create line plots.
  • Combine two geo-gridded arrays in one plot by differencing, summing or averaging.
  • Plot lon-lat data on a global or regional map using any of over 100 map projections or make a zonal average line plot.
  • Overlay continent outlines or masks on lon-lat map plots.
  • Use any of numerous color tables for the scale colorbar, or apply your own custom ACT, CPT, or RGB color table.
  • Save plots to disk as GIF, JPEG, PNG or TIFF bitmap images or as PDF or PostScript graphics files.
  • Export lon-lat map plots in KMZ format.
  • Export animations as AVI or MOV video or as a collection of individual frame images.
  • Explore remote THREDDS and OpenDAP catalogs and open datasets served from them.
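
To give a feel for two of the operations listed above, combining two geo-gridded arrays and making a zonal average, here is a minimal Python sketch. It is my own illustration and not Panoply code: the function names `combine` and `zonal_average` are made up, a grid is assumed to be a plain list of latitude rows each holding one value per longitude, and missing values and area weighting are ignored.

```python
def combine(grid_a, grid_b, op):
    """Combine two same-shaped lat x lon grids cell by cell.

    op is one of "difference", "sum" or "average" (names assumed here).
    """
    ops = {
        "difference": lambda a, b: a - b,
        "sum": lambda a, b: a + b,
        "average": lambda a, b: (a + b) / 2,
    }
    f = ops[op]
    return [[f(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(grid_a, grid_b)]


def zonal_average(grid):
    """Average each latitude row over all longitudes."""
    return [sum(row) / len(row) for row in grid]


# Two tiny 2 (lat) x 3 (lon) grids with made-up values.
a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]

print(combine(a, b, "difference"))
print(zonal_average(a))
```

The zonal average gives one number per latitude, which is what ends up on the line plot.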



The Data analysis app Wizard Pro has been updated


Wizard Pro has just been updated. Wizard Pro is a data analysis application with easy exploration in mind. The new release notes include:


  • Database support: import from SQLite, MS Access, MySQL, and PostgreSQL
  • Numbers '13 support
  • Timestamp / time-of-day support
  • Customizable data partitions. Separate numeric data into groups of equal size, intervals of equal width, or user-defined intervals
  • Best-fit lines on scatterplots, and reference lines on Q-Q plots
  • Visualize critical values and p-values with the new Bottom Line popover (see screenshots)
  • New "Copy Predicted Values" menu item applies a predictive model to the full data set


  • Histograms are much sharper now
  • More tick marks and labels on all the graphics
  • Full access to the Column tools from inside the Raw Data view
  • New preference option: choose a "Friendly" or "Neutral" font for The Bottom Line
  • Filtered and frequency-weighted models run much faster than before
  • Excel output is much prettier -- with bold, italics, and indentation for clarity.


  • Exporting data now has a progress bar and a Cancel button
  • Support for up to 6 data filters
  • Support for up to 5 pivot columns
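
The three kinds of data partition mentioned in the release notes (groups of equal size, intervals of equal width, and user-defined intervals) are easy to confuse, so here is a small Python sketch of the difference. This is only my illustration of the general idea; it is not how Wizard Pro implements it, and the function names are invented for the example.

```python
import bisect


def equal_width_bins(values, n_bins):
    """Label each value with one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value is clamped into the last interval.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_size_bins(values, n_bins):
    """Label each value so that every group holds (nearly) the same count."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // len(values)
    return labels


def user_defined_bins(values, edges):
    """Label each value by the user-supplied, sorted interval edges."""
    return [bisect.bisect_right(edges, v) for v in values]


data = [1, 2, 3, 4, 100, 200]  # made-up, skewed data
print(equal_width_bins(data, 2))
print(equal_size_bins(data, 2))
print(user_defined_bins(data, [3, 50]))
```

With skewed data like this, equal-width intervals put almost everything in the first group, while equal-size groups split the values evenly.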

There are more graphing or plotting applications on the data analysis page.


DataGraph updated to version 3.2


I just saw that the highly regarded DataGraph has been updated. This update includes:

  • Pivot Table command - great for data analysis.
  • Hover information to inspect data in a plot.
  • A new method to import text files.
  • Runs faster when you have large data sets.
  • Improvements to basically every drawing command.
  • More formatting options.
  • More operations to edit data, automatically fill in entries etc using menu and context menu entries.
  • Can label graphs and export multiple graphs at the same time

There are more graphing or plotting applications on the data analysis page.

There is a review of DataGraph 3.0 in The Journal of Statistical Software


KST added to list of data analysis tools


I’ve just added KST to the list of data analysis applications for Mac OSX

Features of Kst include:

  • Robust plotting of live "streaming" data.
  • Powerful keyboard and mouse plot manipulation.
  • Powerful plugins and extensions support.
  • Large selection of built-in plotting and data manipulation functions, such as histograms, equations, and power spectra.
  • A number of unique tools which dramatically improve efficiency, such as the "Data Wizard" for fast and easy data import, the "Edit Multiple" mode to bulk-edit most objects, or the "Change Data File" tool to compare results from different experiments
  • Color mapping and contour mapping capabilities for three-dimensional data, as well as matrix and image support.
  • Monitoring of events and notifications support.
  • Built-in filtering and curve fitting capabilities.
  • Convenient command-line interface.
  • Powerful graphical user interface with non-modal dialogs for an optimized workflow.
  • Support for several popular data formats.
  • Multiple tabs.
  • Extended annotation objects similar to vector graphics applications.
  • High-quality export to bitmap or vector formats.


Graph Builder Updated


Graph Builder has been updated to version 10.9.16.

  • Made the heat map (aka: image map, point fill) and 3D scatter, surface and volume color mapping editor significantly better.
  • Added a palette that shows how to script a multi-level animated pie chart.
  • Removed deprecated system calls.
  • Adjusted many items under the hood in preparation for v11.
  • Special Note: The v11 build is being worked on and your feedback is very welcome.

Graph Builder is a powerful application rich in graphic editing, creation and programming to facilitate the visualization of information. It has a good complement of 2D and 3D graph features, a full-fledged user interface and is programmable. Paste data into table editors, write scripts to generate data, load an Xcode plugin you write for data generation and to retrieve data from external sources.

There is a comprehensive list of data analysis tools for Mac OSX here.


GAUSS Mathematical and Statistical System


I’ve just added GAUSS Mathematical and Statistical System to the page of data analysis tools for Mac OS X.

The GAUSS Mathematical and Statistical System is a fast matrix programming language widely used by scientists, engineers, statisticians, biometricians, econometricians, and financial analysts. Designed for computationally intensive tasks, the GAUSS system is ideally suited for the researcher who does not have the time required to develop programs in C/C++ or FORTRAN but finds that most statistical or mathematical “packages” are not flexible or powerful enough to perform complicated analysis or to work on large problems.


Data analysis tools


I’ve just updated the Data Analysis tools for Mac OSX page. I’ve fixed the broken links and added another eight packages to bring the total up to ninety. Browsing through, it looks like just about every area of science is covered, from open-source packages to enterprise-focussed applications.




I’ve just added Graph-R to the page of data analysis tools.

Graph-R is an application used to create 3-dimensional contours, contour lines, wire frames, and scatter diagrams from numeric data files (CSV files). Setting up a graph is easy. The perspective direction can be freely changed using your mouse. Graphs that are created can be saved as PNG or JPEG files.

This software requires Mac OSX 10.8 or later.


Graph Builder


I just got a message about an update to Graph Builder, a very popular and powerful application from VVimaging, Inc, rich in graphic editing, creation and programming to facilitate the visualization of information. It has an excellent complement of 2D and 3D graph features, a full-fledged user interface and is programmable. Paste data into table editors, write scripts to generate data, load an Xcode plugin you write for data generation and to retrieve data from external sources. It also supports dynamic graphs.




The free univariate data modeling package Regress+ has been updated to version 2.7.1.

Regress+ offers:

  • Plain textfile input
    • Equations, with or without uncertainties (weights)
    • Distributions, continuous or discrete data
    • Discrete data grouped or ungrouped
  • Datasets up to 4,294,967,295 points (minimum 7)
  • Up to 10 parameters
  • User-selected optimization criterion (where appropriate)
    • Least squares
    • Minimum average abs(residual)
    • Maximum likelihood
    • Minimum K-S statistic
    • Minimum chi-square statistic
  • Full, dated Report (textfile)
  • Robust goodness-of-fit testing for distributions
    • Tunable precision
  • [Optional] State-of-the-art (BCa) central confidence intervals (90, 95 and 99 percent)
    • Tunable precision
  • High-quality (PDF, PNG) plots with one keystroke!
    • X/Y plot, with or without error bars
    • PDF plot
    • CDF plot
    • Probability plot for goodness-of-fit (see above)
    • [Optional] Logarithmic axes (when appropriate)
    • Editable axis labels
    • Automatic tick marks/labels (see above)
  • [Optional] Predictions for unobserved values or percentiles
    • With confidence intervals if desired
  • [Optional] Constant parameter(s)
  • 21 Built-in Equations
    • Plus user-defined model
    • [Optional] Test residuals for systematic error
    • [Optional] List data with fitted estimates and residuals
    • [Optional] Simulated-annealing mode for initial parameter estimates
  • 59 Built-in Distributions
    • 9 continuous, symmetric
    • 27 continuous, skewed
    • 11 continuous mixtures
    • 6 discrete
    • 6 discrete mixtures
    • [Optional] Creation of synthetic samples
  • No hidden assumptions anywhere
    • No approximations, apart from those common to sampling and bootstrapping generally
    • No data transformations of any kind
  • Extensive documentation
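
The choice of optimization criterion in the list above matters more than it might seem. As a minimal sketch in Python (my own illustration, not Regress+ code, with made-up data and function names), consider the simplest possible model: fitting a single constant to a sample. Least squares is then minimized by the mean, while the minimum average abs(residual) criterion is minimized by the median, so one outlier affects the two fits very differently.

```python
from statistics import mean, median


def sum_squared_residuals(data, c):
    """Least-squares criterion for the constant model y = c."""
    return sum((x - c) ** 2 for x in data)


def mean_abs_residual(data, c):
    """Average absolute residual for the constant model y = c."""
    return sum(abs(x - c) for x in data) / len(data)


data = [1.0, 2.0, 3.0, 4.0, 50.0]  # made-up sample with one outlier

ls_fit = mean(data)     # minimizes the sum of squared residuals
lad_fit = median(data)  # minimizes the average absolute residual

print("least squares fit:", ls_fit)
print("min average abs(residual) fit:", lad_fit)
print("squared criterion at each fit:",
      sum_squared_residuals(data, ls_fit),
      sum_squared_residuals(data, lad_fit))
print("absolute criterion at each fit:",
      mean_abs_residual(data, ls_fit),
      mean_abs_residual(data, lad_fit))
```

Each fit wins on its own criterion and loses on the other, which is why it is useful to have the criterion as a user-selected option.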

There is a comprehensive listing of Mac OSX data analysis packages here.


R update


I just noticed that there is an update to R on the CRAN website.

This binary distribution of R and the GUI supports 64-bit Intel based Macs on Mac OS X 10.6 (Snow Leopard) or higher. Since R 3.0.0 the binary is a single-arch build and contains only the x86_64 (64-bit Intel) architecture. PowerPC Macs and 32-bit Macs are only supported by building from sources or by older binary R versions. The default package type is "mac.binary" and the binary repository layout has changed accordingly.

There is a listing of data analysis packages for Mac OS X here.


Scripting Vortex 16


OCHEM is a free, open-access site of annotated models and chemical data. OCHEM contains 1831772 experimental records for about 477 properties collected from 12457 sources. You are free to upload your own data and also to build predictive models using existing data or your own.

There are also a number of already built models that the public can access. These include:

  • Ames test
  • CYP1A2 inhibition
  • LogP and Solubility

You can run predictions on OCHEM using simple REST-like web services. These Vortex scripts submit tasks to the various models and then retrieve the resulting predictions.


Graph Builder updated


Graph Builder has been updated:

  • Added data input for 2D vector field presentation.
  • Added programmed and animated 2D vector field palette.
  • Updated script documentation for 2D vector field animation and programming.
  • Added a preference option to turn on the built-in network graphing server.
  • Added ability to display dynamic and programmed Graph Builder document results over the web.
  • Updated the manual to explain new features

There is a listing of data analysis applications for Mac OSX here.


Updated AppleScript Resources


I’ve just updated the AppleScript Resources page; in particular I’ve included updates to the great tools provided by Satimage-Software. These include Smile, a programming and working environment that you can use in a variety of situations: to perform scientific work, to handle CGI requests, to automate an intensive file-processing task, to produce computed graphics, to edit XML files, to work with Unicode text, or to build a GUI for your scripts. They also include SmileLab; the SmileLab license adds data visualization features to Smile, the automation environment by Satimage-Software.


In SmileLab you can

  • extract data from files (default data formats supported: text, binary, FITS, XNF, ...),
  • perform data processing using commands provided with Smile or controlling external code
  • visualize your data in the most usual forms (curves, scatter plots, bar graphs, contour lines, color maps and vector fields in 2D, and 3D surfaces),
  • customize the interaction of the user with the plots (handling mouse clicks, contextual menus, keyboard events...) and create custom interfaces,
  • export your plot as a PDF file, as a bitmap picture (PNG, JPEG, TIFF, BMP, PSD) or as a QuickTime movie.