Macs in Chemistry

Insanely great science

 

A few thoughts on scientific software

I recently got a rather sad email

It seems that Third Street Software quietly disappeared, breaking the syncing for Sente (reference management).

I've also heard about a couple of other smaller software developers who are finding life very tough, and it started me thinking about the status of scientific software. After exchanging emails with a number of people in the industry (many thanks for their input) I thought I'd collect a few thoughts on my blog. The first thing I should say is I'm not a programmer; I'm a chemist who does some cut-and-paste scripting, leaning heavily on Stack Overflow.

Life in a Startup is Tough

I'm sure that there are many developers who would enjoy starting a scientific software company: they have an interesting scientific insight and a laptop, or have developed software in an academic lab that other scientists like to use. However, the conventional wisdom is that 90% of startups fail, and there are miles of newsprint dedicated to the reasons why. The bottom line is I wouldn't expect the scientific software industry to be substantially different from others. However, as a user of scientific software I'd like to see small companies given the best chance for success.

Is the scientific software market too small to support startup companies?

I guess the answer is both yes and no. Some software, for example reference management, could be aimed at the entire scientific community, which would be a reasonably large market. However, it is already served by a number of commercial and open-source products, so the challenge here is to demonstrate an advantage over the existing products. At the other end of the scale there is highly specialised software used by a small group of scientists; unless that small group is willing to pay a substantial amount it is difficult to see a long-term viable model. It may be possible to strike a long-term deal with a major Pharma company that provides financial viability, but being reliant on a single major customer leaves a company vulnerable. In addition, there is a growing resistance to such long-term commitments. A key factor is round-the-clock support, which is difficult for a small company to offer if they are not running an office on two or three continents. The other issue is visibility: in the post App-store world there are lots of applications clamouring for attention, and relying on a viral Twitter campaign is probably not enough.

There is also the issue that some of the large software vendors offer a platform that "does everything", so once you've forked out for it there's a barrier to buying anything else because in one sense you're paying again for something you already have, even if it's not quite as good. In addition, even in the large software vendors you may find there are actually only a handful of developers who really know the code; if they leave then support becomes problematic. You can't replace them with just any software developer, since they need to have a background in science as well.

the docking package we have access to is not the best but it is OK for bench chemists

Open-Source Tools

I've been asked on several occasions if it is possible to make a living supporting open-source software. I think it is very tough, simply because people equate open-source with free. The reality is that implementing open-source solutions can be very challenging and companies probably have to learn this the hard way. Their response is then usually to purchase a commercial product.

Most companies just aren't ready to pay reasonable amounts of money for open source support. It seems like you almost have to have products that you can sell to subsidize the open-source work.

Even if there is a "product" it is unlikely that they will value the large number of FTEs that may have gone into bringing the code to the current state.

That said, there is certainly a role for integrating new academic/open-source/commercial tools. I notice Flare from Cresset now has a Python API and is accessible via a Jupyter notebook, using RDKit for the chemoinformatics support. This might be a platform that would allow smaller developers to compete with the "does everything" offering from the large vendors. With the change in the Pharma industry, with larger companies shrinking and a proliferation of university spin-outs, biotechs and small pharma, this mix-and-match model might be of greater interest.

Academic Software

I've seen a lot of software published by academic labs, and a fair amount of the "Hints and Tutorials Pages" are taken up with tips on how to compile/install on Macs. However, it seems to me that universities rarely have plans for longer-term support: once the post-doc who wrote the software leaves or the grant runs out, development and support cease. Some groups open-source the code in the hope that users might provide support, but in reality this mostly just ends up as "abandonware". This may not be because the code is not useful; it is simply because the users are not programmers and lack the skills. Another issue is that the code may well not be documented and may also be poorly written. This should not be a surprise though: much software is written to solve a particular scientific problem, not with the aim of sharing the code, and I've never seen a grant proposal that contains funds for long-term maintenance and development.

I have an undergrad who's porting an academic "open source" code …. It's taken him 3-4 weeks to understand the code it was so poorly written.

The reality is that even some of the most important and widely used Python code, such as NumPy, pandas and matplotlib, is maintained by only a small number of people. This just underlines that most of the people in a software "community" are users, not developers. Titus Brown has recently written several posts on the issue of software sustainability and also makes the point that academic labs seem to prefer to write their own solution rather than support an existing one.

Interestingly, academia has failed quite spectacularly in the area of converging solutions. The plethora of virtually identical bioinformatics solutions to any given problem (mapping! annotation!) largely exists because in academia we are incentivized more for the appearance of knowledge production than for actual progress on hard problems.

I suspect part of the problem is scientists being unaware of what has already been developed.

Some of the quantum machine learning groups had no idea there was CCLIB for parsing output files. Now they're using it more.

For some areas Bioconda may allow users to find and install existing packages.

Bioconda is a channel for the conda package manager specializing in bioinformatics software.

What might be done?

There is surprisingly little publicity for open-source chemoinformatics toolkits, and products that build on them sometimes fail to mention it. Perhaps there should be a "Built on RDKit" sticker that developers could be encouraged to use.

I'm not sure we need yet another chemoinformatics toolkit; the open-source toolkits RDKit, OpenBabel and CDK are all very well established. So build on existing toolkits, and don't try to write your own SDF file parser. Whilst there are lots of programming languages, looking around it seems that Python has become the lingua franca for chemoinformatics. There is an interesting article on the developerWorks blog, "Here's why you should use Python for Science research", and I won't expand on it here. It may be that for reasons of speed a library needs to be written in Fortran or C; however, if you want to widen its use, write a Python wrapper. It might be worth considering how different users might want to interact with the code: expert users might be happy importing into a Jupyter notebook, whereas others might want a simple menu option. I've written Vortex scripts that access machine learning models, and I suspect access from a dropdown menu in Vortex is the most used option.
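As a rough illustration of serving more than one kind of user from the same code, here is a minimal sketch in Python. Everything in it is hypothetical: `predict` is a placeholder that simply averages its inputs, standing in for a call into an existing toolkit or a wrapped Fortran/C library, and `main` is an assumed command-line entry point that a menu item or script could call.

```python
import argparse
import statistics
from typing import List, Optional, Sequence


def predict(descriptors: Sequence[float]) -> float:
    """Placeholder model: returns the mean of the descriptor values.

    In real use this would call an existing toolkit or a compiled library.
    """
    if not descriptors:
        raise ValueError("at least one descriptor value is required")
    return statistics.fmean(descriptors)


def main(argv: Optional[List[str]] = None) -> int:
    """Command-line front end that wraps the same predict() function."""
    parser = argparse.ArgumentParser(description="Score one set of descriptors.")
    parser.add_argument("values", nargs="+", type=float, help="descriptor values")
    args = parser.parse_args(argv)
    print(f"{predict(args.values):.3f}")
    return 0


# Expert users import predict() directly in a notebook:
score = predict([0.5, 1.5, 2.5])

# Other users (or a menu item) go through the command-line front end:
exit_code = main(["0.5", "1.5", "2.5"])  # prints 1.500
```

The point of the split is that the scientific logic lives in one importable function, and each interface is only a thin layer on top of it.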

Another approach might be to build nodes for workflow tools like KNIME or Pipeline Pilot to make it easier for other scientists to integrate a new piece of software.

As regular readers will be aware, I've become a frequent user of Jupyter notebooks and I wonder if this might be used as a "standard" way to interact with software packages. One nice feature is that it provides an easy way to document the scientific workflow: a computational chemistry electronic lab notebook.

Write high quality code. I've lost count of the number of times I've heard of weeks or months wasted trying to understand a piece of open-source code. At CICAG we are considering running a training course on "Python for Chemists" which could include a talk from a commercial developer on writing sustainable code. The course would also aim to show how best to use the existing toolkits, with perhaps guidance on building user interfaces.

If you really need to create a new file type, make sure it is plain text and consider writing a plugin for OpenBabel so other developers can import/export easily.

Make your code citable: https://guides.github.com/activities/citable-code/

There are a few examples of community-led groups that might serve as models for software sustainability.

CCP4 exists to produce and support a world-leading, integrated suite of programs that allows researchers to determine macromolecular structures by X-ray crystallography, and other biophysical techniques. This is supported by several UK funding bodies and by a number of industrial partners. One notable feature is that information is transferred between different programs within the suite by clearly defined (and policed) standard filetypes that include data and capture essential metadata that might be needed further down the processing pipeline. Any new package that is contributed to the suite is evaluated by working groups in terms of whether it brings an enhancement and also whether the code meets the standards required. If accepted it will then be supported. Whilst relying on standard filetypes for exchange might seem rather cumbersome, it does mean that it is relatively straightforward to introduce new software tools. CCP4 has now been running for decades and has a well organised structure that could be used in other software communities.
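The general idea of a "policed" exchange file, where data always travels with the metadata later steps will need, can be sketched in a few lines of Python. This is not how CCP4 does it: the JSON layout, the required field names and the example values below are all made up for illustration.

```python
import json
from typing import Any, Dict

# Hypothetical metadata fields that every exchange file must carry.
REQUIRED_METADATA = ("program", "version", "units")


def _check_metadata(metadata: Dict[str, Any]) -> None:
    """Raise ValueError if any required metadata field is absent."""
    missing = [key for key in REQUIRED_METADATA if key not in metadata]
    if missing:
        raise ValueError(f"missing metadata: {', '.join(missing)}")


def dumps_record(data: Dict[str, Any], metadata: Dict[str, Any]) -> str:
    """Serialise data plus metadata to plain text, refusing incomplete metadata."""
    _check_metadata(metadata)
    return json.dumps({"metadata": metadata, "data": data}, indent=2, sort_keys=True)


def loads_record(text: str) -> Dict[str, Any]:
    """Parse a record and apply the same metadata check on the way in."""
    record = json.loads(text)
    _check_metadata(record.get("metadata", {}))
    return record


# Round trip with made-up example values.
text = dumps_record(
    {"cell_lengths": [78.1, 78.1, 37.2]},
    {"program": "toy_indexer", "version": "0.1", "units": "angstrom"},
)
record = loads_record(text)
```

Because both the writer and the reader enforce the same check, a new tool can join the pipeline as long as it reads and writes this one format.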

There is also

There is currently increased interest in Artificial Intelligence (AI) and machine learning, and this may be an opportunity for small software companies to make an impact. Many experts in artificial intelligence lack the domain expertise to identify suitable scientific problems and don't have access to the data required to build useful models, so this could be an ideal opportunity for collaboration. At the moment it is not clear to me where AI will have the greatest impact or which algorithms will be best suited to which problems, but there are certainly opportunities out there.

One area that might be of interest is computational tools that require major computing resources. For customers that might only have an intermittent need it makes sense to outsource rather than try to build an internal compute resource. In this case it may be possible to build a service model. I suspect for this work it would be necessary to build a single web page/site that offers access to a variety of computational tools via a consistent interface. Perhaps a system could be built around a fee for service, so users could pay for single-use access to a particular tool and save the results in a standard file format.

Last Updated 17 July 2018