Macs in Chemistry

Insanely Great Science

Unix tips for dealing with very large files


I've updated the page describing a variety of unix commands that can be helpful when dealing with very large files. In particular I've added details of how to split very large files into more manageable chunks.

Dividing sdf files can be problematic since we need each division to be at the end of a record defined by "$$$$". I've spent a fair amount of time searching for a high-performance tool that will work for very, very large files. Many people suggest using awk

AWK (awk) is a domain-specific language designed for text processing and typically used as a data extraction and reporting tool. Like sed and grep, it's a filter, and is a standard feature of most Unix-like operating systems.

I've never used awk but with much cut and pasting from the invaluable Stack Overflow this script seems to work.

awk -v RS='\\$\\$\\$\\$\n' -v nb=1000 -v c=1 '
  printf "%s%s",$0,RT > file 
NR%nb==0 {c++}
' /Users/username/Desktop/SampleFiles/HitFinder_V11.sdf

The result is shown in the image below. There are a couple of caveats, this script only works with the version of awk shipped with Big Sur (you should be able to install gawk using Home Brew and use that on older systems), and it requires the file has unix line endings. The resulting file names is not ideal and if there are any awk experts out there who could tidy it up I'd be delighted to hear from you.


blog comments powered by Disqus