Macs in Chemistry

Insanely Great Science

Unix tips for dealing with very large files


I've updated the page describing a variety of unix commands that can be helpful when dealing with very large files. In particular I've added details of how to split very large files into more manageable chunks.

Dividing SDF files can be problematic since each division needs to fall at the end of a record, which is marked by "$$$$". I've spent a fair amount of time searching for a high-performance tool that will work for very, very large files. Many people suggest using awk.

AWK (awk) is a domain-specific language designed for text processing and typically used as a data extraction and reporting tool. Like sed and grep, it's a filter, and is a standard feature of most Unix-like operating systems.

I've never used awk, but with much cutting and pasting from the invaluable Stack Overflow this script seems to work.

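awk -v RS='\\$\\$\\$\\$\n' -v nb=1000 -v c=1 '
{
  # Output name is an assumption: the input file name plus a chunk counter
  file = sprintf("%s.chunk%d", FILENAME, c)
  printf "%s%s", $0, RT > file
}
NR % nb == 0 { close(file); c++ }
' /Users/username/Desktop/SampleFiles/HitFinder_V11.sdf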

The result is shown in the image below. There are a couple of caveats: this script only works with the version of awk shipped with Big Sur (on older systems you should be able to install gawk using Homebrew and use that instead), and it requires that the file has Unix line endings. The resulting file names are not ideal, and if there are any awk experts out there who could tidy them up I'd be delighted to hear from you.
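
As a possible alternative, here is a minimal sketch that works line by line instead of relying on a regular-expression record separator, so it should run on any POSIX awk and also copes with Windows line endings. It is not the script above: the function name split_sdf, its arguments and the zero-padded chunk names are all assumptions made for illustration, and the sample file is a made-up stand-in for a real SDF file.

```shell
# split_sdf INPUT NB PREFIX: write NB records per chunk to PREFIX0001.sdf, PREFIX0002.sdf, ...
# (function name, arguments and naming scheme are illustrative assumptions)
split_sdf() {
  awk -v nb="$2" -v prefix="$3" '
    BEGIN { c = 1; file = sprintf("%s%04d.sdf", prefix, c) }
    { print > file }                      # copy every line to the current chunk
    /^\$\$\$\$/ {                         # a "$$$$" line closes one record
      if (++n % nb == 0) {                # chunk is full: move on to the next file
        close(file)
        c++
        file = sprintf("%s%04d.sdf", prefix, c)
      }
    }
  ' "$1"
}

# Build a tiny stand-in for an SDF file: five records, each ending in "$$$$".
workdir=$(mktemp -d)
for i in 1 2 3 4 5; do
  printf 'molecule_%s\n  placeholder atom block\nM  END\n$$$$\n' "$i"
done > "$workdir/sample.sdf"

# Split into chunks of two records and count the records in each chunk.
split_sdf "$workdir/sample.sdf" 2 "$workdir/chunk_"
grep -c '^\$\$\$\$' "$workdir"/chunk_*.sdf
```

With five records and two records per chunk this produces three files holding 2, 2 and 1 records. Closing each chunk before opening the next keeps the number of open files constant, which matters when a very large input produces thousands of chunks.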



Coronavirus stats online.


If you are using a Unix-based operating system (such as Mac OS X), just type this command in a Terminal window



Looks to be about 12 hours behind latest data.



Dealing with large data files


Spotted this on Twitter.

I've added xsv to the page of tips for handling very large data files.

xsv is a command line program for indexing, slicing, analyzing, splitting and joining CSV files.
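
To give a flavour of what that looks like in practice, here is a small sketch using a few xsv subcommands (headers, count, slice and split). The sample file, its column names and the output directory are assumptions for illustration, and the commands are skipped if xsv is not installed.

```shell
# Hypothetical sample file standing in for a large CSV.
csv=$(mktemp -d)/compounds.csv
printf 'id,name,mw\n1,aspirin,180.16\n2,caffeine,194.19\n3,ibuprofen,206.28\n4,paracetamol,151.16\n5,naproxen,230.26\n' > "$csv"

if command -v xsv >/dev/null 2>&1; then
  xsv headers "$csv"                                  # list the column names
  xsv count "$csv"                                    # number of data rows (header excluded)
  xsv slice -s 1 -l 2 "$csv"                          # two rows starting at record index 1
  xsv split -s 2 "$(dirname "$csv")/chunks" "$csv"    # chunk files of at most two rows each
else
  echo "xsv not found; install it before running these commands" >&2
fi
```

For a genuinely large file it is probably worth running xsv index on it first, since several of the other subcommands can use the index to work faster.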


Poll results: How do you pronounce zsh?


A week ago I posted a poll asking how to pronounce zsh, and the results are in.


The winner is Zeeshell (pronounced like seashell); however, this is clearly not unanimous.

Several readers also pointed out this thread on StackExchange, "What are the practical differences between Bash and Zsh?", which contains lots of useful information.

This book might also be useful: Moving to zsh.


Poll How do you pronounce zsh?


The poll results are here.

As people migrate to Catalina there is an option to update your default shell.

zsh is the new default shell for new users (bash is the default shell in macOS Mojave and earlier), so if you are upgrading you may want to change your default shell to zsh.

Paul Falstad wrote the first version of Zsh in 1990 while a student at Princeton University. The name zsh derives from the name of Yale professor Zhong Shao (then a teaching assistant at Princeton University); Paul Falstad regarded Shao's login-id, "zsh", as a good name for a shell.

You might also like to look at oh-my-zsh

A delightful community-driven (with 1,300+ contributors) framework for managing your zsh configuration. Includes 200+ optional plugins (rails, git, OSX, hub, capistrano, brew, ant, php, python, etc), over 140 themes to spice up your morning, and an auto-update tool that makes it easy to keep up with the latest updates from the community.

It was pretty obvious how to pronounce "Bash" (the shell's name is an acronym for Bourne-again shell, a pun on the name of the Bourne shell that it replaced), but what about "zsh"?

This book might also be useful: Moving to zsh.


Top 12 unix commands for data scientists.


A really useful post on KDnuggets.

With its beautiful, intuitive interface, it is sometimes easy to forget that Mac OS X has Unix underpinnings and that the Terminal gives access to a whole set of invaluable tools.

This post is a short overview of a dozen Unix-like operating system command line tools which can be useful for data science tasks. The list does not include any general file management commands (pwd, ls, mkdir, rm, ...) or remote session management tools (rsh, ssh, ...), but is instead made up of utilities which would be useful from a data science perspective, generally those related to varying degrees of data inspection and processing. They are all included within a typical Unix-like operating system as well.

If you regularly have to deal with very large data files some of these commands will be invaluable, for example:

head outputs the first n lines of a file (10, by default) to standard output. The number of lines displayed can be set with the -n option.

head -n 5 myfile.txt
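
The reason head is so handy for very large files is that it stops reading once it has printed the requested lines, so it returns almost instantly however big the file is. A minimal sketch, using a generated file of one million lines as a stand-in for a real data file (the file itself is an assumption for illustration):

```shell
# Hypothetical stand-in for a very large file: one million numbered lines.
big=$(mktemp)
seq 1 1000000 > "$big"

head -n 5 "$big"                 # first five lines only
head -n 1000 "$big" | wc -l      # pipe a manageable sample into another tool
wc -l < "$big"                   # total line count, for comparison
```

Piping a sample from head into another command is a cheap way to test a processing step before running it over the whole file.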

Read more here.