Macs in Chemistry

Insanely great science

 

Unix commands for helping deal with very large files

I'm regularly handling very large files containing millions for chemical structures and whilst BBEdit is my usual tool for editing text files in practice it becomes rather cumbersome for really large files (> 2 GB). In these cases I've compiled a useful list of UNIX commands that make life easier. Whilst I use them when dealing with large chemical structure files they are equally useful when dealing with any large text or data files.

If you start up the Terminal application which is in the Utilities folder in the Applications folder. You should see a window something like this:-

terminal

The shell prompt (or command line) is where one types commands, the appearance may differ slightly depending on customisation.

File and Path Names

One thing to remember is that UNIX is less accommodating than Mac OS X when it comes to file name and the paths to files. When you see the full path to a file it will be shown as a list of directory names starting from the top of the tree right the way down to the level where your file sits. It will look something like this.

/Users/Username/Projects/bioconda-recipes/recipes/rdkit/2016.03.3/meta.yaml

Directory names can contain spaces but you’ll often find that people working in Unix prefer to keep both file and directory names without spaces since this simplifies the process of working with these files on a command line. Directory or file names containing a space must be specified on the command line either by putting quotes around them, or by putting a \ (backslash) before the space.

'/Users/Username/Projects/bioconda recipes/recipes/rdkit/2016.03.3/meta.yaml '

or

/Users/Username/Projects/bioconda\ recipes/recipes/rdkit/2016.03.3/meta.yaml

If typing out the full path seems rather daunting remember if you drag a file in the Finder and drop it on a Terminal window the path to the file will be automatically generated.

Examining Large files

The other thing to remember is that there is an extensive help available, so if you can't remember the command line arguments "man" can provide the answers. Simply type man followed by the command.

MacPro:~ Chris$ man wc


WC(1)                     BSD General Commands Manual                    WC(1)

NAME
     wc -- word, line, character, and byte count

SYNOPSIS
    wc [-clmw] [file ...]

DESCRIPTION
    The wc utility displays the number of lines, words, and bytes contained
    in each input file, or standard input (if no file is specified) to the
    standard output.  A line is defined as a string of characters delimited
    by a <newline> character.  Characters beyond the final <newline> charac-
     ter will not be included in the line count.

    A word is defined as a string of characters delimited by white space
     characters.  White space characters are the set of characters for which
     the iswspace(3) function returns true.  If more than one input file is
    specified, a line of cumulative counts for all the files is displayed on
    a separate line after the output for the last file.

     The following options are available:

    -c      The number of bytes in each input file is written to the standard
            output.  This will cancel out any prior usage of the -m option.

     -l      The number of lines in each input file is written to the standard
             output.

     -m      The number of characters in each input file is written to the
            standard output.  If the current locale does not support multi-
            byte characters, this is equivalent to the -c option.  This will
            cancel out any prior usage of the -c option.

    -w      The number of words in each input file is written to the standard
            output.

In addition to quickly view the man page for any command, just right-click on the command in the Terminal and choose Open man Page from the context menu. A new window will pop up, displaying the manual for that command.

manpage

So the command wc (word count) can be used to count words, lines, characters, or bytes depending on the option used. For example for a file containing SMILES strings the following command tells us how many lines (and hence the number of molecules). In this case the 4GB file contains 66,783,025 molecules.

MacPro:~ Chris$ wc -l /Users/Chris/Desktop/pubchem_2015_03_via_mol2img.smi 
66783025 /Users/Chris/Desktop/pubchem_2015_03_via_mol2img.smi

If instead of counting lines we count words using the -w option, we get a different answer. This is because SMILES string can contain salts, counterions etc.

MacPro:~ Chris$ wc -w /Users/Chris/Desktop/pubchem_2015_03_via_mol2img.smi 
 133566050 /Users/Chris/Desktop/pubchem_2015_03_via_mol2img.smi

All the structures from ChEMBL are available for download as a 3.5 GB sdf file. Attempting to open such a file in a desktop text editor would not be recommended, however using a few UNIX commends we can interrogate the file to get useful information. For example the following tells us how many lines there are in the file.

Chris$ wc -l /Users/Chris/Desktop/chembl_22.sdf 
 119893133 /Users/Chris/Desktop/chembl_22.sdf

However in this case since each molecule record in a sdf file has multiple lines we don't know how many structures there are. We can however look at a small portion of the file using the head command

MacPro:~ Chris$ man head
HEAD(1)                   BSD General Commands Manual                  HEAD(1)

NAME
     head -- display first lines of a file

SYNOPSIS
     head [-n count | -c bytes] [file ...]

DESCRIPTION
    This filter displays the first count lines or bytes of each of the speci-
   fied files, or of the standard input if no files are specified.  If count
   is omitted it defaults to 10.

So if we type

MacPro:~ Chris$ head -50 /Users/Chris/Desktop/chembl_22.sdf 
CHEMBL153534
  SciTegic12231509382D

 16 17  0  0  0  0            999 V2000
    7.6140  -22.2702    0.0000 C   0  0
    5.7047  -23.1991    0.0000 C   0  0
    6.1806  -22.5282    0.0000 N   0  0
    6.9604  -22.7690    0.0000 C   0  0
    4.8790  -23.2163    0.0000 N   0  0
    8.2791  -21.1119    0.0000 N   0  0
    7.5280  -21.4445    0.0000 C   0  0
    8.4225  -22.4364    0.0000 C   0  0
    8.8353  -21.7198    0.0000 C   0  0
    6.2035  -23.8527    0.0000 S   0  0
    4.0534  -23.2163    0.0000 C   0  0
    6.9776  -23.5889    0.0000 C   0  0
    3.6406  -22.4938    0.0000 N   0  0
    3.6406  -23.9215    0.0000 N   0  0
    8.4397  -20.3035    0.0000 C   0  0
    9.6552  -21.6280    0.0000 C   0  0
  2  3  2  0
  3  4  1  0
 4  1  1  0
  5  2  1  0
  6  7  1  0
  7  1  2  0
  8  1  1  0
  9  8  2  0
 10 12  1  0
 11  5  2  0
 12  4  2  0
 13 11  1  0
 14 11  1  0
 15  6  1  0
 16  9  1  0
  6  9  1  0
 10  2  1  0
M  END
> <chembl_id>
CHEMBL153534

$$$$
CHEMBL440060
          11280716012D 1   1.00000     0.00000     0

206208  0     1  0            999 V2000
    2.0792   -0.9500    0.0000 N   0  0  0  0  0  0           0  0  0
   1.4671   -0.5958    0.0000 C   0  0  2  0  0  0           0  0  0
    0.8510   -0.9500    0.0000 C   0  0  0  0  0  0           0  0  0
    0.8510   -1.6583    0.0000 O   0  0  0  0  0  0           0  0  0

We can now see that each molecule record is separated by $$$$ and we can use this to identify how many molecules are in the file using grep and counting the number of occurrences.

MacPro:~ Chris$ man grep

SYNOPSIS
     grep [-abcdDEFGHhIiJLlmnOopqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
          [-e pattern] [-f file] [--binary-files=value] [--color[=when]]
          [--colour[=when]] [--context[=num]] [--label] [--line-buffered]
          [--null] [pattern] [file ...]

DESCRIPTION
     The grep utility searches any given input files, selecting lines that
     match one or more patterns.  By default, a pattern matches an input line
     if the regular expression (RE) in the pattern matches the input line
     without its trailing newline.  An empty expression matches every line.
     Each input line that matches at least one of the patterns is written to
     the standard output.

So if we type this in the terminal window.

MacPro:~ Chris$ grep -c '$$$$' /Users/Chris/Desktop/chembl_22.sdf 
1678393

In a similar manner we can use tail to look at the end of the file.

MacPro:~ Chris$ tail -n10 /Users/Chris/Desktop/chembl_22.sdf 
 22 24  1  0
  8 20  1  0
  8 17  1  0
 54 55  2  0
  8 60  1  6
M  END
> <chembl_id>
CHEMBL3706412

$$$$

Sometimes you just want to test a new tool on a subset of a large file, in this case we can use head and pipe the results into a new file.

MacPro:~ Chris$ head -10000 /Users/Chris/Desktop/pubchem_2015_03_via_mol2img.smi  > /Users/Chris/Desktop/UnixLargeFiles/test.txt

Dealing with Errors in very large files

Sometimes files contain non-printable ASCII characters and it can be hard to track these down. Whilst you can use BBEdit to zap these characters but for larger files this command might be a better option.

You can use this Perl command to strip these characters by piping your file through it:

perl -pi.bak -e 's/[\000-\007\013-\037\177-\377]//g;'

If that does not work the best option is often to simply split a file into smaller files

MacPro:~ Chris$ man split


 -l line_count
         Create smaller files n lines in length.

 -p pattern
         The file is split whenever an input line matches pattern, which
         is interpreted as an extended regular expression.  The matching
         line will be the first line of the next output file.  This option
         is incompatible with the -b and -l options.

 If additional arguments are specified, the first is used as the name of
 the input file which is to be split.  If a second additional argument is
 specified, it is used as a prefix for the names of the files into which
 the file is split.  In this case, each file into which the file is split
 is named by the prefix followed by a lexically ordered suffix using
 suffix_length characters in the range ``a-z''.  If -a is not specified,
 two letters are used as the suffix.

 If the name argument is not specified, the file is split into lexically
 ordered files named with the prefix ``x'' and with suffixes as above.

We can split the file test.txt created earlier

MacPro:~ Chris$ split -l 100 /Users/Chris/Desktop/UnixLargeFiles/test.txt /Users/Chris/Desktop/UnixLargeFiles/splitfile

splitfile

Once you have worked out where the error is and corrected it the files can be joined back together using cat (concatenate). The command below concatenates all the splitfile* (using * as a wild card) text files and writes them to a new file

MacPro:~ Chris$ cat /Users/Chris/Desktop/UnixLargeFiles/splitfile* > /Users/Chris/Desktop/UnixLargeFiles/correctedfile.txt

correctedfile

You can then use wc to check the concatenated file has the correct number of lines.

Editing Files

cat is also the swiss army knife for file manipulation, combined with tr

MacPro:~ Chris$ man tr

TR(1)                     BSD General Commands Manual                    TR(1)

NAME
     tr -- translate characters

SYNOPSIS
     tr [-Ccsu] string1 string2
     tr [-Ccu] -d string1
     tr [-Ccu] -s string1
     tr [-Ccu] -ds string1 string2

DESCRIPTION
     The tr utility copies the standard input to the standard output with substitution or deletion of selected characters.

To convert a tab delimited file into commas

cat tab_delimited.txt | tr "\\t" "," comma_delimited.csv

Monitoring output files

The command tail has two special command line option -f and -F (follow) that allows a file to be monitored. Instead of just displaying the last few lines and exiting, tail displays the lines and then monitors the file. As new lines are added to the file by another process, tail updates the display. This very useful if you are monitoring a log file to check for errors when processing a very large file.

tail -f /Users/Chris/Desktop/UnixLargeFiles/output.log

Checking for duplicate structures

One of the issues with combining multiple data sets is there is always the risk of duplicate structures. In order to check for this you need a unique identifier for each molecular structure, I tend to use InChIKey.

The IUPAC International Chemical Identifier InChI is a textual identifier for chemical substances, designed to provide a standard way to encode molecular information and to facilitate the search for such information in databases and on the web. The condensed, 27 character InChIKey is a hashed version of the full InChI (using the SHA-256 algorithm).

First use OpenBabel to generate the InChiKey file

MacPro:~ Chris$ /usr/local/bin/babel  -ismi   '/Users/Chris/Desktop/UnixLargeFiles/withdupsfile.smi'  -oinchikey   '/Users/Chris/Desktop/UnixLargeFiles/output.inchikey'

We can check the file using head

MacPro:~ Chris$ head -10 /Users/Chris/Desktop/UnixLargeFiles/output.inchikey 
RDHQFKQIGNGIED-UHFFFAOYSA-N
RDHQFKQIGNGIED-UHFFFAOYSA-O
INCSWYKICIYAHB-UHFFFAOYSA-N
HXKKHQJGJAFBHI-UHFFFAOYSA-N
HIQNVODXENYOFK-UHFFFAOYSA-N
VYZAHLCBVHPDDF-UHFFFAOYSA-N
MUIPLRMGAXZWSQ-UHFFFAOYSA-N
PDGXJDXVGMHUIR-UHFFFAOYSA-N
INAPMGSXUVUWAF-UHFFFAOYSA-N
AUFGTPPARQZWDO-UHFFFAOYSA-N

We now need to sort the file

MacPro:~ Chris$ sort /Users/Chris/Desktop/UnixLargeFiles/output.inchikey -o /Users/Chris/Desktop/UnixLargeFiles/sortedoutput.inchikey

Now we can use uniq to identify duplicate structures, it is important to note uniq filters out adjacent, matching lines which is why we need to sort first.

MacPro:~ Chris$ uniq -c /Users/Chris/Desktop/UnixLargeFiles/sortedoutput.inchikey
   1 AAAFZMYJJHWUPN-UHFFFAOYSA-N
   1 AAALVYBICLMAMA-UHFFFAOYSA-N
   1 AANFHDFOMFRLLR-UHFFFAOYSA-N
   3 AAOVKJBEBIDNHE-UHFFFAOYSA-N
   1 AAQOQKQBGPPFNS-UHFFFAOYSA-N
   1 AAWZDTNXLSGCEK-UHFFFAOYSA-N
   1 AAZMHPMNAVEBRE-UHFFFAOYSA-M
   1 AAZMHPMNAVEBRE-UHFFFAOYSA-N
   1 ABBADGFSRBWENF-UHFFFAOYSA-N
   2 ABCOOORLYAOBOZ-UHFFFAOYSA-N
   2 ABETXLPRGREJND-UHFFFAOYSA-M

Alternative options for uniq are

-d Only output lines that are repeated in the input. 
-u Only output lines that are not repeated in the input.

Extracting data from delimited files

Sometimes you have very large files containing multiple columns of data, but you really want is a single column. The file is too big to open in a spreadsheet application but you can extract just the column you want using a simple command. First use head to identify the column of data you want.

MacPro:~ Chris$ head -3 /Users/Chris/Desktop/UnixLargeFiles/alldata.csv 
StructureID,chembl_id,MW,XLogP,HBA,HBD,HAC,RotBondCount,SMILES
0,CHEMBL153534,235.31,1.97,5,2,16,2,Cc1cc(cn1C)-c1csc(n1)N=C(N)N
978,CHEMBL440060,2883.34,-12.48,78,40,202,95,CC[C@H](C)[C@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@@H](NC(=O)[C@@H](N)CCSC)[C@@H](C)O)C(=O)NCC(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](Cc1c[nH]cn1)C(=O)N[C@@H](CC(=O)N)C(=O)NCC(=O)N[C@@H]  (C)C(=O)N[C@@H](C)C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCN=C(N)N)C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCN=C(N)N)C(=O)NCC(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H] (CC(C)C)C(=O)NCC(=O)N1CCC[C@H]1C(=O)N1CCC[C@H]1C(=O)NCC(=O)N[C@@H](CO)C(=O)N[C@@H](CCCN=C(N)N)C(=O)N

The column we want is SMILES, which is column 9. We can use cat to extract just this column (replace f9 with the column you want).

MacPro:~ Chris$ cat /Users/Chris/Desktop/UnixLargeFiles/alldata.csv |cut -d ',' -f9 
SMILES
Cc1cc(cn1C)-c1csc(n1)N=C(N)N
CC[C@H](C)[C@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@@H](NC(=O)[C@@H](N)CCSC)[C@@H](C)O)C(=O)NCC(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](Cc1c[nH]cn1)C(=O)N[C@@H](CC(=O)N)C(=O)NCC(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H] (CCC(=O)N)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCN=C(N)N)C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCN=C(N)N)C(=O)NCC(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H]   (CC(C)C)C(=O)NCC(=O)N1CCC[C@H]1C(=O)N1CCC[C@H]1C(=O)NCC(=O)N[C@@H](CO)C(=O)N[C@@H](CCCN=C(N)N)C(=O)N

If you want to save the output to a file

MacPro:~ Chris$ cat /Users/Chris/Desktop/UnixLargeFiles/alldata.csv |cut -d ',' -f9 > /Users/Chris/Desktop/UnixLargeFiles/justSMILES.smi

Dealing with a large number of files

A suggestion from a reader. Sometimes rather than one large file download sites provide the data as a large number of individual files. We can keep track of the number of files using this simple command. In my testing this works fine, but will not count hidden files, and will miscount files that contain space/line breaks etc. in file names.

MacPro:~ Chris$ ls | wc -1
177248

If anyone has any additional suggestions please feel free to submit them.

Last updated 10 June 2018