Unix commands for helping deal with very large files
I'm regularly handling very large files containing millions for chemical structures and whilst BBEdit is my usual tool for editing text files in practice it becomes rather cumbersome for really large files (> 2 GB). In these cases I've compiled a useful list of UNIX commands that make life easier. Whilst I use them when dealing with large chemical structure files they are equally useful when dealing with any large text or data files.
If you start up the Terminal application which is in the Utilities folder in the Applications folder. You should see a window something like this:-
The shell prompt (or command line) is where one types commands, the appearance may differ slightly depending on customisation.
File and Path Names
One thing to remember is that UNIX is less accommodating than Mac OS X when it comes to file name and the paths to files. When you see the full path to a file it will be shown as a list of directory names starting from the top of the tree right the way down to the level where your file sits. It will look something like this.
Directory names can contain spaces but you’ll often find that people working in Unix prefer to keep both file and directory names without spaces since this simplifies the process of working with these files on a command line. Directory or file names containing a space must be specified on the command line either by putting quotes around them, or by putting a \ (backslash) before the space.
'/Users/Username/Projects/bioconda recipes/recipes/rdkit/2016.03.3/meta.yaml '
If typing out the full path seems rather daunting remember if you drag a file in the Finder and drop it on a Terminal window the path to the file will be automatically generated.
Examining Large files
The other thing to remember is that there is an extensive help available, so if you can't remember the command line arguments "man" can provide the answers. Simply type man followed by the command.
MacPro:~ Chris$ man wc WC(1) BSD General Commands Manual WC(1) NAME wc -- word, line, character, and byte count SYNOPSIS wc [-clmw] [file ...] DESCRIPTION The wc utility displays the number of lines, words, and bytes contained in each input file, or standard input (if no file is specified) to the standard output. A line is defined as a string of characters delimited by a <newline> character. Characters beyond the final <newline> charac- ter will not be included in the line count. A word is defined as a string of characters delimited by white space characters. White space characters are the set of characters for which the iswspace(3) function returns true. If more than one input file is specified, a line of cumulative counts for all the files is displayed on a separate line after the output for the last file. The following options are available: -c The number of bytes in each input file is written to the standard output. This will cancel out any prior usage of the -m option. -l The number of lines in each input file is written to the standard output. -m The number of characters in each input file is written to the standard output. If the current locale does not support multi- byte characters, this is equivalent to the -c option. This will cancel out any prior usage of the -c option. -w The number of words in each input file is written to the standard output.
In addition to quickly view the man page for any command, just right-click on the command in the Terminal and choose Open man Page from the context menu. A new window will pop up, displaying the manual for that command.
So the command wc (word count) can be used to count words, lines, characters, or bytes depending on the option used. For example for a file containing SMILES strings the following command tells us how many lines (and hence the number of molecules). In this case the 4GB file contains 66,783,025 molecules.
MacPro:~ Chris$ wc -l /Users/Chris/Desktop/pubchem_2015_03_via_mol2img.smi 66783025 /Users/Chris/Desktop/pubchem_2015_03_via_mol2img.smi
If instead of counting lines we count words using the -w option, we get a different answer. This is because SMILES string can contain salts, counterions etc.
MacPro:~ Chris$ wc -w /Users/Chris/Desktop/pubchem_2015_03_via_mol2img.smi 133566050 /Users/Chris/Desktop/pubchem_2015_03_via_mol2img.smi
All the structures from ChEMBL are available for download as a 3.5 GB sdf file. Attempting to open such a file in a desktop text editor would not be recommended, however using a few UNIX commends we can interrogate the file to get useful information. For example the following tells us how many lines there are in the file.
Chris$ wc -l /Users/Chris/Desktop/chembl_22.sdf 119893133 /Users/Chris/Desktop/chembl_22.sdf
However in this case since each molecule record in a sdf file has multiple lines we don't know how many structures there are. We can however look at a small portion of the file using the head command
MacPro:~ Chris$ man head HEAD(1) BSD General Commands Manual HEAD(1) NAME head -- display first lines of a file SYNOPSIS head [-n count | -c bytes] [file ...] DESCRIPTION This filter displays the first count lines or bytes of each of the speci- fied files, or of the standard input if no files are specified. If count is omitted it defaults to 10.
So if we type
MacPro:~ Chris$ head -50 /Users/Chris/Desktop/chembl_22.sdf CHEMBL153534 SciTegic12231509382D 16 17 0 0 0 0 999 V2000 7.6140 -22.2702 0.0000 C 0 0 5.7047 -23.1991 0.0000 C 0 0 6.1806 -22.5282 0.0000 N 0 0 6.9604 -22.7690 0.0000 C 0 0 4.8790 -23.2163 0.0000 N 0 0 8.2791 -21.1119 0.0000 N 0 0 7.5280 -21.4445 0.0000 C 0 0 8.4225 -22.4364 0.0000 C 0 0 8.8353 -21.7198 0.0000 C 0 0 6.2035 -23.8527 0.0000 S 0 0 4.0534 -23.2163 0.0000 C 0 0 6.9776 -23.5889 0.0000 C 0 0 3.6406 -22.4938 0.0000 N 0 0 3.6406 -23.9215 0.0000 N 0 0 8.4397 -20.3035 0.0000 C 0 0 9.6552 -21.6280 0.0000 C 0 0 2 3 2 0 3 4 1 0 4 1 1 0 5 2 1 0 6 7 1 0 7 1 2 0 8 1 1 0 9 8 2 0 10 12 1 0 11 5 2 0 12 4 2 0 13 11 1 0 14 11 1 0 15 6 1 0 16 9 1 0 6 9 1 0 10 2 1 0 M END > <chembl_id> CHEMBL153534 $$$$ CHEMBL440060 11280716012D 1 1.00000 0.00000 0 206208 0 1 0 999 V2000 2.0792 -0.9500 0.0000 N 0 0 0 0 0 0 0 0 0 1.4671 -0.5958 0.0000 C 0 0 2 0 0 0 0 0 0 0.8510 -0.9500 0.0000 C 0 0 0 0 0 0 0 0 0 0.8510 -1.6583 0.0000 O 0 0 0 0 0 0 0 0 0
We can now see that each molecule record is separated by $$$$ and we can use this to identify how many molecules are in the file using grep and counting the number of occurrences.
MacPro:~ Chris$ man grep SYNOPSIS grep [-abcdDEFGHhIiJLlmnOopqRSsUVvwxZ] [-A num] [-B num] [-C[num]] [-e pattern] [-f file] [--binary-files=value] [--color[=when]] [--colour[=when]] [--context[=num]] [--label] [--line-buffered] [--null] [pattern] [file ...] DESCRIPTION The grep utility searches any given input files, selecting lines that match one or more patterns. By default, a pattern matches an input line if the regular expression (RE) in the pattern matches the input line without its trailing newline. An empty expression matches every line. Each input line that matches at least one of the patterns is written to the standard output.
So if we type this in the terminal window.
MacPro:~ Chris$ grep -c '$$$$' /Users/Chris/Desktop/chembl_22.sdf 1678393
In a similar manner we can use tail to look at the end of the file.
MacPro:~ Chris$ tail -n10 /Users/Chris/Desktop/chembl_22.sdf 22 24 1 0 8 20 1 0 8 17 1 0 54 55 2 0 8 60 1 6 M END > <chembl_id> CHEMBL3706412 $$$$
Sometimes you just want to test a new tool on a subset of a large file, in this case we can use head and pipe the results into a new file.
MacPro:~ Chris$ head -10000 /Users/Chris/Desktop/pubchem_2015_03_via_mol2img.smi > /Users/Chris/Desktop/UnixLargeFiles/test.txt
Dealing with Errors in very large files
Sometimes files contain non-printable ASCII characters and it can be hard to track these down. Whilst you can use BBEdit to zap these characters but for larger files this command might be a better option.
You can use this Perl command to strip these characters by piping your file through it:
perl -pi.bak -e 's/[\000-\007\013-\037\177-\377]//g;'
If that does not work the best option is often to simply split a file into smaller files
MacPro:~ Chris$ man split -l line_count Create smaller files n lines in length. -p pattern The file is split whenever an input line matches pattern, which is interpreted as an extended regular expression. The matching line will be the first line of the next output file. This option is incompatible with the -b and -l options. If additional arguments are specified, the first is used as the name of the input file which is to be split. If a second additional argument is specified, it is used as a prefix for the names of the files into which the file is split. In this case, each file into which the file is split is named by the prefix followed by a lexically ordered suffix using suffix_length characters in the range ``a-z''. If -a is not specified, two letters are used as the suffix. If the name argument is not specified, the file is split into lexically ordered files named with the prefix ``x'' and with suffixes as above.
We can split the file test.txt created earlier
MacPro:~ Chris$ split -l 100 /Users/Chris/Desktop/UnixLargeFiles/test.txt /Users/Chris/Desktop/UnixLargeFiles/splitfile
Once you have worked out where the error is and corrected it the files can be joined back together using cat (concatenate). The command below concatenates all the splitfile* (using * as a wild card) text files and writes them to a new file
MacPro:~ Chris$ cat /Users/Chris/Desktop/UnixLargeFiles/splitfile* > /Users/Chris/Desktop/UnixLargeFiles/correctedfile.txt
You can then use wc to check the concatenated file has the correct number of lines.
cat is also the swiss army knife for file manipulation, combined with tr
MacPro:~ Chris$ man tr TR(1) BSD General Commands Manual TR(1) NAME tr -- translate characters SYNOPSIS tr [-Ccsu] string1 string2 tr [-Ccu] -d string1 tr [-Ccu] -s string1 tr [-Ccu] -ds string1 string2 DESCRIPTION The tr utility copies the standard input to the standard output with substitution or deletion of selected characters.
To convert a tab delimited file into commas
cat tab_delimited.txt | tr "\\t" "," comma_delimited.csv
Monitoring output files
The command tail has two special command line option -f and -F (follow) that allows a file to be monitored. Instead of just displaying the last few lines and exiting, tail displays the lines and then monitors the file. As new lines are added to the file by another process, tail updates the display. This very useful if you are monitoring a log file to check for errors when processing a very large file.
tail -f /Users/Chris/Desktop/UnixLargeFiles/output.log
Checking for duplicate structures
One of the issues with combining multiple data sets is there is always the risk of duplicate structures. In order to check for this you need a unique identifier for each molecular structure, I tend to use InChIKey.
The IUPAC International Chemical Identifier InChI is a textual identifier for chemical substances, designed to provide a standard way to encode molecular information and to facilitate the search for such information in databases and on the web. The condensed, 27 character InChIKey is a hashed version of the full InChI (using the SHA-256 algorithm).
First use OpenBabel to generate the InChiKey file
MacPro:~ Chris$ /usr/local/bin/babel -ismi '/Users/Chris/Desktop/UnixLargeFiles/withdupsfile.smi' -oinchikey '/Users/Chris/Desktop/UnixLargeFiles/output.inchikey'
We can check the file using head
MacPro:~ Chris$ head -10 /Users/Chris/Desktop/UnixLargeFiles/output.inchikey RDHQFKQIGNGIED-UHFFFAOYSA-N RDHQFKQIGNGIED-UHFFFAOYSA-O INCSWYKICIYAHB-UHFFFAOYSA-N HXKKHQJGJAFBHI-UHFFFAOYSA-N HIQNVODXENYOFK-UHFFFAOYSA-N VYZAHLCBVHPDDF-UHFFFAOYSA-N MUIPLRMGAXZWSQ-UHFFFAOYSA-N PDGXJDXVGMHUIR-UHFFFAOYSA-N INAPMGSXUVUWAF-UHFFFAOYSA-N AUFGTPPARQZWDO-UHFFFAOYSA-N
We now need to sort the file
MacPro:~ Chris$ sort /Users/Chris/Desktop/UnixLargeFiles/output.inchikey -o /Users/Chris/Desktop/UnixLargeFiles/sortedoutput.inchikey
Now we can use uniq to identify duplicate structures, it is important to note uniq filters out adjacent, matching lines which is why we need to sort first.
MacPro:~ Chris$ uniq -c /Users/Chris/Desktop/UnixLargeFiles/sortedoutput.inchikey 1 AAAFZMYJJHWUPN-UHFFFAOYSA-N 1 AAALVYBICLMAMA-UHFFFAOYSA-N 1 AANFHDFOMFRLLR-UHFFFAOYSA-N 3 AAOVKJBEBIDNHE-UHFFFAOYSA-N 1 AAQOQKQBGPPFNS-UHFFFAOYSA-N 1 AAWZDTNXLSGCEK-UHFFFAOYSA-N 1 AAZMHPMNAVEBRE-UHFFFAOYSA-M 1 AAZMHPMNAVEBRE-UHFFFAOYSA-N 1 ABBADGFSRBWENF-UHFFFAOYSA-N 2 ABCOOORLYAOBOZ-UHFFFAOYSA-N 2 ABETXLPRGREJND-UHFFFAOYSA-M
Alternative options for uniq are
-d Only output lines that are repeated in the input. -u Only output lines that are not repeated in the input.
Extracting data from delimited files
Sometimes you have very large files containing multiple columns of data, but you really want is a single column. The file is too big to open in a spreadsheet application but you can extract just the column you want using a simple command. First use head to identify the column of data you want.
MacPro:~ Chris$ head -3 /Users/Chris/Desktop/UnixLargeFiles/alldata.csv StructureID,chembl_id,MW,XLogP,HBA,HBD,HAC,RotBondCount,SMILES 0,CHEMBL153534,235.31,1.97,5,2,16,2,Cc1cc(cn1C)-c1csc(n1)N=C(N)N 978,CHEMBL440060,2883.34,-12.48,78,40,202,95,CC[C@H](C)[C@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@@H](NC(=O)[C@@H](N)CCSC)[C@@H](C)O)C(=O)NCC(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](Cc1c[nH]cn1)C(=O)N[C@@H](CC(=O)N)C(=O)NCC(=O)N[C@@H] (C)C(=O)N[C@@H](C)C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCN=C(N)N)C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCN=C(N)N)C(=O)NCC(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H] (CC(C)C)C(=O)NCC(=O)N1CCC[C@H]1C(=O)N1CCC[C@H]1C(=O)NCC(=O)N[C@@H](CO)C(=O)N[C@@H](CCCN=C(N)N)C(=O)N
The column we want is SMILES, which is column 9. We can use cat to extract just this column (replace f9 with the column you want).
MacPro:~ Chris$ cat /Users/Chris/Desktop/UnixLargeFiles/alldata.csv |cut -d ',' -f9 SMILES Cc1cc(cn1C)-c1csc(n1)N=C(N)N CC[C@H](C)[C@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@@H](NC(=O)[C@@H](N)CCSC)[C@@H](C)O)C(=O)NCC(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](Cc1c[nH]cn1)C(=O)N[C@@H](CC(=O)N)C(=O)NCC(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H] (CCC(=O)N)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCN=C(N)N)C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCN=C(N)N)C(=O)NCC(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H] (CC(C)C)C(=O)NCC(=O)N1CCC[C@H]1C(=O)N1CCC[C@H]1C(=O)NCC(=O)N[C@@H](CO)C(=O)N[C@@H](CCCN=C(N)N)C(=O)N
If you want to save the output to a file
MacPro:~ Chris$ cat /Users/Chris/Desktop/UnixLargeFiles/alldata.csv |cut -d ',' -f9 > /Users/Chris/Desktop/UnixLargeFiles/justSMILES.smi
Dealing with a large number of files
A suggestion from a reader. Sometimes rather than one large file download sites provide the data as a large number of individual files. We can keep track of the number of files using this simple command. In my testing this works fine, but will not count hidden files, and will miscount files that contain space/line breaks etc. in file names.
MacPro:~ Chris$ ls | wc -1 177248
If anyone has any additional suggestions please feel free to submit them.
Last updated 10 June 2018