Macs in Chemistry

Insanely great science

 

Adding titles to molecules in an sdf file

The current COVID-19 pandemic has resulted in many groups sharing chemical structure files; however, it is useful to understand the file format definition to do so accurately.

There are many chemistry file formats; iBabel, which uses the OpenBabel toolkit, supports around 100 of them. One of the most popular is the sdf format (or MDL mol file, for a single molecule). It is a plain text format that you can open and examine using a text editor. Compound records in an sdf file are separated by $$$$, and each compound record contains several distinct sections.
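As a minimal sketch of the record structure described above, the following Python splits sdf text into compound records on the $$$$ delimiter lines. The function name `split_sdf_records` and the toy two-record input are illustrative assumptions, not part of iBabel or OpenBabel.

```python
def split_sdf_records(text):
    """Split sdf text into compound records using the $$$$ delimiter lines."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Keep a trailing record that lacks a final $$$$ line, if it has content.
    if any(item.strip() for item in current):
        records.append("\n".join(current))
    return records


# Toy input (hypothetical): two heavily abbreviated records.
sample = "molA\n\n\nM  END\n$$$$\nmolB\n\n\nM  END\n$$$$\n"
records = split_sdf_records(sample)
print(len(records))
```

For real work a cheminformatics toolkit is the safer choice; this only illustrates how the delimiter separates records.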

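The first section is a three-line header block, any or all of which may be left blank. It contains the molecule title, a line identifying the software that generated the structure, and a comment line. An example record is shown below.
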

MOE2019           2D

 22 24  0  0  0  0  0  0  0  0999 V2000
    7.2040   -6.7290    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.3790   -6.7290    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    7.6170   -7.4420    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    5.9670   -7.4420    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.3880   -8.1540    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.2120   -8.1620    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   10.0920   -6.0040    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    7.6170   -6.0120    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    5.1420   -7.4500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   10.9170   -6.0040    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.4420   -6.0040    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.8540   -5.2920    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.8540   -6.7250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    9.6790   -5.2920    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    9.6880   -6.7250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   11.3290   -5.2790    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.6250   -8.8750    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   12.1540   -5.2880    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   10.9170   -4.5670    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   12.5620   -4.5670    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   11.3250   -3.8540    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   12.1500   -3.8540    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  2  0  0  0  0
  1  8  1  0  0  0  0
  2  4  1  0  0  0  0
  3  6  1  0  0  0  0
  4  5  1  0  0  0  0
  4  9  2  0  0  0  0
  5  6  2  0  0  0  0
  6 17  1  0  0  0  0
  7 10  1  0  0  0  0
  7 14  1  0  0  0  0
  7 15  1  0  0  0  0
  8 11  1  0  0  0  0
 10 16  1  0  0  0  0
 11 12  1  0  0  0  0
 11 13  1  0  0  0  0
 12 14  1  0  0  0  0
 13 15  1  0  0  0  0
 16 18  2  0  0  0  0
 16 19  1  0  0  0  0
 18 20  1  0  0  0  0
 19 21  2  0  0  0  0
 20 22  2  0  0  0  0
 21 22  1  0  0  0  0
M  END
>  <IDNUMBER>
mymol123

$$$$

The next section is the "counts" line. This line is made up of twelve fixed-length fields: the first eleven are three characters long, and the last is six characters long. The first two fields give the number of atoms and bonds described in the compound.

 22 24  0  0  0  0  0  0  0  0999 V2000
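Because the counts line is fixed-width, it can be read by slicing rather than by splitting on spaces. The sketch below assumes the line keeps its leading spaces, as it does in the full record; the function name `parse_counts_line` is hypothetical.

```python
def parse_counts_line(line):
    """Read atom count, bond count and version from a V2000 counts line."""
    line = line.ljust(39)  # eleven 3-character fields plus one 6-character field
    return {
        "atoms": int(line[0:3]),
        "bonds": int(line[3:6]),
        "version": line[33:39].strip(),
    }


counts = parse_counts_line(" 22 24  0  0  0  0  0  0  0  0999 V2000")
print(counts)
```

Slicing matters because adjacent fields can run together, as "0999" does here, so splitting on whitespace would misread the line.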

The next section is the atoms block. The first three fields are the x, y and z coordinates (z = 0.0000 for 2D molecules), followed by the atom symbol.

    7.2040   -6.7290    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
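An atom line can be read the same way. This sketch assumes the V2000 layout of three 10-character coordinate fields, one space, then a 3-character atom symbol; `parse_atom_line` is an illustrative name.

```python
def parse_atom_line(line):
    """Return (x, y, z, symbol) from a V2000 atom line."""
    x = float(line[0:10])
    y = float(line[10:20])
    z = float(line[20:30])
    symbol = line[31:34].strip()
    return x, y, z, symbol


atom = parse_atom_line(
    "    7.2040   -6.7290    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0"
)
print(atom)
```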

After the atoms block comes the bonds block, which specifies the bonds between the atoms. The first two fields are the indexes of the atoms included in the bond (numbered from 1). The third field defines the bond type and the fourth the stereochemistry of the bond.

  1  2  1  0  0  0  0
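A bond line is made up of 3-character fields, so a short sketch of reading the first four might look like this. The function name and dictionary keys are assumptions chosen for readability.

```python
def parse_bond_line(line):
    """Return the two atom indexes, bond type and stereo flag of a V2000 bond line."""
    return {
        "first_atom": int(line[0:3]),   # 1-based index into the atoms block
        "second_atom": int(line[3:6]),  # 1-based index into the atoms block
        "bond_type": int(line[6:9]),    # 1 = single, 2 = double, 3 = triple
        "stereo": int(line[9:12]),
    }


bond = parse_bond_line("  1  2  1  0  0  0  0")
print(bond)
```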

The "M  END" line is essential: it marks the end of the molecule's properties list.

After that it is possible to add custom data fields. Each data field starts with a header line giving its name, followed by one or more lines of up to 200 characters of free text, which form the value of the data field. In this case there is an identifier for the molecule called IDNUMBER and the value is mymol123.

>  <IDNUMBER>
mymol123
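To make the data field layout concrete, here is a minimal sketch that collects the data fields of a single record into a dictionary. It assumes each header line starts with ">" and carries the name in angle brackets, and that a blank line ends the value. The name `parse_data_fields` is hypothetical, and the abbreviated record has no atoms or bonds.

```python
import re


def parse_data_fields(record):
    """Map data field names to their (possibly multi-line) values."""
    fields, name = {}, None
    for line in record.splitlines():
        header = re.match(r">.*<([^>]+)>", line)
        if header:
            name = header.group(1)
            fields[name] = []
        elif name is not None:
            if line.strip() == "":
                name = None  # a blank line closes the current data field
            else:
                fields[name].append(line)
    return {key: "\n".join(value) for key, value in fields.items()}


# Abbreviated record: header block, empty counts line, M  END, one data field.
record = (
    "\n  MOE2019           2D\n\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\n"
    "M  END\n>  <IDNUMBER>\nmymol123\n\n"
)
data = parse_data_fields(record)
print(data)
```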

In an sdf file each molecule record can have different data fields, so you need to read the entire file to identify all the data fields that are present.
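The need to scan the whole file can be sketched as follows: walk every line, pick out the data field headers, and keep each name the first time it appears. This is only an illustration of the idea, not how iBabel or OpenBabel list fields; the second record's "mymol124" and "MWT" entries are invented for the example.

```python
import re


def list_data_fields(sdf_text):
    """Return every data field name in the file, in order of first appearance."""
    names = []
    for line in sdf_text.splitlines():
        header = re.match(r">.*<([^>]+)>", line)
        if header and header.group(1) not in names:
            names.append(header.group(1))
    return names


# Two abbreviated records; only the second carries the hypothetical MWT field.
empty_mol = "\n\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n"
sample = (
    empty_mol + ">  <IDNUMBER>\nmymol123\n\n$$$$\n"
    + empty_mol + ">  <IDNUMBER>\nmymol124\n\n>  <MWT>\n100.0\n\n$$$$\n"
)
field_names = list_data_fields(sample)
print(field_names)
```

Reading only the first record would have missed the second field, which is why the full pass is needed.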

When handling, searching, or combining sdf files it is often very useful to have a unique identifier to act as a key for each molecule. Many software packages expect this to be the first line, or title. Unfortunately this line is often left blank and the unique identifier is held in a data field that might be called "ID", "IDNUMBER", "MOL", "molNAME", and so on.

iBabel can be used to generate the title in these cases. Add the file for conversion to the input field, then click "List Fields"; the list of data fields in the file will appear in the obabel command box (highlighted in green).

You can now use the "Append to title" option and enter "IDNUMBER" into the text field to add the IDNUMBER value to the title. If you click the "Check" button the actual obabel command will appear in the window, and you can see that the --append option has been added. Click "Run" to run the command.

If we now open the resulting sdf file we can see that the title has been added, and that the line recording the software used to generate the molecular structure has been updated.

mymol123
 OpenBabel04142012102D

 22 24  0  0  0  0  0  0  0  0999 V2000
    7.2040   -6.7290    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.3790   -6.7290    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
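For readers who want to see what this transformation amounts to at the file level, here is a rough Python stand-in. It is not OpenBabel's implementation: it simply copies the value of a named data field into a blank title line and leaves existing titles alone. The function name `title_from_field` is an assumption, and the record is abbreviated to have no atoms or bonds.

```python
def title_from_field(sdf_text, field):
    """Fill blank title lines with the value of the given data field."""
    out = []
    for record in sdf_text.split("$$$$\n"):
        if not record.strip():
            continue  # skip the empty piece after the final $$$$
        lines = record.split("\n")
        value = None
        for i, line in enumerate(lines[:-1]):
            if line.startswith(">") and f"<{field}>" in line:
                value = lines[i + 1].strip()
                break
        # Only touch the title (first line) when it is blank.
        if value and not lines[0].strip():
            lines[0] = value
        out.append("\n".join(lines) + "$$$$\n")
    return "".join(out)


sample = (
    "\n  MOE2019           2D\n\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\n"
    "M  END\n>  <IDNUMBER>\nmymol123\n\n$$$$\n"
)
titled = title_from_field(sample, "IDNUMBER")
print(titled.split("\n")[0])
```

Unlike the obabel run, this sketch leaves the software line untouched, since it never re-generates the record.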

In some instances there is no unique identifier in the file. In these cases you can still use iBabel with a minor edit to the command. Select "Add title"; the text you enter in the box will be added to every record in the file, so we need an extra option to make each title unique. Click on the "Check" button and then add

--addoutindex

to the command, then click "Run".

The first few lines of the resulting file are shown below.

mol 1
 OpenBabel04142012242D

 22 24  0  0  0  0  0  0  0  0999 V2000
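The effect of combining a fixed title with an output index can be sketched in plain Python as well. This is an illustration rather than OpenBabel's own code: the function name `add_indexed_titles` is hypothetical, and the space between prefix and index simply mirrors the "mol 1" output shown above.

```python
def add_indexed_titles(sdf_text, prefix="mol"):
    """Set each record's title line to the prefix followed by a running index."""
    out = []
    index = 0
    for record in sdf_text.split("$$$$\n"):
        if not record.strip():
            continue  # skip the empty piece after the final $$$$
        index += 1
        lines = record.split("\n")
        lines[0] = f"{prefix} {index}"
        out.append("\n".join(lines) + "$$$$\n")
    return "".join(out)


# Two abbreviated records with blank titles and no data fields.
empty_mol = "\n\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n$$$$\n"
indexed = add_indexed_titles(empty_mol * 2)
titles = [r.split("\n")[0] for r in indexed.split("$$$$\n") if r.strip()]
print(titles)
```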

You can also use this process to add calculated properties, such as molecular weight (MWt), to the title.

Last Updated 15 April 2020