Macs in Chemistry

Insanely great science

 

A review of FAst MEtabolizer 2 (FAME2)

Whilst much computational work is undertaken to support library design, virtual screening, hit selection and affinity optimisation, the reality is that the most challenging issues to resolve in drug discovery often revolve around absorption, distribution, metabolism and excretion (ADME). Whilst we can measure the levels of parent drug in various media, tracking metabolic fate can often be a considerably more difficult proposition requiring significant resources. For this reason, prediction of sites of metabolism has become a subject of current interest.

The challenge of the problem should not be underestimated. There are multiple potential enzymatic reaction types, those that act directly on the drug (phase I) and those that further functionalize metabolites (phase II), and any drug can be a substrate for multiple enzymes. Predictions involving QM calculations or docking to the flexible binding sites of cytochrome P450 enzymes require prohibitive computing resources.

FAME (DOI) is a collection of random forest models trained on a comprehensive and highly diverse data set of 20,000 small molecules annotated with their experimentally determined sites of metabolism, taken from multiple species (rat, dog and human). In addition, dedicated models are available to predict sites of metabolism for phase I and phase II processes. Remarkably, this is achieved using only seven easily calculated descriptors (Table 1): six interpretable atomic descriptors (encoding the element type, hybridization state, and electronic configuration of each atom) and one molecular descriptor (encoding the topological size of a molecule).

FAME 2 (DOI) builds on this work to improve accuracy. In addition, FAME 2 uses a slightly modified version of the visualisation developed by Patrik Rydberg and implemented in SMARTCyp using ChemDoodle Web Components.

It is really useful to have two sites of metabolism tools available that use contrasting methodologies. FAME 2 uses a curated dataset of experimentally determined metabolism data to build a machine learning model from simple descriptors. In contrast, SMARTCyp uses precomputed activation energies from density functional theory (DFT) calculations on model compounds. These are used to predict the reactivity of similar fragments within the target molecule, and the final score is modified to reflect accessibility to the active site of the different CYP450 isoforms. Improvements for N-oxidations of tertiary amines are also included, specifically an empirical correction for unlikely oxidations of tertiary alkylamines.

In FAME 2, rather than the random forest machine learning algorithm used in the original method, an extremely randomised trees approach is used (DOI), which is a computationally efficient classification algorithm. FAME used a set of seven easily calculated 2D descriptors: six interpretable atomic descriptors (encoding the element type, hybridization state, and electronic configuration of each atom) and one molecular descriptor (encoding the topological size of a molecule). In contrast, FAME 2 uses circular descriptors of atoms and their environments. As can be seen in the help message below, it is possible to change how wide the encoded atom environment is, from 1 to 6 bonds, by selecting one of three models (circCDK_ATF_1, circCDK_4 or circCDK_ATF_6). The default 'circCDK_ATF_1' is a model based on the atom itself and its immediate neighbors (atoms at most one bond away).

java -jar /Users/Username/Downloads/fame2/fame2.jar -h
usage: fame2 [-h] [--version] [-m {circCDK_ATF_1,circCDK_4,circCDK_ATF_6}]
             [-s [SMILES [SMILES ...]]] [-o OUTPUT_DIRECTORY] [-p] [-c]
             [FILE [FILE ...]]

This is fame2. It  attempts  to  predict  sites  of metabolism for supplied
chemical compounds. It  includes  extra  trees  models for regioselectivity
prediction of some cytochrome P450 isoforms.

positional arguments:
  FILE                   One or more SDF  files  with compounds to predict.
                         One SDF can contain multiple compounds.
                         All molecules should be  neutral and have explicit
                         hydrogens added prior to  modelling.  If there are
                         still missing hydrogens, the  software will try to
                         add   them    automatically.Calculating    spatial
                         coordinates of atoms is not necessary.

optional arguments:
  -h, --help             show this help message and exit
  --version              Show program version.
  -m {circCDK_ATF_1,circCDK_4,circCDK_ATF_6}, --model {circCDK_ATF_1,circCDK_4,circCDK_ATF_6}
                         Model to use to generate predictions. 
                         Either   the   model   with   the   best   average
                         performance    ('circCDK_ATF_6')     during    the
                         independent test set  validation  as  performed in
                         the original paper or  one  of  the simpler models
                         that were  found  to  have  comparable performance
                         ('circCDK_ATF_1'     and     'circCDK_4').     The
                         'circCDK_ATF_1' model is  selected  by  default as
                         it  is  expected  to   offer  the  best  trade-off
                         between generalization and accuracy.
                         The number  after  the  model  code  indicates how
                         wide the encodedenvironment  of  an  atom  is. For
                         example, the default  'circCDK_ATF_1'  is  a model
                         based  on  the  atom   itself  and  its  immediate
                         neighbors  (atoms   at   most   one   bond  away).
                         (default: circCDK_ATF_1)
  -s [SMILES [SMILES ...]], --smiles [SMILES [SMILES ...]]
                         One  or  more  SMILES   strings  of  compounds  to
                         predict. 
                         All molecules should be  neutral and have explicit
                         hydrogens added prior to  modelling.  If there are
                         still missing hydrogens, the  software will try to
                         add them automatically.
  -o OUTPUT_DIRECTORY, --output-directory OUTPUT_DIRECTORY
                         The path to the  output  directory.  If it doesn't
                         exist, it will be created. (default: fame_results)
  -p, --depict-png       Generates  depictions   of   molecules   with  the
                         predicted  sites  highlighted  as   PNG  files  in
                        addition to the HTML output. (default: false)
  -c, --output-csv       Saves calculated  descriptors  and  predictions to
                         CSV files. (default: false)
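
To make the idea of an encoded environment "N bonds wide" concrete, here is a minimal sketch that collects the atoms within a given number of bonds of a starting atom using a breadth-first search. This is not FAME 2's descriptor code (which is built on CDK); the function name, the adjacency-list input and the toy molecule are all my own assumptions, used only for illustration.

```python
from collections import deque


def atom_environment(adjacency, start, max_bonds):
    """Return {atom: bond distance} for every atom within max_bonds of start."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        if distances[atom] == max_bonds:
            continue  # do not expand beyond the requested depth
        for neighbour in adjacency[atom]:
            if neighbour not in distances:
                distances[neighbour] = distances[atom] + 1
                queue.append(neighbour)
    return distances


# Hypothetical toy input: heavy-atom graph of ethyl methyl ether, C1-C2-O3-C4.
ether = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}

depth_1 = atom_environment(ether, 1, 1)  # atom 1 and its immediate neighbour
depth_4 = atom_environment(ether, 1, 4)  # reaches the whole toy molecule
print(depth_1)
print(depth_4)
```

A depth of 1 sees only the atom and its direct neighbours, whereas larger depths pull in progressively more of the molecule, which is the trade-off between generalization and accuracy that the help message refers to.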

The predictions are generated as a simple HTML page (shown below) which displays the structure of the compound with the predicted sites of metabolism (SoMs) highlighted with yellow circles. Moving the cursor over the structure reveals the atom numbers that correspond to the numbers in the table.

FAME II Output

Produced: 2017-08-15_20-53-43.

Input file: [/Users/username/Desktop/fame2/example_compounds/tamoxifen.sdf].

Visualization:

To alternate between atoms and atom numbers, move the mouse cursor over the figure.

Molecule 2733526
Atom Probability
C.28 0.746
C.27 0.746
C.6 0.696
C.25 0.654
C.26 0.632
C.19 0.088
C.11 0.038
C.22 0.018
C.21 0.018
C.20 0.012
N.2 0.008
C.16 0.006
C.15 0.006
C.18 0.002
C.17 0.002
C.24 0.0
C.23 0.0
C.14 0.0
C.13 0.0
C.12 0.0
C.10 0.0
C.9 0.0
C.8 0.0
C.7 0.0
C.5 0.0
C.4 0.0
C.3 0.0
O.1 0.0
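
If you want to pull the top-ranked atoms out of a table like this, a few lines of Python will do it. The sketch below parses rows copied from the table above and lists the atoms at or above a cutoff. The 0.5 cutoff is purely my own illustrative choice, not a decision threshold documented by FAME 2, and the function names are hypothetical.

```python
# Rows copied from the top of the tamoxifen table above.
table_text = """\
C.28 0.746
C.27 0.746
C.6 0.696
C.25 0.654
C.26 0.632
C.19 0.088
C.11 0.038
"""


def parse_rows(text):
    """Turn 'Atom Probability' lines into (atom, probability) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        atom, probability = line.split()
        rows.append((atom, float(probability)))
    return rows


def atoms_above(rows, cutoff):
    """Return atom labels whose probability is at or above the cutoff."""
    return [atom for atom, probability in rows if probability >= cutoff]


rows = parse_rows(table_text)
flagged = atoms_above(rows, cutoff=0.5)  # 0.5 is an arbitrary, assumed cutoff
print(flagged)
```

The same parsing would need adapting for the CSV files written by the -c option, since their column layout is not shown here.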

I also used SMARTCyp to predict the sites of metabolism for tamoxifen. The results are very similar and predict the known routes of metabolism. In particular, they flag the CYP2D6-mediated 4-hydroxylation to give the active metabolite 4-hydroxytamoxifen and the demethylation sites.

[Figure: SMARTCyp CYP2D6 prediction for tamoxifen]

It is also possible to use SMILES as input:

java -jar /Users/Chris/Desktop/fame2/fame2.jar -s 'CC/C(=C(\c1ccccc1)c1ccc(cc1)OCCN(C)C)c1ccccc1'
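
Since the -m, -o and -s options are all documented in the help message, runs like this are easy to script. The following is a rough sketch that builds one command per model, each writing to its own output directory. The jar path and the directory naming are placeholders of my own, and the commands are only printed here rather than executed.

```python
import shlex

# Model names as listed in the fame2 help message.
MODELS = ("circCDK_ATF_1", "circCDK_4", "circCDK_ATF_6")


def fame2_command(jar_path, smiles, model, output_dir):
    """Build the argument list for one fame2 run on a single SMILES string."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    # -s goes last because it accepts a variable number of SMILES strings.
    return ["java", "-jar", jar_path, "-m", model, "-o", output_dir, "-s", smiles]


jar = "/path/to/fame2.jar"  # hypothetical location, adjust to your install
# Raw string so the backslash in the SMILES is kept literally.
tamoxifen = r"CC/C(=C(\c1ccccc1)c1ccc(cc1)OCCN(C)C)c1ccccc1"

commands = [
    fame2_command(jar, tamoxifen, model, f"fame_results_{model}")
    for model in MODELS
]
for command in commands:
    print(shlex.join(command))  # shell-quoted form, ready to paste
```

To actually run them, each list could be passed to subprocess.run(command, check=True), assuming Java and the jar are available on the machine.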

I looked at the influence of the different models used to generate predictions.

MacPro:~ Chris$ java -jar /Users/Chris/Desktop/fame2/fame2.jar -s 'Cn1cc(cn1)c2ccc3nnc(n3n2)Sc4ccc5c(c4)cccn5' -m circCDK_ATF_1
Selected model: circCDK_ATF_1
Output Directory: fame_results
Loading model...
Processing: [Cn1cc(cn1)c2ccc3nnc(n3n2)Sc4ccc5c(c4)cccn5]
Generating identifier for Cn1cc(cn1)c2ccc3nnc(n3n2)Sc4ccc5c(c4)cccn5: mol_1_1
************** Processing molecule: mol_1_1 **************
WARNING: implicit hydrogens detected for molecule: mol_1_1
Making all hydrogens explicit...
Explicit hydrogens in the original structure: 0
Added hydrogens: 13
Prediction and descriptor calculation finished (mol_1_1). Elapsed time: 1779.6517997 ms.
************** Done (mol_1_1) **************
MacPro:~ Chris$ java -jar /Users/Chris/Desktop/fame2/fame2.jar -s 'Cn1cc(cn1)c2ccc3nnc(n3n2)Sc4ccc5c(c4)cccn5' -m circCDK_4
Selected model: circCDK_4
Output Directory: fame_results
Loading model...
Processing: [Cn1cc(cn1)c2ccc3nnc(n3n2)Sc4ccc5c(c4)cccn5]
Generating identifier for Cn1cc(cn1)c2ccc3nnc(n3n2)Sc4ccc5c(c4)cccn5: mol_1_1
************** Processing molecule: mol_1_1 **************
WARNING: implicit hydrogens detected for molecule: mol_1_1
Making all hydrogens explicit...
Explicit hydrogens in the original structure: 0
Added hydrogens: 13
Prediction and descriptor calculation finished (mol_1_1). Elapsed time: 2453.465267 ms.
************** Done (mol_1_1) **************
MacPro:~ Chris$ java -jar /Users/Chris/Desktop/fame2/fame2.jar -s 'Cn1cc(cn1)c2ccc3nnc(n3n2)Sc4ccc5c(c4)cccn5' -m circCDK_ATF_6
Selected model: circCDK_ATF_6
Output Directory: fame_results
Loading model...
Processing: [Cn1cc(cn1)c2ccc3nnc(n3n2)Sc4ccc5c(c4)cccn5]
Generating identifier for Cn1cc(cn1)c2ccc3nnc(n3n2)Sc4ccc5c(c4)cccn5: mol_1_1
************** Processing molecule: mol_1_1 **************
WARNING: implicit hydrogens detected for molecule: mol_1_1
Making all hydrogens explicit...
Explicit hydrogens in the original structure: 0
Added hydrogens: 13
Prediction and descriptor calculation finished (mol_1_1). Elapsed time: 2632.7860776 ms.
************** Done (mol_1_1) **************
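
For a quick comparison of the timings, the relevant lines of the log can be parsed with a couple of regular expressions. This is just a sketch working on the lines copied from the session above; the function name is my own, and a single run per model is of course only a rough indication of speed.

```python
import re

# Relevant lines copied from the three runs above.
log = """\
Selected model: circCDK_ATF_1
Prediction and descriptor calculation finished (mol_1_1). Elapsed time: 1779.6517997 ms.
Selected model: circCDK_4
Prediction and descriptor calculation finished (mol_1_1). Elapsed time: 2453.465267 ms.
Selected model: circCDK_ATF_6
Prediction and descriptor calculation finished (mol_1_1). Elapsed time: 2632.7860776 ms.
"""


def elapsed_by_model(text):
    """Map each 'Selected model' to the elapsed time (ms) reported after it."""
    timings = {}
    model = None
    for line in text.splitlines():
        selected = re.match(r"Selected model: (\S+)", line)
        if selected:
            model = selected.group(1)
            continue
        elapsed = re.search(r"Elapsed time: ([\d.]+) ms", line)
        if elapsed and model is not None:
            timings[model] = float(elapsed.group(1))
    return timings


timings = elapsed_by_model(log)
fastest = min(timings, key=timings.get)
for model, milliseconds in timings.items():
    print(f"{model}: {milliseconds:.0f} ms")
print("fastest:", fastest)
```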

Whilst the default circCDK_ATF_1 is the fastest, I found instances where circCDK_ATF_6 gave more accurate results, as shown below.

[Figure: predictions from circCDK_ATF_1 and circCDK_ATF_6 compared]

Summary

When I first reviewed FAME2 there were a couple of minor bugs. When I reported them to the developers, the bugs were fixed and a new version of FAME2 was made available within a day, really impressive support! Unfortunately (unlike FAME1) FAME2 only predicts CYP450-mediated metabolism; apparently the non-CYP-mediated metabolism data was not available to the author.