Scaffold Network Generator

Introduction to Scaffold Network Generator

Welcome to the home page for Scaffold Network Generator, an open-source, Java-based tool for generating scaffold networks for millions of molecules at a time. Active molecules in high-throughput screening assays are frequently observed to share common scaffolds, which are of interest to screeners and analysts.

A scaffold is defined as the molecular structure that remains once the terminal side chains of a molecule have been removed. A scaffold therefore consists of rings and the linking chains of atoms between those rings. With the additional step of pruning a terminal ring from a molecular structure, the scaffold generation process may be applied repeatedly to yield a hierarchy of successively more general molecular backbones. Analysis of the molecular scaffold hierarchies elucidated with SNG reveals higher-level trends between structure and activity for a particular screening assay.

SNG serves as a high-throughput tool for the generation of such scaffold hierarchies from small-molecule assays and can aid in the process of discovering novel scaffolds implicated in drug structure-activity relationships.

SNG can generate Murcko scaffolds (Bemis, 1996) and build hierarchical relational networks among the core scaffolds of a molecule by iteratively pruning rings from the scaffold. SNG is capable of exploring the scaffold space using two common techniques:

  1. Scaffold Trees (Schuffenhauer, 2007) explore the scaffold space by iteratively pruning the least-characteristic ring from a molecular scaffold (the least-characteristic ring is defined by a complex set of dataset-independent rules). This process results in a tree of molecular scaffolds.
  2. Scaffold Networks (Varin, 2011) explore the scaffold space by pruning each available ring and generating all possible subscaffolds for a given set of input molecules. The scaffold network algorithm was created to address the inherent limitations of scaffold trees. That is, while scaffold trees explore only a subset of the possible scaffold space (by iteratively pruning ONLY the least-characteristic ring), scaffold networks can explore the entire scaffold space. This process results in a directed, acyclic network of molecular scaffolds.
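
The difference between the two strategies can be illustrated with a toy model. The sketch below is not SNG's implementation: it assumes a scaffold can be represented as a plain set of ring labels, ignores all chemistry (including the requirement that a pruned scaffold remain connected), and uses `min` as a hypothetical stand-in for the least-characteristic-ring rules. All function names are illustrative.

```python
def network_edges(rings):
    """Toy scaffold network: prune every ring of every scaffold in turn."""
    start = frozenset(rings)
    edges = set()
    seen = {start}
    stack = [start]
    while stack:
        parent = stack.pop()
        if len(parent) == 1:
            continue  # a single ring cannot be pruned any further
        for ring in parent:
            child = parent - {ring}
            edges.add((parent, child))
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return edges


def tree_edges(rings, pick_ring=min):
    """Toy scaffold tree: prune only the ring chosen by pick_ring at each step."""
    parent = frozenset(rings)
    edges = []
    while len(parent) > 1:
        child = parent - {pick_ring(parent)}  # stand-in for the real rule set
        edges.append((parent, child))
        parent = child
    return edges
```

For a three-ring scaffold, the network variant produces 9 parent-to-child edges spanning 7 scaffolds, whereas the tree variant follows a single path of 2 edges. Because several parents can share a child, the network is a directed acyclic graph rather than a tree.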

SNG incorporates source code from the Chemistry Development Kit (Steinbeck, 2003) and Scaffold Hunter (Wetzel, 2009), an open-source tool for the creation and visualization of scaffold trees.

Functionality of SNG Versus Existing Software

There are two open-source programs released prior to SNG that can be used to generate scaffold trees. One is Scaffold Hunter (SH), a GUI-based program that has been published and has an active website and user group for continued development. The other is Scaffold Tree Generator (STG), an antiquated command-line implementation of the tree generation algorithm of Scaffold Hunter. The philosophical objective of SNG, SH, and STG is the same: all three programs depict the scaffold space of a given set of molecules, which can be analyzed in early-stage drug discovery to identify and optimize lead candidates. However, they seek to accomplish this task in fundamentally different ways. SNG is designed for high-throughput automated analysis of large, structurally diverse libraries (>225,000 compounds); SH and STG were designed to be used by chemists manually interacting with the scaffold space via a GUI.

The most significant difference between SNG and SH/STG is in their calculation of scaffold networks versus scaffold trees and their potential utilization in an automated system. The calculation of scaffold networks is a task best suited to an automated approach to scaffold space analysis, whereas visual analysis works best with the simpler depiction paradigm afforded by scaffold trees. Similarly, the higher dataset capacity of SNG is more pertinent to an automated solution than to a manual solution such as SH. To help end users decide which tool is best suited to their needs, we summarize the functional differences between SNG, SH, and STG in the table below.

| Feature | SNG | SH | STG |
|---|---|---|---|
| Computes Scaffold Networks | X | - | - |
| Computes Scaffold Trees | X | X | X |
| Automatable Command-line Interface | X | - | X |
| Manual GUI-based Interface | - | X | - |
| Results from runs made in parallel can be merged | X | - | - |
| Upper bound on number of input molecules* | 10,000,000 | 200,000** | 10,000,000 |
| Run-time for a set of 150,000 molecules* | 37m 3s, 17m 3s*** | 50m | 76m 39s |
| Allows navigation of scaffold spaces using the molecular properties of parent scaffolds | - | X | - |
| Identifies when and why molecules fail during program execution | X | - | - |
| Suffers from a CDK error where SMILES input of double bonds in heteroaromatic rings is occasionally parsed internally as single bonds | - | X | X |
| Suffers from a CDK canonization error whereby multiple instances of the same molecule with different input SMILES strings are represented twice in the output scaffold tree | - | - | X |

* Tests were performed on a Mac with an Intel Core i5 at 2.7 GHz and 8 GB of RAM. The JVM was set to a maximum memory allowance of 2 gigabytes using the -Xmx2048m option for all tested applications.

** The maximum number of scaffolds that can be viewed through the GUI is 2,000.

*** The runtime when the 150,000-molecule job is split into 4 jobs run on 4 parallel processes is 17m 3s.

Installation

Linux/MacOSX

We provide an installer script for Mac OSX and Linux systems.

The latest stable release of SNG with the Linux/MacOSX installer can be downloaded here.

Extract the downloaded file from the command line:

$ tar -zxvf sng-1.0.tar.gz
$ cd sng-1.0

and then run the install script to install sng onto the system path (this requires super user privileges):

$ sudo ./install.sh

Finally, restart your shell (or reload your profile) and you should now be able to run sng:

$ sng

First argument must be either generate, aggregate, select, or image.

Windows

An installer script is also provided for Windows. You can download the Windows distribution here.

Extract the zip file using any zip utility. The install script can be invoked via the PowerShell terminal program. Locate the PowerShell program, right-click it, and choose "Run as Administrator". Then execute the following commands:

$ cd <extracted zip directory>
$ .\install.ps1

SNG is installed to C:\ScaffGen by default. Finally, close PowerShell and open a new terminal or PowerShell window. You should now be able to run sng:

$ sng

First argument must be either generate, aggregate, select, or image.

Optional Libraries

When first running sng, you may encounter the following warning:

WARNING: Openbabel library not available, falling back to CDK for SMILES canonicalization

SNG has an optional external dependency on the OpenBabel library (O'Boyle, 2011), which is used to canonicalize SMILES strings and convert SMILES files to SDF format prior to loading them with CDK. This additional conversion step is necessary to avoid some errors that we observed when using CDK and were able to reproduce in the original Scaffold Hunter application.

  • Many SMILES strings were not canonicalized properly. This bug is reproducible in the original Scaffold Hunter application, on which the Scaffold Tree code for SNG is based.
  • When loading SMILES strings from files, double bonds are occasionally replaced by hydrogens in heteroaromatic rings.

To avoid these errors, we recommend following these instructions to build and install OpenBabel from source. When performing this procedure remember the following:

  • In order to compile the Java bindings, you will need to pass the -DJAVA_BINDINGS=ON argument to the cmake command.
  • After make install, the generated library libopenbabel_java.so will NOT be in the Java system path. You will need to install this file in your java.library.path. We recommend installing it into /usr/lib/java on MacOSX, or into (your jvm directory)/jre/lib/(your architecture) on a Linux system, where (your architecture) is likely to be one of i386, i686, or amd64. On Linux, JVMs are usually stored in /usr/lib/jvm.
  • For Mac OS X users, there is a known bug with the compiler output. You will need to rename libopenbabel_java.so to libopenbabel_java.jnilib before installing it in your java.library.path. Additionally, the linker may incorrectly configure the generated library to point to the compiled copy of libopenbabel and not the installed copy. If you get an unexpected segfault, try this workaround. The link applies to the python bindings, but can be easily adapted to be used on the libopenbabel_java.so.
  • If other issues are encountered (such as an unexpected segfault), we recommend trying the solutions referenced here.

Building SNG From Source

If you would like to build SNG from the latest sources, you will need to have the Apache Ant build system and Mercurial installed.

Start by cloning our source repository:

$ hg clone https://bitbucket.org/swamidass/scaffold-network-generator

Next execute the following commands to build and install SNG:

$ cd scaffold-network-generator
$ ant build
$ sudo ant install

If you are in a Windows environment, replace the last command with:

$ sudo ant winstall

Now, restart your shell (or reload your profile), and try running SNG:

$ sng

First argument must be either generate, aggregate, select, or image.

Usage

Once installed, SNG can be invoked on Mac OS X or Linux via the following command:

$ sng [command] [options] [input files]

Where ''command'' is one of: generate, aggregate, select, or image. ''input files'' can be an arbitrarily long list of input molecule files. The valid ''options'' depend on what ''command'' is being invoked.

Usage Tutorial

Suppose we have two files containing compounds in SDF and SMILES format, respectively, which we want to process into a scaffold network.

$ ls
test1.sdf   test2.smi

First, we must generate the scaffold decompositions:

$ sng generate -o test1.tmp test1.sdf
$ sng generate -o test2.tmp test2.smi

Finally, we aggregate the temporary output files into our final network structure:

$ sng aggregate -o test.network test1.tmp test2.tmp

The network is now available in TSV format, stored in the file test.network.

All of the options accepted by Scaffold Network Generator are described below.

Scaffold Network and Tree Generation

SNG is designed to be run on massive datasets in a parallelized environment. Therefore, a two-stage process is employed to generate a scaffold network. First, we recommend breaking up a dataset into many small files containing between 10,000 and 100,000 molecules each. Then, for each file in the dataset, we invoke SNG as in the following example:

$ sng generate [options] -o output_file_xxx.tmp input_file_xxx.smi

We recommend running each individual input file on a separate processor core, utilizing cluster computing if available.
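
The splitting step can be scripted. The following is a minimal sketch, assuming a line-oriented SMILES input file with one molecule per line; the helper names, the `chunk_NNNN.smi` naming scheme, and the default chunk size of 50,000 are illustrative choices rather than anything SNG prescribes. Only the documented `sng generate -o` form is used when building commands.

```python
from pathlib import Path


def chunk_lines(lines, chunk_size):
    """Yield lists of at most chunk_size non-blank lines."""
    chunk = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines so every chunk holds molecules only
        chunk.append(line)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def split_smiles_file(input_path, out_dir, chunk_size=50000):
    """Write numbered chunk files (hypothetical naming) and return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with open(input_path) as handle:
        for index, chunk in enumerate(chunk_lines(handle, chunk_size)):
            path = out_dir / f"chunk_{index:04d}.smi"
            path.write_text("".join(chunk))
            paths.append(path)
    return paths


def generate_command(chunk_path):
    """Build the argument list for `sng generate -o <chunk>.tmp <chunk>`."""
    chunk_path = Path(chunk_path)
    return ["sng", "generate", "-o",
            str(chunk_path.with_suffix(".tmp")), str(chunk_path)]
```

Each argument list returned by `generate_command` can be handed to a job scheduler or to `subprocess.run`, one process per chunk, so that the chunks are processed on separate cores.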

The ''generate'' command allows the following options:

    -o <output file>     : Place the results into the specified file (default output goes to standard out)
    -d                   : Deglycosylate input molecules prior to scaffold generation
    -r                   : Generate a scaffold tree (default is scaffold network generation)
    -p N                 : Prune molecules with more than N rings from the dataset (default = 10)
    -t N                 : Sets the maximum amount of processor time (in seconds) allocated to process an individual scaffold (default=30)

Invoking ''sng generate'' without any arguments will print usage and available options.

After all input files have been processed by ''sng generate'' we must aggregate the results into the final output format. This can be accomplished with the following command:

$ sng aggregate [options] -o result.scaffold output_file_1.tmp output_file_2.tmp ...

The ''aggregate'' command allows the following options:

    -o <output file>    : Place the results into the specified file (default output goes to standard out)
    -s                  : Generate SVG images for each molecule in the scaffold
    -m <file>           : Map molecule IDs from the original input files to scaffold IDs and place the result in the given file
    -d                  : Write the output to SDF format (overrides --svg)
    -a <file>           : Output annotations of side-chain positions observed in the data to the specified file

The final scaffold tree or network can then be found in the file "result.scaffold".

Note that the result of the SDF output option cannot be used as input for the select stage; however, it is possible to output SDF from the select stage.

Selecting Subsets

As a convenience function, SNG provides a utility for selecting a subset of the molecular scaffold network using a molecule-based query (for example, for selecting only the scaffolds for structures of interest from a high-throughput screen). This command operates on the output of the ''sng aggregate'' command described above.

$ sng select [options] -o <output scaffold file> <input scaffold file> <input search molecules>

The output is in the same format as that of ''sng aggregate''. Input search molecules should be given in SMILES or SDF format.

Valid options for ''sng select'' include:

    -o <output file>    : Place the results into the specified file (default output goes to standard out)
    -d                  : Deglycosylate input search molecules prior to searching for scaffold
    -s                  : Include (or generate) SVG images for each molecule in the scaffold
    -d                  : Write the output to SDF format (overrides --svg)

Generating Images

As another convenience, SNG provides the ability to generate SVG images for a set of input molecules without generating any scaffolds. This command can take an input set of molecules in SMILES or SDF format.

$ sng image -o <output file> <input file>

''sng image'' takes only one option:

    -o <output file>    : Place the results into the specified file (default output goes to standard out)

Supported Input Formats

SNG supports input files in both SMILES and SDF formats. We recommend converting other molecular file formats to these formats using OpenBabel.

SMILES Specification

SNG expects SMILES input files to conform to the specifications set forth in the OpenSMILES standard documentation, section 4.5 (http://opensmiles.org/opensmiles.pdf). This is the standard supported by other chemical software including OpenBabel.

More explicitly, SNG expects files without a header line, in which each line begins with a SMILES string, followed by whitespace, followed by a molecule ID. OpenSMILES stipulates that any data can be placed after the whitespace; however, we require a molecule ID so that generated scaffolds can be traced back to their source molecule.

An example supported SMILES input file is shown for reference:

C1Cc2c(C1)cno2	121
O=C(c1noc2c1CCC2)Nc1ccccc1	122
N=c1scc[nH]1	123
N=c1sc2c([nH]1)cccc2	124
O=C(c1ccccc1)/N=c/1\sc2c([nH]1)cccc2	125
c1ccc(cc1)Oc1ccc2c(c1)scn2	126
O=c1ccocc1	127
O=c1ccoc2c1cccc2	128
O=c1cc(/C=C/c2ccccc2)oc2c1cccc2	129
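
Before submitting a large file, it can help to check that every line follows this layout. The sketch below is a hypothetical pre-flight check, not part of SNG: it splits each line on the first run of whitespace and takes the first token that follows as the molecule ID. It does not validate the SMILES string itself.

```python
def parse_smiles_line(line):
    """Return (smiles, molecule_id) for one input line, or raise ValueError."""
    parts = line.split(None, 1)  # split on the first run of whitespace
    if len(parts) < 2:
        raise ValueError(f"missing molecule ID: {line!r}")
    smiles, rest = parts
    molecule_id = rest.split()[0]  # ignore any extra data after the ID
    return smiles, molecule_id


def check_smiles_lines(lines):
    """Return the 1-based numbers of lines that lack a SMILES/ID pair."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            parse_smiles_line(line)
        except ValueError:
            bad.append(number)
    return bad
```

Running `check_smiles_lines` over an open file handle returns an empty list when every line carries both a SMILES string and an ID.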

Output Format

TSV output

By default, SNG generates output files in tab-separated values (TSV) format. A header line is included in each output file indicating the data contained in each column of the file. Typical outputs will differ based on the command and options used. SVG output is given as base64-encoded XML. In order to view these images, they must first be decoded using a base64 decoder with the standard character set.
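
Decoding an SVG field takes one call in most languages. The following minimal sketch uses Python's standard `base64` module; it assumes the decoded XML is UTF-8 text, which is not stated above, and the function names are illustrative.

```python
import base64


def decode_svg(field):
    """Decode one base64 SVG field into XML text (UTF-8 is an assumption)."""
    return base64.b64decode(field).decode("utf-8")


def write_svg(field, path):
    """Decode one SVG field and save it so it can be opened in a viewer."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(decode_svg(field))
```

`write_svg(row_value, "scaffold.svg")` would then produce a file that a browser or vector graphics editor can open.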

SDF output

Additionally, it is possible to write output from the aggregate or select tools to the SDF format using the -d flag. The resulting SDF is formatted according to the SDF specification, with the following additional fields:

  • The TITLE field is set to the scaffold's ID number
  • The SUBSCAFFOLDS field is set to the list of IDs of the scaffold's sub-scaffolds
  • The RINGS field is set to the number of rings contained in the current scaffold
  • The SMILES field contains a canonical SMILES representation of the node

Scaffold Network and Tree Generation

The ''generate'' command will produce an intermediate scaffold file in TSV format containing 4 columns:

  • Number of rings
  • SMILES molecule structure
  • Sub-scaffold structures in SMILES format
  • Molecule ID which generated the scaffold (for top-level scaffolds)

Example:

RINGS   SMILES              SUBSCAFFOLDS                MOLECULES
1       c1ccccc1                                   
1       c1ccccn1                                   
2       c1(c2cccc1)cccn2    c1ccccc1,c1ccccn1           101234

After these intermediates are aggregated using the ''aggregate'' command, the following output format will result (assuming the -s option was used; otherwise, ignore the SVG column):

ID  RINGS   SMILES             SVG             SUBSCAFFOLDS
0   1       c1ccccc1           PHN34sdfc...
1   1       c1ccccn1           PHN34sdfc...
2   2       c1(c2cccc1)cccn2   PHN54zGfh...    0,1
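
An aggregated file can be loaded into an in-memory graph for downstream analysis. The sketch below is one possible reader, assuming the column names shown above, integer scaffold IDs, and a comma-separated SUBSCAFFOLDS field as in the example; the dictionary layout and the function name are illustrative.

```python
import csv


def read_network(handle):
    """Map each scaffold ID to its ring count, SMILES, and child scaffold IDs."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    network = {}
    for row in reader:
        # Leaf scaffolds may have an empty or missing SUBSCAFFOLDS field.
        subs = row.get("SUBSCAFFOLDS") or ""
        network[int(row["ID"])] = {
            "rings": int(row["RINGS"]),
            "smiles": row["SMILES"],
            "children": [int(s) for s in subs.split(",") if s.strip()],
        }
    return network
```

Because the reader looks columns up by header name, it works whether or not the SVG column is present.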

Sample output for the SDF output format (selected with the -d option) can be found here.

If a user has specified the -m option to output a map of molecule IDs to scaffold IDs, that file will appear in TSV format as shown below:

MOLECULE_ID     SCAFFOLD_ID
101234          2

where one molecule ID, scaffold ID pair appears per line. There may be multiple molecule IDs mapping to a single scaffold ID.
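
Since several molecules can map to one scaffold, it is often convenient to invert the map. The following sketch groups molecule IDs by scaffold ID; it assumes the two-column layout with a header line shown above and IDs that contain no whitespace, and the function name is illustrative.

```python
from collections import defaultdict


def molecules_by_scaffold(lines):
    """Group molecule IDs by the scaffold ID they map to."""
    groups = defaultdict(list)
    rows = iter(lines)
    next(rows, None)  # skip the MOLECULE_ID / SCAFFOLD_ID header line
    for line in rows:
        fields = line.split()
        if len(fields) < 2:
            continue  # ignore blank or incomplete lines
        molecule_id, scaffold_id = fields[0], fields[1]
        groups[scaffold_id].append(molecule_id)
    return dict(groups)
```

The size of each group gives the number of input molecules sharing that scaffold, a common starting point for enrichment analysis.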

If a user has specified the -a option to output a map of scaffold IDs to annotated side-chain locations, that file will appear in TSV format as shown below:

SCAFFOLD_ID     ANNOTATIONS
22              [*]c1ccccc1
22              [*]c1cc([*])c([*])cc1

where one scaffold ID, annotated structure pair appears per line. There may be multiple annotations mapping to a single scaffold ID. The empty atom [*] symbol represents the position of the pruned side chain.
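
One simple use of the annotation file is to count how many side-chain positions each observed substitution pattern has. The sketch below counts occurrences of the `[*]` symbol per annotation; it assumes the two-column layout with a header line shown above, and the function name is illustrative.

```python
from collections import defaultdict


def attachment_counts(lines):
    """Map each scaffold ID to the [*] counts of its annotated structures."""
    counts = defaultdict(list)
    rows = iter(lines)
    next(rows, None)  # skip the SCAFFOLD_ID / ANNOTATIONS header line
    for line in rows:
        fields = line.split()
        if len(fields) < 2:
            continue  # ignore blank or incomplete lines
        scaffold_id, annotation = fields[0], fields[1]
        counts[scaffold_id].append(annotation.count("[*]"))
    return dict(counts)
```

For the two example rows above, scaffold 22 is observed once with a single side chain and once with three.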

Selecting Subsets

When submitting a molecular query to an aggregated network file, output data will be in the same formats as those listed above.

Generating Images

When generating SVG images for a set of input molecules, output data will be in TSV format and will include a SMILES column and an SVG column:

SMILES             SVG
c1ccccc1           PHN34sdfc...
c1ccccn1           PHN34sdfc...
c1(c2cccc1)cccn2   PHN54zGfh...

Troubleshooting

With large datasets, you are likely to run into issues with the Java Virtual Machine (JVM) maximum memory constraints.

SNG can be configured to run with a specified maximum JVM memory by setting the JAVA_MEM environment variable (the default is 128 megabytes). For example, to allow SNG to use 1024 megabytes of memory, type the following on the command line before invoking SNG:

$ export JAVA_MEM=1024m
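
When SNG is launched from a script rather than an interactive shell, the same variable can be set per process. This is a minimal sketch assuming only the JAVA_MEM variable described above; the helper name is illustrative.

```python
import os


def sng_environment(max_memory_mb, base_env=None):
    """Return a copy of the environment with JAVA_MEM set, e.g. '1024m'."""
    env = dict(os.environ if base_env is None else base_env)
    env["JAVA_MEM"] = f"{max_memory_mb}m"
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when starting an `sng` process, which leaves the calling shell's own environment unchanged.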

References

  • Bemis, G. W. and Murcko, M. A. (1996). The properties of known drugs. 1. molecular frameworks. Journal of Medicinal Chemistry, 39(15), 2887–2893.
  • O'Boyle, N., Banck, M., James, C., Morley, C., Vandermeersch, T., and Hutchison, G. (2011). Open Babel: An open chemical toolbox. Journal of Cheminformatics, 3, 1–14. 10.1186/1758-2946-3-33.
  • Schuffenhauer, A., Ertl, P., Roggo, S., Wetzel, S., Koch, M. A., and Waldmann, H. (2007). The scaffold tree visualization of the scaffold universe by hierarchical scaffold classification. Journal of Chemical Information and Modeling, 47(1), 47–58. PMID: 17238248.
  • Steinbeck, C., Han, Y., Kuhn, S., Horlacher, O., Luttmann, E., and Willighagen, E. (2003). The chemistry development kit (cdk): an open-source java library for chemo- and bioinformatics. Journal of Chemical Information and Computer Sciences, 43(2), 493–500. PMID: 12653513.
  • Varin, T., Schuffenhauer, A., Ertl, P., and Renner, S. (2011). Mining for bioactive scaffolds with scaffold networks: Improved compound set enrichment from primary screening data. Journal of Chemical Information and Modeling, 51(7), 1528–1538.
  • Wetzel, S., Klein, K., Renner, S., Rauh, D., Oprea, T. I., Mutzel, P., and Waldmann, H. (2009). Interactive exploration of chemical space with Scaffold Hunter. Nat Chem Biol, 5(8), 581–583.
