FMCS 1.0b2 - Find Maximum Common Substructure

Find the MCS of a group (or cluster) of chemical structures and report the result as a SMARTS string. It depends on the RDKit cheminformatics toolkit.

More specifically, the MCS found is a common edge subgraph, and not a common induced subgraph. Only connected MCSes are found; if you want support for disconnected subgraphs then feel free to fund me for it.
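To make the edge-subgraph versus induced-subgraph distinction concrete, here is a small brute-force sketch. It is not part of fmcs; the `embeds` helper and the two toy graphs are made up for illustration. A three-atom chain maps into a three-membered ring as an edge subgraph, but not as an induced subgraph, because the two chain ends become bonded in the ring.

```python
from itertools import permutations


def embeds(pattern_edges, n_pattern, target_edges, n_target, induced):
    """Return True if the pattern graph can be mapped into the target graph.

    Edge subgraph: every pattern bond must land on a target bond.
    Induced subgraph: in addition, atoms that are not bonded in the
    pattern must not be bonded in the target either.
    """
    pattern = {frozenset(e) for e in pattern_edges}
    target = {frozenset(e) for e in target_edges}
    # Try every injective assignment of pattern atoms to target atoms.
    for image in permutations(range(n_target), n_pattern):
        ok = True
        for i in range(n_pattern):
            for j in range(i + 1, n_pattern):
                in_pattern = frozenset((i, j)) in pattern
                in_target = frozenset((image[i], image[j])) in target
                if in_pattern and not in_target:
                    ok = False
                if induced and in_target and not in_pattern:
                    ok = False
        if ok:
            return True
    return False


chain = [(0, 1), (1, 2)]          # three atoms in a row
ring = [(0, 1), (1, 2), (0, 2)]   # three-membered ring

edge_ok = embeds(chain, 3, ring, 3, induced=False)
induced_ok = embeds(chain, 3, ring, 3, induced=True)
print(edge_ok, induced_ok)
```

Under the edge-subgraph definition that fmcs uses, the whole chain counts as common to both toy graphs; under the induced definition it would not.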

This work was funded by Roche and implemented by Andrew Dalke Scientific AB. The software copyright is held by Dalke Scientific and released to the public under the New/2-clause BSD license. See "COPYING" for details.

INSTALLATION

To install this package:

sudo python setup.py install

For details on how to use the Python 'setup.py' tool, see

http://docs.python.org/install/index.html

USAGE

Here is the command-line usage, as shown by "fmcs --help":

    usage: fmcs [-h] [--maximize {atoms,bonds}] [--min-num-atoms INT]
                [--atom-compare {any,elements}]
                [--bond-compare {any,ignore-aromaticity,bondtypes}]
                [--compare {topology,elements,ignore-aromaticity,types}]
                [--ring-matches-ring-only] [--complete-rings-only]
                [--select SELECT] [--timeout SECONDS] [--times] [-v]
                filename

    Find the maximum common substructure of a set of structures

    positional arguments:
      filename              SDF or SMILES file

    optional arguments:
      -h, --help            show this help message and exit
      --maximize {atoms,bonds}
                            Maximize the number of 'atoms' or 'bonds' in the
                            MCS. (Default: bonds)
      --min-num-atoms INT   Minimimum number of atoms in the MCS (Default: 2)
      --atom-compare {any,elements}
                            Specify the atom comparison method. With 'any',
                            every atom matches every other atom. With
                            'elements', atoms match only if they contain the
                            same element. (Default: element)
      --bond-compare {any,ignore-aromaticity,bondtypes}
                            Specify the bond comparison method. With 'any',
                            every bond matches every other bond. With
                            'ignore-aromaticity', aromatic bonds match single,
                            aromatic, and double bonds, but no other types
                            match each other. With 'bondtypes', bonds are the
                            same only if their bond type is the same.
                            (Default: bondtypes)
      --compare {topology,elements,ignore-aromaticity,types}
                            Use 'topology' as a shorthand for '--atom-compare
                            any --bond-compare any', 'elements' is
                            '--atom-compare elements --bond-compare any',
                            'ignore-aromaticity' is '--atom-compare elements
                            --bond-compare ignore-aromaticity', and 'types' is
                            '--atom-compare elements --bond-compare bondtypes'
                            (Default: types)
      --ring-matches-ring-only
                            Modify the bond comparison so that ring bonds only
                            match ring bonds and chain bonds only match chain
                            bonds. (Ring atoms can still match non-ring atoms.)
      --complete-rings-only
                            If a bond is a ring bond in the input structures
                            and a bond is in the MCS then the bond must also be
                            in a ring in the MCS. Selecting this option also
                            enables --ring-matches-ring-only.
      --select SELECT       Select a subset of the input records to process.
                            Example: 1-10,13,20,50- (Default: '1-', which
                            selects all structures)
      --timeout SECONDS     Report the best solution after running for at most
                            'timeout' seconds. Use 'none' for no timeout.
                            (Default: none)
      --times               Print timing information to stderr
      -v, --verbose         Print progress statistics to stderr. Use twice for
                            more verbosity.
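The --select syntax is compact, so a short sketch of how such a specification could be interpreted may help. This is not the fmcs implementation: `parse_select` and `is_selected` are hypothetical helpers, and the sketch assumes that record numbers are 1-based, that ranges are inclusive, and that a trailing '-' means "through the last record".

```python
def parse_select(spec):
    """Parse a spec such as '1-10,13,20,50-' into (start, end) pairs.

    Assumes 1-based record numbers and inclusive ranges; an end of
    None stands for "through the last record".
    """
    ranges = []
    for term in spec.split(","):
        term = term.strip()
        if "-" in term:
            left, right = term.split("-", 1)
            start = int(left)
            end = int(right) if right else None
        else:
            start = end = int(term)
        ranges.append((start, end))
    return ranges


def is_selected(record_number, ranges):
    """Return True if the 1-based record number falls in any range."""
    return any(
        start <= record_number and (end is None or record_number <= end)
        for start, end in ranges
    )


ranges = parse_select("1-10,13,20,50-")
selected = [n for n in range(1, 61) if is_selected(n, ranges)]
print(ranges)
```

With this reading, the default '1-' selects every record, matching the help text.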

EXAMPLES

Find the benzotriazole core of a set of structures using the default atom typing scheme, where atoms with the same element and aromaticity match and bonds with the same type match.

    % fmcs sample_files/benzotriazole.sdf --verbose
    Loaded 3669 structures from sample_files/benzotriazole.sdf
    [#7]:1:[#7]:[#7]:[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:1:2 9 atoms 10 bonds (complete search)
    Total time 5.89 seconds: load 2.76 fragment 2.66 select 0.42 enumerate 0.06 (MCS found after 3.13)

I used the "--verbose" flag (or '-v') so you can see how many structures are in the input data set and how long it took. In this case it took about as long to load the structures as it did to find the MCS. This is because the 'fragment' step was enough to identify the MCS.

Do the same for the ar_clustered_3D_MM_3 data set:

    % fmcs sample_files/ar_clustered_3D_MM_3.sdf --verbose
    Loaded 33 structures from sample_files/ar_clustered_3D_MM_3.sdf
    [#6]-[#6]-[#6]-[#6]-[#6]-[#6]-[#6]-[#6]-[#6](-[#6]-[#6]-[#6]-[#6])-[#6] 14 atoms 13 bonds (complete search)
    Total time 0.12 seconds: load 0.03 fragment 0.02 select 0.00 enumerate 0.07 (MCS found after 0.09)

Here the MCS search took about 3 times as long as loading the structures (0.09 seconds versus 0.03), and most of that time was spent in the enumeration stage of the MCS algorithm.

Compare the previous result to the MCS found by completely ignoring atom and bond type information:

    % fmcs sample_files/ar_clustered_3D_MM_3.sdf --compare topology --verbose
    Loaded 33 structures from sample_files/ar_clustered_3D_MM_3.sdf
    [*]~1~[*]~[*]~[*]~[*]~2~[*]~[*]~[*]~3~[*](~[*]~1~2)~[*]~[*]~[*]~1~[*]~[*]~[*]~[*]~1~3 17 atoms 20 bonds (complete search)
    Total time 0.84 seconds: load 0.03 fragment 0.02 select 0.00 enumerate 0.79 (MCS found after 0.74)

This took longer because it's more difficult to prune the MCS search space when any atom and bond can match any other atom and bond.

Here's the same search, but this time element types matter and bond types do not (for example, a triple bond may match a single bond):

    % fmcs sample_files/ar_clustered_3D_MM_3.sdf --compare elements --verbose
    Loaded 33 structures from sample_files/ar_clustered_3D_MM_3.sdf
    [#6]~1~[#6]~[#6]~[#6]~[#6]~2~[#6]~[#6]~[#6]~3~[#6](~[#6]~1~2)~[#6]~[#6]~[#6]~1~[#6]~[#6]~[#6]~[#6]~1~3 17 atoms 20 bonds (complete search)
    Total time 0.54 seconds: load 0.03 fragment 0.02 select 0.00 enumerate 0.49 (MCS found after 0.49)
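The --compare values used in these examples are shorthand for pairs of --atom-compare and --bond-compare settings. The lookup table below is transcribed from the help text; the names `COMPARE_SHORTHAND` and `expand_compare` are made up for this sketch and are not part of fmcs.

```python
# Shorthand expansions as listed in the "--compare" help text.
COMPARE_SHORTHAND = {
    "topology": ("any", "any"),
    "elements": ("elements", "any"),
    "ignore-aromaticity": ("elements", "ignore-aromaticity"),
    "types": ("elements", "bondtypes"),
}


def expand_compare(name):
    """Return the long-form options equivalent to '--compare NAME'."""
    atom_compare, bond_compare = COMPARE_SHORTHAND[name]
    return ["--atom-compare", atom_compare, "--bond-compare", bond_compare]


for name in COMPARE_SHORTHAND:
    print(name, "->", " ".join(expand_compare(name)))
```

So "--compare elements" in the run above should behave like "--atom-compare elements --bond-compare any".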

The final "--compare" option is "ignore-aromaticity", which says that two atoms are the same if their elements are the same and two bonds match if:

  • one bond is a single bond and the other a single or aromatic bond
  • one bond is an aromatic bond and the other a single, aromatic, or double bond
  • one bond is a double bond and the other an aromatic or double bond
  • one bond is a triple bond and the other a triple bond

    % fmcs sample_files/ar_clustered_3D_MM_3.sdf --compare ignore-aromaticity --verbose
    Read 33 structures from test_data/ar_clustered_3D_MM_3.sdf
    Loaded 33 structures from sample_files/ar_clustered_3D_MM_3.sdf
    [#6]-,:[#6]-,:[#6]-,:[#6]-,:[#6]-,:[#6]-,:[#6]-,:[#6]-,:[#6](-,:[#6]-,:[#6]-,:[#6]-,:[#6])-,:[#6] 14 atoms 13 bonds (complete search)
    Total time 0.12 seconds: load 0.03 fragment 0.02 select 0.00 enumerate 0.07 (MCS found after 0.09)
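The four "ignore-aromaticity" bond rules listed above can be written as a small lookup table. This is only a sketch of the rules as stated in this README, using plain strings for bond types; `IGNORE_AROMATICITY_MATCHES` and `bonds_match` are hypothetical names and say nothing about how fmcs implements the comparison internally.

```python
# For each bond type, the bond types it may match under
# "ignore-aromaticity", transcribed from the rules listed above.
IGNORE_AROMATICITY_MATCHES = {
    "single": {"single", "aromatic"},
    "aromatic": {"single", "aromatic", "double"},
    "double": {"aromatic", "double"},
    "triple": {"triple"},
}


def bonds_match(bond_a, bond_b):
    """Return True if the two bond types match under the rules above."""
    return bond_b in IGNORE_AROMATICITY_MATCHES[bond_a]


print(bonds_match("single", "aromatic"))
print(bonds_match("single", "double"))
```

Note that the table is symmetric: a single bond never matches a double bond directly, even though each of them can match an aromatic bond.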

The default settings allow a match to occur anywhere. For example, a ring bond in one structure might match a chain bond in another structure. This might give results which aren't as chemically useful, or at least chemically pleasing. Use the --ring-matches-ring-only flag to require that ring bonds only match ring bonds and chain bonds only match chain bonds. As the help text notes, this changes only the bond comparison, so ring atoms can still match non-ring atoms.

Even that might be too generous; if a bond is in a ring in the original structure then you might want that bond to still be in a ring of the MCS. For that case, use the --complete-rings-only option. (This implies the --ring-matches-ring-only option.)

Here's an example of --complete-rings-only in action:

    % fmcs sample_files/pdgfrb_clustered_3D_MM_4.sdf
    [#6]:[#6]:1:[#6]:[#6](-[#6](:[#6]:[#6]):[#6]):[#6]:[#7]:[#6]:1 11 atoms 11 bonds (complete search)

To visualize this SMARTS, see http://smartsview.zbh.uni-hamburg.de/. You'll see it matches an aromatic ring of size 6 with a nitrogen in it. One 'side group' is an aromatic bond to a carbon; the other is a non-ring bond to a fragment of an aromatic ring.

With --complete-rings-only, only the ring of size 6 is kept, along with the non-ring side group.

    % fmcs sample_files/pdgfrb_clustered_3D_MM_4.sdf --complete-rings-only
    [#6]-!@[#6]:1:[#6]:[#7]:[#6]:[#6]:[#6]:1 7 atoms 7 bonds (complete search)

You've probably noticed the "(complete search)" on the output lines. This says that the full exhaustive search was done, leaving no doubt that the substructure found is the maximum common substructure. However, the general MCS search problem takes exponential time, and most people would rather have a good answer in seconds than the perfect answer in hours.

The fmcs algorithm keeps track of the biggest substructure found while it continues to search for a larger common substructure. The algorithm is interruptible, so if you give it a timeout, it will stop after that number of seconds has passed and report the best solution it found up to that time. That solution might even be the actual MCS, since fmcs often spends the rest of the search excluding other possibilities.
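The general pattern described here (keep the best answer so far, stop when a deadline passes) can be sketched in a few lines. This is a generic illustration, not the fmcs search itself: `anytime_best`, the toy candidate strings, and the use of `len` as a score are all assumptions made for the example.

```python
import time


def anytime_best(candidates, score, timeout=None):
    """Return (best, completed) for a best-so-far search with a deadline.

    'completed' is True only if every candidate was examined before
    the timeout (in seconds) expired. A timeout of None means no limit.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    best = None
    best_score = None
    for index, candidate in enumerate(candidates):
        # Always examine the first candidate, then check the clock
        # before starting on each later one.
        if index > 0 and deadline is not None and time.monotonic() >= deadline:
            return best, False
        value = score(candidate)
        if best_score is None or value > best_score:
            best, best_score = candidate, value
    return best, True


toy_candidates = ["CC", "CCO", "CCOC"]  # stand-ins for candidate substructures
full_result = anytime_best(toy_candidates, len)
timed_result = anytime_best(toy_candidates, len, timeout=0)
print(full_result, timed_result)
```

The second element of the result plays the same role as the "(complete search)" or "(timed out)" note on the fmcs output line: it tells you whether the answer is proven to be the maximum or is only the best one seen before the deadline.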

For example, this search takes about 8 seconds to complete,

    % fmcs tests/lengthy.smi -v
    Loaded 2 structures from tests/lengthy.smi
    [#6]-[#6]-[#7](-[#6](-[#8]-[#6])=[#8])-[#6](-[#6]:1:[#6]:[#6]:[#6](-[#8]-[#6]):[#6]:[#6]:1)-[#6]:1:[#6]:[#6]:[#6](:[#6]:[#6]:1)-[#8] 23 atoms 24 bonds (complete search)
    Total time 8.11 seconds: load 0.00 fragment 0.00 select 0.00 enumerate 8.10 (MCS found after 2.21)

but the MCS was found in only 2.2 seconds. I can tell fmcs to stop after 3 seconds, using the --timeout option.

    % fmcs tests/lengthy.smi --timeout 3
    [#6]-[#6]-[#7](-[#6](-[#8]-[#6])=[#8])-[#6](-[#6]:1:[#6]:[#6]:[#6](-[#8]-[#6]):[#6]:[#6]:1)-[#6]:1:[#6]:[#6]:[#6](:[#6]:[#6]:1)-[#8] 23 atoms 24 bonds (timed out)
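If you post-process fmcs output in a script, it helps to check whether the search completed or timed out. The sketch below parses a result line in the form shown in these examples; the `parse_result` helper and its regular expression are assumptions based only on the sample output here, not a documented output format.

```python
import re

# Matches lines like "<SMARTS> 23 atoms 24 bonds (timed out)", following
# the sample output in this README. The SMARTS is assumed to contain no
# whitespace.
RESULT_LINE = re.compile(
    r"^(?P<smarts>\S+) (?P<atoms>\d+) atoms (?P<bonds>\d+) bonds "
    r"\((?P<status>[^)]*)\)$"
)


def parse_result(line):
    """Split an fmcs result line into SMARTS, sizes, and completion flag."""
    match = RESULT_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a recognized result line: %r" % (line,))
    return {
        "smarts": match.group("smarts"),
        "num_atoms": int(match.group("atoms")),
        "num_bonds": int(match.group("bonds")),
        "completed": match.group("status") == "complete search",
    }


complete = parse_result(
    "[#6]-!@[#6]:1:[#6]:[#7]:[#6]:[#6]:[#6]:1 7 atoms 7 bonds (complete search)"
)
partial = parse_result("[#6]-[#6] 2 atoms 1 bonds (timed out)")
print(complete["num_atoms"], complete["completed"], partial["completed"])
```

The first test line is copied from the --complete-rings-only example; the second is a shortened, made-up line used only to exercise the "(timed out)" case.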

By the way, if you want to see more progress information during processing (I know I don't like seeing a terminal with no output), you can use the "--verbose" flag twice. I added it just before the 1.0 release and I'm not sure if the output is fully meaningful. For now, just think of it as comforting pixels showing that fmcs is working hard.