Overview

FMCS 1.1 - Find Maximum Common Substructure

Find the MCS of a group (or cluster) of chemical structures and report the result as a SMARTS string. It depends on RDKit cheminformatics toolkit.

More specifically, the MCS found is a common edge subgraph, and not a common induced subgraph. Only connected MCSes are found; if you want support for disconnected subgraphs then feel free to fund me for it.

The project page is https://bitbucket.org/dalke/fmcs/ .

This work was funded by Roche and implemented by Andrew Dalke Scientific AB. The software copyright is held by Dalke Scientific and released to the public under the New/2-clause BSD license. See "COPYING" for details.

ADVERTISING

I do custom software development for computational chemistry and biology. Contact me at dalke@dalkescientific.com if you want to advance fmcs development or need help in customizing it for your needs.

You might also be interested in my chemfp toolkit, which is a set of cross-platform fingerprint generation tools and a high-performance similarity search engine. http://code.google.com/p/chem-fingerprints/

For details about my training courses for cheminformatics researchers who want to be more effective at the programming they need for their resarch, see http://dalkescientific.com/training/ .

INSTALLATION

To install this package:

sudo python setup.py install

For details on how to use the Python 'setup.py' tool, see

http://docs.python.org/install/index.html

USAGE:

Here is the command-line usage using "fmcs --help"

usage: fmcs [-h] [--maximize {atoms,bonds}] [--min-num-atoms INT]
[--compare {topology,elements,types}] [--atom-compare {any,elements,isotopes}] [--bond-compare {any,bondtypes}] [--threshold THRESHOLD] [--atom-class-tag TAG] [--ring-matches-ring-only] [--complete-rings-only] [--select SELECT] [--timeout SECONDS] [--output FILENAME] [--output-format {smarts,fragment-smiles,fragment-sdf,complete-sdf}] [--output-all] [--save-atom-class-tag TAG] [--save-counts-tag TAG] [--save-atom-indices-tag TAG] [--save-smarts-tag TAG] [--save-smiles-tag TAG] [--times] [-v] [--version] filename

Find the maximum common substructure of a set of structures

positional arguments:
filename SDF or SMILES file
optional arguments:
-h, --help show this help message and exit
--maximize {atoms,bonds}
Maximize the number of 'atoms' or 'bonds' in the MCS. (Default: bonds)
--min-num-atoms INT
 Minimimum number of atoms in the MCS (Default: 2)
--compare {topology,elements,types}
Use 'topology' as a shorthand for '--atom-compare any --bond-compare any', 'elements' is '--atom-compare elements --bond-compare any', and 'types' is '--atom- compare elements --bond-compare bondtypes' (Default: types)
--atom-compare {any,elements,isotopes}
Specify the atom comparison method. With 'any', every atom matches every other atom. With 'elements', atoms match only if they contain the same element. With 'isotopes', atoms match only if they have the same isotope number; element information is ignored so [5C] and [5P] are identical. This can be used to implement user-defined atom typing. (Default: elements)
--bond-compare {any,bondtypes}
Specify the bond comparison method. With 'any', every bond matches every other bond. With 'bondtypes', bonds are the same only if their bond types are the same. (Default: bondtypes)
--threshold THRESHOLD
 Minimum structure match threshold. A value of 1.0 means that the common substructure must be in all of the input structures. A value of 0.8 finds the largest substructure which is common to at least 80% of the input structures. (Default: 1.0)
--atom-class-tag TAG
 Use atom class assignments from the field 'TAG'. The tag data must contain a space separated list of integers in the range 1-10000, one for each atom. Atoms are identical if and only if their corresponding atom classes are the same. Note that '003' and '3' are treated as identical values. (Not used by default)
--ring-matches-ring-only
 Modify the bond comparison so that ring bonds only match ring bonds and chain bonds only match chain bonds. (Ring atoms can still match non-ring atoms.)
--complete-rings-only
 If a bond is a ring bond in the input structures and a bond is in the MCS then the bond must also be in a ring in the MCS. Selecting this option also enables --ring-matches-ring-only.
--select SELECT
 Select a subset of the input records to process. Example: 1-10,13,20,50- (Default: '1-', which selects all structures)
--timeout SECONDS
 Report the best solution after running for at most 'timeout' seconds. Use 'none' for no timeout. (Default: none)
--output FILENAME, -o FILENAME
 Write the results to FILENAME (Default: use stdout)
--output-format {smarts,fragment-smiles,fragment-sdf,complete-sdf}
'smarts' writes the SMARTS pattern including the atom and bond criteria. 'fragment-smiles' writes a matching fragment as a SMILES string. 'fragment-sdf' writes a matching fragment as a SD file; see --save-atom-class for details on how atom class information is saved. 'complete-sdf' writes the entire SD file with the fragment information stored in the tag specified by --save-fragment-indices-tag. (Default: smarts)
--output-all By default the structure output formats only show an MCS for the first input structure. If this option is enabled then an MCS for all of the structures are shown.
--save-atom-class-tag TAG
 If atom classes are specified (via --class-tag) and the output format is 'fragment-sdf' then save the substructure atom classes to the tag TAG, in fragment atom order. By default this is the value of --atom- class-tag.
--save-counts-tag TAG
 Save the fragment count, atom count, and bond count to the specified SD tag as space separated integers, like '1 9 8'. (The fragment count will not be larger than 1 until fmcs supports disconnected MCSes.)
--save-atom-indices-tag TAG
 If atom classes are specified and the output format is 'complete-sdf' then save the MCS fragment atom indices to the tag TAG, in MCS order. (Default: mcs-atom- indices)
--save-smarts-tag TAG
 Save the MCS SMARTS to the specified SD tag. Uses '-' if there is no MCS
--save-smiles-tag TAG
 Save the fragment SMILES to the specified SD tag. Uses '-' if there is no MCS
--times Print timing information to stderr
-v, --verbose Print progress statistics to stderr. Use twice for higher verbosity.
--version show program's version number and exit

EXAMPLES

Select atom and bond comparison method

Find the benzotriazole core of a set of structures using the default atom typing scheme, where atoms with the same element and aromaticity match and bonds with the same type match.

% fmcs sample_files/benzotriazole.sdf --verbose Loaded 3669 structures from sample_files/benzotriazole.sdf [#7]:1:[#7]:[#7]:[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:1:2 9 atoms 10 bonds (complete search) Total time 5.89 seconds: load 2.76 fragment 2.66 select 0.42 enumerate 0.06 (MCS found after 3.13)

I used the "--verbose" flag (or '-v') so you can see how many structures are in the input data set and how long it took. In this case it took about as long to load the structures as it did to find the MCS. This is because the 'fragment' step was enough to identify the MCS.

Do the same for the ar_clustered_3D_MM_3 data set:

% fmcs sample_files/ar_clustered_3D_MM_3.sdf --verbose Loaded 33 structures from sample_files/ar_clustered_3D_MM_3.sdf [#6]-[#6]-[#6]-[#6]-[#6]-[#6]-[#6]-[#6]-[#6](-[#6]-[#6]-[#6]-[#6])-[#6] 14 atoms 13 bonds (complete search) Total time 0.12 seconds: load 0.03 fragment 0.02 select 0.00 enumerate 0.07 (MCS found after 0.09)

Here the MCS was about 3 times longer than the load time, and most of the time was spent in running the enumeration stage of the MCS algorithm.

Compare the previous result to the MCS found by completely ignoring atom and bond type information:

% fmcs sample_files/ar_clustered_3D_MM_3.sdf --compare topology --verbose Loaded 33 structures from sample_files/ar_clustered_3D_MM_3.sdf [*]~1~[*]~[*]~[*]~[*]~2~[*]~[*]~[*]~3~[*](~[*]~1~2)~[*]~[*]~[*]~1~[*]~[*]~[*]~[*]~1~3 17 atoms 20 bonds (complete search) Total time 0.84 seconds: load 0.03 fragment 0.02 select 0.00 enumerate 0.79 (MCS found after 0.74)

This took longer because it's more difficult to prune the MCS search space when any atom and bond can match any other atom and bond.

You should wonder why there is a difference. It looks like the benzotriazole data set also contains benzotriazole-like structures, where some of the bond types are different than I expected that they would be.

Find the threshold MCS

The normal MCS algorithm finds the largest substructure which is common to all of the input structures. The 'threshold MCS' finds the largest subgraph which is common to some percentage of the input structure. This is specified with the --threshold option, which must be a value between 0.0 and 1.0.

% fmcs sample_files/cox2.sdf [#6]:1:[#6]:[#6]:[#6](-[#7]:2:[#6]:[#6]:[#7]:[#6]:2-[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:2):[#6]:[#6]:1 17 atoms 19 bonds (complete search) % fmcs sample_files/cox2.sdf --threshold 0.95 [#6]-[#6]:1:[#6]:[#7](:[#6](-[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:2):[#7]:1)-[#6]:1:[#6]:[#6]:[#6]:[#6]:[#6]:1 18 atoms 20 bonds (complete search) % fmcs sample_files/cox2.sdf --threshold 0.90 [#6]-[#6]:1:[#6]:[#7](:[#6](-[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:2):[#7]:1)-[#6]:1:[#6]:[#6]:[#6](-[#16](=[#8])=[#8]):[#6]:[#6]:1 21 atoms 23 bonds (complete search)

This is useful when looking at similarity clustering based on approximate methods like fingerprints. These tend to group structural neighbors together, but occasionally include compounds which aren't structurally related. You might use the threshold option to find the MCS which is in at least 90% of the structures, in hopes of ignoring the odd entries. I conjecture that you could look at the MCS sizes at different levels of threshold to determine overall group similarity.

Generating fragment SMILES

The default output is a single line sent to stdout. It will be in one of three formats, depending on if an MCS was found and if the the MCS search timed out:

% fmcs sample_files/p38_clustered_3D_MM_9.sdf [#6]:[#6](:[#6]:[#6]:[#6])-[#6]:[#6]-[#6]:[#6]:[#6]:[#7]:[#6] 12 atoms 11 bonds (complete search)

% fmcs sample_files/egfr_clustered_3D_MM_2.sdf No MCS found

% fmcs tests/lengthy.smi --timeout 3 [#6]-[#8]-[#6]:1:[#6]:[#6]:[#6](-[#6](-[#6]:2:[#6]:[#6]:[#6](:[#6]:[#6]:2)-[#8])-[#7](-[#6]-[#6])-[#6](-[#8]-[#6])=[#8]):[#6]:[#6]:1 23 atoms 24 bonds (timed out)

If an MCS is found then the line is the SMARTS followed by the number of atom, the word "atoms", the number of bonds, and the word "bonds". If the timeout was reached then the phrase "(timed out)" is added to the end, otherwise it is a "(complete search)".

That's the output you get with the default "--output-format" of "smarts".

With the "fragment-smiles" output format, the output is the SMILES for the fragment and the record identifier. (Note: if there is no MCS then the line will be a blank character followed by the id.)

% fmcs sample_files/p38_clustered_3D_MM_9.sdf --output-format fragment-smiles cccc(c)-cc-cccnc ZINC03832128

Normally this reports a match based on one record. You can get a fragment SMILES for all of the input records using "--output-all"

% fmcs sample_files/p38_clustered_3D_MM_9.sdf --output-format fragment-smiles --output-all cccc(c)-cc-cccnc ZINC03832128 cccc(c)-cc-cccnc ZINC03815736 cccc(c)-cc-cccnc ZINC03832064 cccc(c)-cc-cccnc ZINC03815693

...

but that's pretty boring when using the the default comparison. The "--compare elements" option (which ignores bond types) gives a bigger MCS, and with a bit of post-processing I'll list the unique fragment SMILES matched by that MCS.

% fmcs.py sample_files/p38_clustered_3D_MM_9.sdf --compare elements --output-format fragment-smiles
--output-all | awk '{print $1}' | sort -u

cc-c(cc-c)c-ccccnc cc-cc(-cc)ccCCC[NH2+]C cccc(cc)-cc-cccnc

You can easily see that the elements match but that bond types were ignored.

Separately specify --atom-compare and --bond-compare options

The "--compare" option is a shortcut for common, chemically reasonable comparisons. You can specify the atom and bond comparison separately using "--atom-compare" and "--bond-compare", which means you can do some pretty strange (or interesting!) MCS searches.

Here I explicitly use the default of "--atom-compare elements":

% fmcs --atom-compare elements sample_files/na_clustered_3D_MM_1.sdf [#6]-[#6]-[#7]-[#6] 4 atoms 3 bonds (complete search)

Or, here I use "--atom-compare any" to see subgraphs with the same bond pattern.

% fmcs --atom-compare any sample_files/na_clustered_3D_MM_1.sdf [*]-[*](-[*]-[*])=[*] 5 atoms 4 bonds (complete search)

With the right output options I can even see what the SMARTS pattern matches:

% fmcs --atom-compare any --output-all --output-format fragment-smiles sample_files/na_clustered_3D_MM_1.sdf cNC(=O)C ZINC03581099 cNC(=O)C ZINC03581100 CNC(=O)C ZINC03833958 CNC(=O)O ZINC03581810 CNC(=O)C ZINC04134481 cNC(=O)C ZINC03833955 cNC(=O)C ZINC03833968 cNC(=O)C ZINC03833957 cNC(=O)C ZINC03833960 C=C(C)C[O-] ZINC04134482 CNC(N)=O ZINC03833959 C=C(C)C[O-] ZINC04134483

...

Generating SD files and SD tags

So far I've shown examples of generates SMILES output files. You can get the result as an SD file in one of two forms, either a "fragment-sdf":

% python fmcs.py sample_files/na_clustered_3D_MM_1.sdf --output-format fragment-sdf ZINC03581099

RDKit
4 3 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 0.0000 0.0000 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

1 2 1 0 2 3 1 0 3 4 1 0

M END > <Cluster> 1

> <MODEL.CCRATIO> 1

> <MODEL.SOURCE> CORINA 3.44 0027 09.01.2008

> <b_mmffld_Minimization_Converged-OPLS_2005> 1

> <id> ZINC03581099

> <r_mmffld_Potential_Energy-OPLS_2005> -34.241

> <r_mmffld_RMS_Derivative-OPLS_2005> 6.13019e-05

> <source> na_clustered_3D_MM.sdf


or as a "complete-sdf". On its own, "complete-sdf" is pretty useless. It's the same structure as the input, except that it's been processed by RDKit. What you likely want with this option is to put the match information into an SD tag field. Here I'll omit a lot of the SD record, and focuse on the new tag fields.

% fmcs sample_files/na_clustered_3D_MM_1.sdf --output-format complete-sdf
--save-counts-tag MCS_COUNTS --save-smarts-tag MCS_SMARTS --save-smiles-tag MCS_SMILES --save-atom-indices-tag MCS_ATOMS

...

M END > <Cluster> 1

> <MCS_ATOMS> 0 1 3 4

> <MCS_COUNTS> 1 4 3

> <MCS_SMARTS> [#6]-[#6]-[#7]-[#6]

> <MCS_SMILES> CC(=O)Nc1c(N)cc(C([O-])=O)cc1

> <MODEL.CCRATIO> 1

...

The "atom indices" tag is the list of atom indices which made up the MCS. In this case, the carbon at index 0 is the first atom of the MCS, the carbon at index 1 is the second, the nitrogen at index 3 is the third, and the carbon at index 4 is the fourth atom of the MCS.

The "counts" tag contains the number of fragments, number of atoms, and number of bonds which made up the MCS. With this version of fmcs the number of fragments is either 0 or 1, but future versions may support disconnected MCSes.

The "smarts" and "smiles" tags contain the fragment SMARTS and SMILES.

Save to a file instead of stdout

Use "--output" followed by a filename to save the results to a file, rather than to stdout.

% fmcs --compare topology --output-all --output mcs_output.sdf
--output-format complete-sdf --save-smiles-tag fmcs_smiles sample_files/p38_clustered_3D_MM_9.sdf

% grep --after 1 fmcs_smiles mcs_output.sdf | head -9 > <fmcs_smiles> c1c(-c2ccccc2)[nH]c(-c2cnccc2)c1-c1ccncc1 -- > <fmcs_smiles> Fc1ccc(-c2[nH]c(-c3ccccc3)nc2-c2ccncc2)cc1 -- > <fmcs_smiles> Nc1ccc(-c2cc(-c3ccncc3)c(-c3ccccc3)[nH]2)cc1 --

Modify the bond comparisons

The "--ring-matches-ring-only" option modifies the bond comparison so that ring bonds only match ring bonds and chain bonds only match chain bond. (A "chain bond" is one which isn't in a ring.)

Even that might be too generous; if a bond is in a ring in the original structure, and it's part of the MCS, then you might want that bond to still be in a ring of the MCS. For that case, use the --complete-rings-only option. (This implies --ring-matches-ring-only.)

I don't have a good dataset to show why you would want to use one of these options, in part because most of the data sets contain similar structures and --ring-matches-ring-only doesn't give a different result. I can show you the difference --complete-rings-only makes:

% fmcs --output-format fragment-smiles sample_files/pdgfrb_clustered_3D_MM_4.sdf ccc(c)-c1cc(c)cnc1 ZINC03832246

% fmcs --output-format fragment-smiles --complete-rings-only sample_files/pdgfrb_clustered_3D_MM_4.sdf c-c1cccnc1 ZINC03832246

You can see how the first includes a partial ring (the "ccc(c)") and a complete ring, while the second only includes the complete ring.

This second output also shows an interesting nuance. The --ring-matches-ring-only option requires the ring bonds only match ring bonds, but it says nothing about the atoms. In this case there are two rings, with a single bond connecting them. The atoms on either side of the single bond are in rings, but the bond itself is not.

A stricter version of this might require that ring atoms also only match ring atoms, but then "C1CCCCC1O" and "CO" would not find the MCS of "CO."

Times and timeouts

The fmcs algorithm finds progressively larger common structures in its search to find the MCS. Searches can take a very long time to complete. Use the "--timeout" option to tell fmcs to stop after a certain number of seconds and report the largest structure found up to that point.

% fmcs --output-format fragment-smiles --times tests/lengthy.smi CCN(C(=O)OC)C(c1ccc(O)cc1)c1ccc(OC)cc1 CHEMBL311765 Total time 6.18 seconds: load 0.00 fragment 0.00 select 0.00 enumerate 6.18 (MCS found after 0.53)

I used the "--times" option, which displays to stderr how long different aspects of the program took. I described the format of this line earlier when I talked about the --verbose flag. The --verbose flag automatically toggles the --times option.

Here it says that the complete search finished after 6.18 seconds, but the MCS was found in only 0.53 seconds. This means most of the time is spent proving that the current best common substructure really is the maximum common substructure. (And it tells me, the developer, that I ought to implement some of the other search space pruning methods which are possible.)

My observation is that for lengthy searches like this, either the MCS is found quickly, and most of the time is spent verifying that it's really the MCS, or that the quickly-found common substructure is close to the MCS is size. If you're optimistic, you can set the --timeout to stop after a certain time with the current best match.

% fmcs --timeout 1 --output-format fragment-smiles --times tests/lengthy.smi CCN(C(=O)OC)C(c1ccc(O)cc1)c1ccc(OC)cc1 CHEMBL311765 Total time 1.01 seconds: load 0.00 fragment 0.00 select 0.00 enumerate 1.00

Verbosity

If you want to see more progress information during process - I know I don't like seeing a terminal with no output - you can use the "--verbose" flag twice. Since "--verbose --verbose" is rather lengthy, you might use the single letter version "-v -v", or the even more opaque "-vv" version.

% fmcs -vv tests/lengthy.smi Loaded 2 structures from tests/lengthy.smi Best after 0.0s: 2 atoms 1 bonds [#6]-[#8] Best after 0.0s: 3 atoms 2 bonds [#6]-[#8]-[#6] Best after 0.0s: 4 atoms 3 bonds [#6]-[#8]-[#6]:[#6] Best after 0.0s: 5 atoms 4 bonds [#6]-[#8]-[#6](:[#6]):[#6] Best after 0.0s: 6 atoms 5 bonds [#6]-[#8]-[#6](:[#6]:[#6]):[#6]

Best after 0.2s: 21 atoms 23 bonds [#6]-[#8]-[#6](-[#7]-1-[#6]-[#6]-[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:2-[#6]-1-[#6]:1:[#6]:[#6]:[#6](:[#6]:[#6]:1)-[#8])=[#8] Best after 0.2s: 22 atoms 24 bonds [#6]-[#8]-[#6](-[#7]-1-[#6]-[#6]-[#6]:2:[#6]:[#6](:[#6]:[#6]:[#6]:2-[#6]-1-[#6]:1:[#6]:[#6]:[#6](:[#6]:[#6]:1)-[#8])-[#8])=[#8] Best after 0.5s: 23 atoms 24 bonds [#6]-[#8]-[#6]:1:[#6]:[#6]:[#6](-[#6](-[#6]:2:[#6]:[#6]:[#6](:[#6]:[#6]:2)-[#8])-[#7](-[#6]-[#6])-[#6](-[#8]-[#6])=[#8]):[#6]:[#6]:1 1493 tests of 968 unique SMARTS, cache: 206 True 319 False, search: 268 True 700 False (968 substructure tests)

465 subgraphs enumerated, 407 processed
2899 tests of 1819 unique SMARTS, cache: 408 True 672 False, search: 491 True 1328 False (1819 substructure tests)
9012 tests of 5714 unique SMARTS, cache: 1070 True 2228 False, search: 1512 True 4202 False (5714 substructure tests)
2573 subgraphs enumerated, 2573 processed

[#6]-[#8]-[#6]:1:[#6]:[#6]:[#6](-[#6](-[#6]:2:[#6]:[#6]:[#6](:[#6]:[#6]:2)-[#8])-[#7](-[#6]-[#6])-[#6](-[#8]-[#6])=[#8]):[#6]:[#6]:1 23 atoms 24 bonds (complete search) Total time 6.20 seconds: load 0.00 fragment 0.00 select 0.00 enumerate 6.19 (MCS found after 0.54)

It writes information to stderr every time a better substructure is found, and gives progress statistics every second. I don't know if it's generally meaningful to others, but I like the warm fuzzy feeling of seeing occasional progress information. If a program gives no no output for a long period of time then I wonder if it's frozen for some reason.

Isotopes and atom classes in SMILES

If you looked carefully at the --atom-compare options you'll see that you can compare atoms by isotope number.

That's a very strange ability, yes?

It's available because in SMARTS the isotope number is essentially the only parameter which doesn't affect the chemistry, and because (nearly) every cheminformatics toolkit lets you use an relatively arbitrary integer in that field. For example, RDKit easily lets you have isotope weights from 1 to 1000. (Note that the isotope value "0" has the special meaning of "unspecified/natural isotopic abundance".)

Suppose you want halogens to be considered identical, and all other atoms to be compared based on the element number, and you have a SMILES file which looks like this:

C1CCC(I)CC1 arom1 C1C(Cl)CC(F)CC1 arom2 C1CC(F)C(F)CC1 arom3 C1CC(Br)C(I)C(C)C1 arom4 C1CCC(I)C(F)C1 arom5 C1C(Cl)C(F)C(C)C(I)C1 arom6

(This file is in the distribution under sample_files/smsd_arom.smi )

First you need to set the isotope number. Here's a program to set the halogens to class 9 (that is, all halogens have isotope 9) and set everything else to an isotope equal to its atomic number. Remember, we're treating the molecular graph as a graph, and not a molecule!

################# halogen_classes.py import sys from rdkit import Chem

input_filename = sys.argv[1] output_filename = sys.argv[2] if not input_filename.endswith(".smi"):

raise SystemExit("input filename must be a SMILES file with extension '.smi'")
if not output_filename.endswith(".smi"):
raise SystemExit("output SMILES filename must end with the extension '.smi'")

writer = Chem.SmilesWriter(output_filename, includeHeader=False, isomericSmiles=True) for mol in Chem.SmilesMolSupplier(input_filename):

for atom in mol.GetAtoms():
if atom.GetAtomicNum() in (9, 17, 35, 53, 85):
atom.SetMass(9)
else:
atom.SetMass(atom.GetAtomicNum())

writer.write(mol)

writer.close()

When I run this program I get (also available as test/smsd_arom_isotopes.smi):

% python halogen_classes.py sample_files/smsd_arom.smi test.smi % cat test.smi [9F][6CH]1[6CH2][6CH2][6CH2][6CH]([9Cl])[6CH2]1 arom2 [9F][6CH]1[6CH2][6CH2][6CH2][6CH2][6CH]1[9F] arom3 [6CH3][6CH]1[6CH2][6CH2][6CH2][6CH]([9Br])[6CH]1[9I] arom4 [9F][6CH]1[6CH2][6CH2][6CH2][6CH2][6CH]1[9I] arom5 [6CH3][6CH]1[6CH]([9F])[6CH]([9Cl])[6CH2][6CH2][6CH]1[9I] arom6

And here you see that it made a difference.

% fmcs sample_files/smsd_arom.smi [#6]-1-[#6]-[#6]-[#6]-[#6]-[#6]-1 6 atoms 6 bonds (complete search)

% fmcs --atom-compare isotopes test.smi [9*]-[6*]-1-[6*]-[6*]-[6*]-[6*]-[6*]-1 7 atoms 7 bonds (complete search)

And what exactly did it match?

% fmcs --atom-compare isotopes --output-format fragment-smiles --output-all test.smi FC1C[CH]CCC1 arom2 FC1[CH]CCCC1 arom3 BrC1[CH][CH]CCC1 arom4 FC1[CH]CCCC1 arom5 FC1[CH][CH]CC[CH]1 arom6

Isotopes and atom classes in SD files

The SMILES format is flexible enough to be mistreated this way. The SD format is not. Its isotope range is much more limited, which makes this option more difficult to use directly.

Instead, specify the atom classes via one of the SD tag data fields. I'll use the same SMILES data set as before but save the output to an SD file.

################# halogen_tag.py import sys from rdkit import Chem

input_filename = sys.argv[1] output_filename = sys.argv[2] if not input_filename.endswith(".smi"):

raise SystemExit("input filename must be a SMILES file with extension '.smi'")
if not output_filename.endswith(".sdf"):
raise SystemExit("output SD filename must end with the extension '.sdf'")

writer = Chem.SDWriter(output_filename) for mol in Chem.SmilesMolSupplier(input_filename):

atom_classes = [] for atom in mol.GetAtoms():

if atom.GetAtomicNum() in (9, 17, 35, 53, 85):
atom_classes.append("9")
else:
atom_classes.append(str(atom.GetAtomicNum()))

atom_class_data = " ".join(atom_classes) mol.SetProp("atom_classes", atom_class_data) writer.write(mol)

writer.close()

Here's the conversion step.

% python halogen_tag.py sample_files/smsd_arom.smi test.sdf

The tag data is a space separated list of atom classes, as integers between 1 and 1000.

% tail test.sdf
6 7 1 0 6 8 1 0 8 9 1 0 8 10 1 0

10 1 1 0

M END > <atom_classes> (5) 6 6 9 6 9 6 6 6 9 6


You can read the "6 6 9 6 9 6 6 6 9 6" as "carbon, carbon, halogen, carbon, halogen, ...". Then run fmcs to get the SMARTS, where the isotope numbers match the atom classes.

% fmcs --atom-class-tag atom_classes test.sdf [9*]-[6*]-1-[6*]-[6*]-[6*]-[6*]-[6*]-1 7 atoms 7 bonds (complete search)

Since you're using an SD file as input, you probably want an SD file as output, rather than this strange isotope SMARTS format. You can get the SD file output where the record contains the MCS as a fragment (remember to use "--output-all" to get all of the structures, and "--output" to specify a filename rather than stdout):

% fmcs --atom-class-tag atom_classes --output-format fragment-sdf test.sdf arom2

RDKit
7 7 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 Cl 0 0 0 0 0 0 0 0 0 0 0 0 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

2 1 1 0 3 2 1 0 4 3 1 0 5 4 1 0 6 5 1 0 7 6 1 0 2 7 1 0

M END > <atom_classes> 9 6 6 6 6 6 6


Notice how the atom class tag ("atom_classes" for this case) was modified so it contains the correct list of atom classes for the fragment.

Or, use "--output-format complete-sdf" to get the full SD structure.

% fmcs --output-format complete-sdf test.sdf usage: fmcs [-h] [--maximize {atoms,bonds}] [--min-num-atoms INT]

[--compare {topology,elements,types}] [--atom-compare {any,elements,isotopes}] [--bond-compare {any,bondtypes}] [--atom-class-tag TAG] [--ring-matches-ring-only] [--complete-rings-only] [--select SELECT] [--timeout SECONDS] [--output FILENAME] [--output-format {smarts,fragment-smiles,fragment-sdf,complete-sdf}] [--output-all-structures] [--save-atom-class-tag TAG] [--save-counts-tag TAG] [--save-atom-indices-tag TAG] [--save-smarts-tag TAG] [--save-smiles-tag TAG] [--times] [-v] [--version] filename

fmcs: error: Using --output-format complete-sdf is useless without at least one of --save-atom-indices-tag, --save-smarts-tag, --save-smiles-tag, or --save-counts-tag

Oops! There's no point in doing an MCS search if you don't want to know the results of the search. I don't know how you want the results stored, and I don't yet know a good default, so I have to require you to specify what you want.

I want to know which atoms are in the MCS, as a list of indices, so I'll use the "--save-atom-indices-tag":

% fmcs --atom-class-tag atom_classes --output-format complete-sdf --save-atom-indices "mcs indices" test.sdf arom2

RDKit 3D
8 8 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 0.0000 0.0000 0.0000 Cl 0 0 0 0 0 0 0 0 0 0 0 0 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 0.0000 0.0000 0.0000 F 0 0 0 0 0 0 0 0 0 0 0 0 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

1 2 1 0 2 3 1 0 2 4 1 0 4 5 1 0 5 6 1 0 5 7 1 0 7 8 1 0 8 1 1 0

M END > <atom_classes> 6 6 9 6 6 9 6 6

> <mcs indices> 2 1 0 7 6 4 3


Let's see, so atom 2 (a halogen .. yep, chlorine) is in the MCS, and atoms 1, 0, 7, 6, 4, and 3, all carbons, are also in the MCS. Yep, that's what I got before!

Again, remember to use "--output-all" to get all of the structures and not just one.

Select a subset of the input structures

The --select option lets you which select of the input structure to use, instead of using all of the compounds in the data file. For example, to use the first 25 records of the benzotriazoles:

% fmcs --select 1-25 sample_files/benzotriazole.sdf [#6]-[#6]-[#6]-[#7]-[#6](=[#8])-[#6]-[#7]:1:[#7]:[#7]:[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:2:1 16 atoms 17 bonds (complete search)

This is a lot larger than the MCS of the entire data set; that being 9 atoms and 10 bonds.

This option was added to help figure out which structure didn't have the expected MCS. I did a binary search to exclude ranges until I found the one which was wrong.

However, that's a laborious way to do it. I don't think this option will continue for the future. Also, a limitation is the implementation is that it parses all of the structures, even if the structure is going to be ignored. Eg, using the last 10 structures of a file takes to parse as using all the structures in a file.