Commits

Anonymous committed af016fd

Lots more documentation!

Comments (0)

Files changed (1)

 
 Here is the command-line usage using "fmcs --help"
 
-
 usage: fmcs [-h] [--maximize {atoms,bonds}] [--min-num-atoms INT]
-            [--atom-compare {any,elements}]
-            [--bond-compare {any,ignore-aromaticity,bondtypes}]
-            [--compare {topology,elements,ignore-aromaticity,types}]
+            [--compare {topology,elements,types}]
+            [--atom-compare {any,elements,isotopes}]
+            [--bond-compare {any,bondtypes}] [--atom-class-tag TAG]
             [--ring-matches-ring-only] [--complete-rings-only]
-            [--select SELECT] [--timeout SECONDS] [--times] [-v]
+            [--select SELECT] [--timeout SECONDS] [--output FILENAME]
+            [--output-format {smarts,fragment-smiles,fragment-sdf,complete-sdf}]
+            [--output-all-structures] [--save-atom-class-tag TAG]
+            [--save-counts-tag TAG] [--save-atom-indices-tag TAG]
+            [--save-smarts-tag TAG] [--save-smiles-tag TAG] [--times] [-v]
+            [--version]
             filename
 
 Find the maximum common substructure of a set of structures
                         Maximize the number of 'atoms' or 'bonds' in the MCS.
                         (Default: bonds)
   --min-num-atoms INT   Minimimum number of atoms in the MCS (Default: 2)
-  --atom-compare {any,elements}
+  --compare {topology,elements,types}
+                        Use 'topology' as a shorthand for '--atom-compare any
+                        --bond-compare any', 'elements' is '--atom-compare
+                        elements --bond-compare any', and 'types' is '--atom-
+                        compare elements --bond-compare bondtypes' (Default:
+                        types)
+  --atom-compare {any,elements,isotopes}
                         Specify the atom comparison method. With 'any', every
                         atom matches every other atom. With 'elements', atoms
-                        match only if they contain the same element. (Default:
-                        element)
-  --bond-compare {any,ignore-aromaticity,bondtypes}
+                        match only if they contain the same element. With
+                        'isotopes', atoms match only if they have the same
+                        isotope number; element information is ignored so [5C]
+                        and [5P] are identical. This can be used to implement
+                        user-defined atom typing. (Default: elements)
+  --bond-compare {any,bondtypes}
                         Specify the bond comparison method. With 'any', every
-                        bond matches every other bond. With 'ignore-
-                        aromaticity', aromatic bonds match single, aromatic,
-                        and double bonds, but no other types match each other.
-                        With 'bondtypes', bonds are the same only if their
-                        bond type is the same. (Default: bondtypes)
-  --compare {topology,elements,ignore-aromaticity,types}
-                        Use 'topology' as a shorthand for '--atom-compare any
-                        --bond-compare any', 'elements' is '--atom-compare
-                        elements --bond-compare any', 'ignore-aromaticity' is
-                        '--atom-compare elements --bond-compare ignore-
-                        aromaticity', and 'types' is '--atom-compare elements
-                        --bond-compare bondtypes' (Default: types)
+                        bond matches every other bond. With 'bondtypes', bonds
+                        are the same only if their bond types are the same.
+                        (Default: bondtypes)
+  --atom-class-tag TAG  Use atom class assignments from the field 'TAG'. The
+                        tag data must contain a space separated list of
+                        integers in the range 1-10000, one for each atom.
+                        Atoms are identical if and only if their corresponding
+                        atom classes are the same. Note that '003' and '3' are
+                        treated as identical values. (Not used by default)
   --ring-matches-ring-only
                         Modify the bond comparison so that ring bonds only
                         match ring bonds and chain bonds only match chain
   --timeout SECONDS     Report the best solution after running for at most
                         'timeout' seconds. Use 'none' for no timeout.
                         (Default: none)
+  --output FILENAME, -o FILENAME
+                        Write the results to FILENAME (Default: use stdout)
+  --output-format {smarts,fragment-smiles,fragment-sdf,complete-sdf}
+                        'smarts' writes the SMARTS pattern including the atom
+                        and bond criteria. 'fragment-smiles' writes a matching
+                        fragment as a SMILES string. 'fragment-sdf' writes a
+                        matching fragment as a SD file; see --save-atom-class
+                        for details on how atom class information is saved.
+                        'complete-sdf' writes the entire SD file with the
+                        fragment information stored in the tag specified by
+                        --save-fragment-indices-tag. (Default: smarts)
+  --output-all-structures
+                        By default the structure output formats only show an
+                        MCS for the first input structure. If this option is
+                        enabled then an MCS for all of the structures are
+                        shown.
+  --save-atom-class-tag TAG
+                        If atom classes are specified (via --class-tag) and
+                        the output format is 'fragment-sdf' then save the
+                        substructure atom classes to the tag TAG, in fragment
+                        atom order. By default this is the value of --atom-
+                        class-tag.
+  --save-counts-tag TAG
+                        Save the fragment count, atom count, and bond count to
+                        the specified SD tag as space separated integers, like
+                        '1 9 8'. (The fragment count will not be larger than 1
+                        until fmcs supports disconnected MCSes.)
+  --save-atom-indices-tag TAG
+                        If atom classes are specified and the output format is
+                        'complete-sdf' then save the MCS fragment atom indices
+                        to the tag TAG, in MCS order. (Default: mcs-atom-
+                        indices)
+  --save-smarts-tag TAG
+                        Save the MCS SMARTS to the specified SD tag. Uses '-'
+                        if there is no MCS
+  --save-smiles-tag TAG
+                        Save the fragment SMILES to the specified SD tag. Uses
+                        '-' if there is no MCS
   --times               Print timing information to stderr
   -v, --verbose         Print progress statistics to stderr. Use twice for
-                        more verbosity.
+                        higher verbosity.
+  --version             show program's version number and exit
+
+
 
 
 EXAMPLES
 ========
 
+Select atom and bond comparison method
+--------------------------------------
 
 Find the benzotriazole core of a set of structures using the default
 atom typing scheme, where atoms with the same element and aromaticity
 This took longer because it's more difficult to prune the MCS search
 space when any atom and bond can match any other atom and bond.
 
+You should wonder why there is a difference. It looks like the
+benzotriazole data set also contains benzotriazole-like structures,
+where some of the bond types are different than I expected that they
+would be.
 
-Here's the same search but this time the element types are important
-but the bond types are not important (for example, a triple bond may
-match a single bond):
 
-% fmcs sample_files/ar_clustered_3D_MM_3.sdf --compare elements --verbose
-Loaded 33 structures from sample_files/ar_clustered_3D_MM_3.sdf     
-[#6]~1~[#6]~[#6]~[#6]~[#6]~2~[#6]~[#6]~[#6]~3~[#6](~[#6]~1~2)~[#6]~[#6]~[#6]~1~[#6]~[#6]~[#6]~[#6]~1~3 17 atoms 20 bonds (complete search)
-Total time 0.54 seconds: load 0.03 fragment 0.02 select 0.00 enumerate 0.49 (MCS found after 0.49)
+Generating fragment SMILES
+--------------------------
 
+The default output is a single line sent to stdout. It will be in one
+of three formats, depending on if an MCS was found and if the the MCS
+search timed out:
 
-The final "--compare" option is "ignore-aromaticity", which says that
-two atoms are the same if their elements are the same and two bonds
-match if:
+% fmcs sample_files/p38_clustered_3D_MM_9.sdf 
+[#6]:[#6](:[#6]:[#6]:[#6])-[#6]:[#6]-[#6]:[#6]:[#6]:[#7]:[#6] 12 atoms 11 bonds (complete search)
 
-  - one bond is a single bonds and the other a single or aromatic bonds
-  - one bond is an aromatic bond and the other a single, aromatic, or double bond
-  - one bond is a double bonds and the other an aromatic or double bond
-  - one bond is a triple bond and the other a triple bond
+% fmcs sample_files/egfr_clustered_3D_MM_2.sdf
+No MCS found
 
+% fmcs tests/lengthy.smi --timeout 3
+[#6]-[#8]-[#6]:1:[#6]:[#6]:[#6](-[#6](-[#6]:2:[#6]:[#6]:[#6](:[#6]:[#6]:2)-[#8])-[#7](-[#6]-[#6])-[#6](-[#8]-[#6])=[#8]):[#6]:[#6]:1 23 atoms 24 bonds (timed out)
 
+If an MCS is found then the line is the SMARTS followed by the number
+of atom, the word "atoms", the number of bonds, and the word
+"bonds". If the timeout was reached then the phrase "(timed out)" is
+added to the end, otherwise it is a "(complete search)".
 
-% fmcs sample_files/ar_clustered_3D_MM_3.sdf --compare ignore-aromaticity --verbose
-Read 33 structures from test_data/ar_clustered_3D_MM_3.sdf
-Loaded 33 structures from sample_files/ar_clustered_3D_MM_3.sdf     
-[#6]-,:[#6]-,:[#6]-,:[#6]-,:[#6]-,:[#6]-,:[#6]-,:[#6]-,:[#6](-,:[#6]-,:[#6]-,:[#6]-,:[#6])-,:[#6] 14 atoms 13 bonds (complete search)
-Total time 0.12 seconds: load 0.03 fragment 0.02 select 0.00 enumerate 0.07 (MCS found after 0.09)
+That's the output you get with the default "--output-format" of "smarts".
 
+With the "fragment-smiles" output format, the output is the SMILES for
+the fragment and the record identifier. (Note: if there is no MCS then
+the line will be a blank character followed by the id.)
 
+% fmcs sample_files/p38_clustered_3D_MM_9.sdf --output-format fragment-smiles
+cccc(c)-cc-cccnc ZINC03832128
 
-The default settings allow a match to occur anywhere. For example, a
-ring carbon in one structure might match a chain carbon in another
-structure. This might give results which aren't as chemically useful,
-or at least chemically pleasing. Use the --ring-matches-ring-only flag
-to require that ring atoms only match ring atoms and ring bonds only
-match ring bonds, and likewise that chain atoms and bonds only match
-chain atoms and bonds.
+
+Normally this reports a match based on one record. You can get a
+fragment SMILES for all of the input records using "--output-all"
+
+% fmcs sample_files/p38_clustered_3D_MM_9.sdf --output-format fragment-smiles --output-all
+cccc(c)-cc-cccnc ZINC03832128
+cccc(c)-cc-cccnc ZINC03815736
+cccc(c)-cc-cccnc ZINC03832064
+cccc(c)-cc-cccnc ZINC03815693
+   ...
+
+but that's pretty boring when using the the default comparison. The
+"--compare elements" option (which ignores bond types) gives a bigger
+MCS, and with a bit of post-processing I'll list the unique fragment
+SMILES matched by that MCS.
+
+% fmcs.py sample_files/p38_clustered_3D_MM_9.sdf --compare elements --output-format fragment-smiles \
+       --output-all | awk '{print $1}' | sort -u
+cc-c(cc-c)c-ccccnc
+cc-cc(-cc)ccCCC[NH2+]C
+cccc(cc)-cc-cccnc
+
+You can easily see that the elements match but that bond types were ignored.
+
+
+Independently specify --atom-compare and --bond-compare options
+---------------------------------------------------------------
+
+You can specify the base atom and bond comparisons 
+
+The "--compare" option is a shortcut. You can specify the atom and
+bond comparison directly using "--atom-compare" and "--bond-compare".
+
+Here I explicitly use the default of "--atom-compare elements":
+
+% fmcs --atom-compare elements sample_files/na_clustered_3D_MM_1.sdf 
+[#6]-[#6]-[#7]-[#6] 4 atoms 3 bonds (complete search)
+
+Or, here I use "--atom-compare any" to see subgraphs with the same
+bond pattern.
+
+% fmcs --atom-compare any sample_files/na_clustered_3D_MM_1.sdf
+[*]-[*](-[*]-[*])=[*] 5 atoms 4 bonds (complete search)
+
+With the right output options I can even see what the SMARTS pattern matches:
+
+% fmcs --atom-compare any --output-all --output-format fragment-smiles sample_files/na_clustered_3D_MM_1.sdf 
+cNC(=O)C ZINC03581099
+cNC(=O)C ZINC03581100
+CNC(=O)C ZINC03833958
+CNC(=O)O ZINC03581810
+CNC(=O)C ZINC04134481
+cNC(=O)C ZINC03833955
+cNC(=O)C ZINC03833968
+cNC(=O)C ZINC03833957
+cNC(=O)C ZINC03833960
+C=C(C)C[O-] ZINC04134482
+CNC(N)=O ZINC03833959
+C=C(C)C[O-] ZINC04134483
+  ...
+
+
+
+Generating SD files and SD tags
+-------------------------------
+
+So far I've shown examples of 
+You can get the result as an SD file in one of two forms, either a "fragmenet-sdf":
+
+% python fmcs.py sample_files/na_clustered_3D_MM_1.sdf --output-format fragment-sdf
+ZINC03581099
+     RDKit          
+
+  4  3  0  0  0  0  0  0  0  0999 V2000
+    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
+    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
+    0.0000    0.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
+    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
+  1  2  1  0
+  2  3  1  0
+  3  4  1  0
+M  END
+> <Cluster>
+1
+
+> <MODEL.CCRATIO>
+1
+
+> <MODEL.SOURCE>
+CORINA 3.44 0027  09.01.2008
+
+> <b_mmffld_Minimization_Converged-OPLS_2005>
+1
+
+> <id>
+ZINC03581099
+
+> <r_mmffld_Potential_Energy-OPLS_2005>
+-34.241
+
+> <r_mmffld_RMS_Derivative-OPLS_2005>
+6.13019e-05
+
+> <source>
+na_clustered_3D_MM.sdf
+
+$$$$
+
+
+or as a "complete-sdf". On its own, "complete-sdf" is pretty
+useless. It's the same structure as the input, except that it's been
+processed by RDKit. What you likely want with this option is to put
+the match information into an SD tag field. Here I'll omit a lot of
+the SD record, and focuse on the new tag fields.
+
+% fmcs sample_files/na_clustered_3D_MM_1.sdf --output-format complete-sdf \
+    --save-counts-tag MCS_COUNTS --save-smarts-tag MCS_SMARTS --save-smiles-tag MCS_SMILES \
+    --save-atom-indices-tag MCS_ATOMS
+
+ ...
+M  END
+> <Cluster>
+1
+
+> <MCS_ATOMS>
+0 1 3 4
+
+> <MCS_COUNTS>
+1 4 3
+
+> <MCS_SMARTS>
+[#6]-[#6]-[#7]-[#6]
+
+> <MCS_SMILES>
+CC(=O)Nc1c(N)cc(C([O-])=O)cc1
+
+> <MODEL.CCRATIO>
+1
+ ...
+
+
+The "atom indices" tag is the list of atom indices which made up the
+MCS. In this case, the carbon at index 0 is the first atom of the MCS,
+the carbon at index 1 is the second, the nitrogen at index 3 is the
+third, and the carbon at index 4 is the fourth atom of the MCS.
+
+The "counts" tag contains the number of fragments, number of atoms,
+and number of bonds which made up the MCS. With this version of fmcs
+the number of fragments is either 0 or 1, but future versions may
+support disconnected MCSes.
+
+The "smarts" and "smiles" tags contain the fragment SMARTS and SMILES.
+
+
+Save to a file instead of stdout
+--------------------------------
+
+Use "--output" followed by a filename to save the results to a file,
+rather than to stdout.
+
+% fmcs --compare topology --output-all --output mcs_output.sdf \
+   --output-format complete-sdf --save-smiles-tag fmcs_smiles sample_files/p38_clustered_3D_MM_9.sdf
+
+% grep --after 1 fmcs_smiles mcs_output.sdf | head -9
+> <fmcs_smiles>
+c1c(-c2ccccc2)[nH]c(-c2cnccc2)c1-c1ccncc1
+--
+> <fmcs_smiles>
+Fc1ccc(-c2[nH]c(-c3ccccc3)nc2-c2ccncc2)cc1
+--
+> <fmcs_smiles>
+Nc1ccc(-c2cc(-c3ccncc3)c(-c3ccccc3)[nH]2)cc1
+--
+
+
+Modify the bond comparisons
+---------------------------
+
+The "--ring-matches-ring-only" option modifies the bond comparison so
+that ring bonds only match ring bonds and chain bonds only match chain
+bond. (A "chain bond" is one which isn't in a ring.)
 
 Even that might be too generous; if a bond is in a ring in the
-original structure then you might want that bond to still be in a ring
-of the MCS. For that case, us the --complete-rings-only option. (This
-implies the --ring-matches-ring-only option.)
+original structure, and it's part of the MCS, then you might want that
+bond to still be in a ring of the MCS. For that case, use the
+--complete-rings-only option. (This implies --ring-matches-ring-only.)
 
-Here's an example of --complete-rings-only in action:
+I don't have a good dataset to show why you would want to use one of
+these options, in part because most of the data sets contain similar
+structures and --ring-matches-ring-only doesn't give a different
+result. I can show you the difference --complete-rings-only makes:
 
-% fmcs sample_files/pdgfrb_clustered_3D_MM_4.sdf
-[#6]:[#6]:1:[#6]:[#6](-[#6](:[#6]:[#6]):[#6]):[#6]:[#7]:[#6]:1 11 atoms 11 bonds (complete search)
+% fmcs --output-format fragment-smiles sample_files/pdgfrb_clustered_3D_MM_4.sdf
+ccc(c)-c1cc(c)cnc1 ZINC03832246
 
-To visualize this SMARTS, see http://smartsview.zbh.uni-hamburg.de/.
-You'll see it matches an aromatic ring of size 6 with an nitrogen in
-it. One 'side group' is an aromatic bond to a carbon, the other is an
-non-ring bond to a fragment of an aromatic ring.
+% fmcs --output-format fragment-smiles --complete-rings-only sample_files/pdgfrb_clustered_3D_MM_4.sdf
+c-c1cccnc1 ZINC03832246
 
-With --complete-rings-only, only the ring of size 6 is kept, along
-with the non-ring side group.
+You can see how the first includes a partial ring (the "ccc(c)") and a
+complete ring, while the second only includes the complete
+ring. 
 
-% fmcs sample_files/pdgfrb_clustered_3D_MM_4.sdf --complete-rings-only
-[#6]-!@[#6]:1:[#6]:[#7]:[#6]:[#6]:[#6]:1 7 atoms 7 bonds (complete search)
+This second output also shows an interesting nuance. The
+--ring-matches-ring-only option requires the ring bonds only match
+ring bonds, but it says nothing about the atoms. In this case there
+are two rings, with a single bond connecting them. The atoms on either
+side of the single bond are in rings, but the bond itself is not.
 
+A stricter version of this might require that ring atoms also only
+match ring atoms, but then "C1CCCCC1O" and "CO" would not find the MCS
+of "CO."
 
 
-You've probably noticed the "(complete search)" on the output
-lines. This says that the full exhaustive search was done, leaving no
-doubt that the substructure found is the maximum common
-substructure. However, the general MCS search problem takes
-exponential time, and most people would rather have a good answer in
-seconds than the perfect answer in hours.
+Times and timeouts
+------------------
 
-The fmcs algorithm keeps track of the biggest substructure found,
-while it contines to search for a larger common substructure. The
-algorithm is interruptable, so if you give it a time out, it will stop
-after that number of seconds have passed, and report the best solution
-it found up to that time. It might even be the actual MCS, since often
-fmcs is excluding other possibilities
+The fmcs algorithm finds progressively larger common structures in its
+search to find the MCS. Searches can take a very long time to
+complete. Use the "--timeout" option to tell fmcs to stop after a
+certain number of seconds and report the largest structure found up to
+that point.
 
+% fmcs --output-format fragment-smiles --times tests/lengthy.smi 
+CCN(C(=O)OC)C(c1ccc(O)cc1)c1ccc(OC)cc1 CHEMBL311765
+Total time 6.18 seconds: load 0.00 fragment 0.00 select 0.00 enumerate 6.18 (MCS found after 0.53)
 
+I used the "--times" option, which displays to stderr how long
+different aspects of the program took. I described the format of this
+line earlier when I talked about the --verbose flag. The --verbose
+flag automatically toggles the --times option.
 
-For example, this takes about 8 seconds to find the MCS,
+Here it says that the complete search finished after 6.18 seconds, but
+the MCS was found in only 0.53 seconds. This means most of the time is
+spent proving that the current best common substructure really is the
+maximum common substructure. (And it tells me, the developer, that I
+ought to implement some of the other search space pruning methods
+which are possible.)
 
-% fmcs tests/lengthy.smi -v
+My observation is that for lengthy searches like this, either the MCS
+is found quickly, and most of the time is spent verifying that it's
+really the MCS, or that the quickly-found common substructure is close
+to the MCS is size. If you're optimistic, you can set the --timeout to
+stop after a certain time with the current best match.
+
+
+% fmcs --timeout 1 --output-format fragment-smiles --times tests/lengthy.smi
+CCN(C(=O)OC)C(c1ccc(O)cc1)c1ccc(OC)cc1 CHEMBL311765
+Total time 1.01 seconds: load 0.00 fragment 0.00 select 0.00 enumerate 1.00
+
+Verbosity
+---------
+
+If you want to see more progress information during process - I know I
+don't like seeing a terminal with no output - you can use the
+"--verbose" flag twice. Since "--verbose --verbose" is rather lengthy,
+you might use the single letter version "-v -v", or the even more
+opaque "-vv" version.
+
+
+% fmcs -vv tests/lengthy.smi
 Loaded 2 structures from tests/lengthy.smi     
-[#6]-[#6]-[#7](-[#6](-[#8]-[#6])=[#8])-[#6](-[#6]:1:[#6]:[#6]:[#6](-[#8]-[#6]):[#6]:[#6]:1)-[#6]:1:[#6]:[#6]:[#6](:[#6]:[#6]:1)-[#8] 23 atoms 24 bonds (complete search)
-Total time 8.11 seconds: load 0.00 fragment 0.00 select 0.00 enumerate 8.10 (MCS found after 2.21)
+Best after 0.0s: 2 atoms 1 bonds [#6]-[#8]
+Best after 0.0s: 3 atoms 2 bonds [#6]-[#8]-[#6]
+Best after 0.0s: 4 atoms 3 bonds [#6]-[#8]-[#6]:[#6]
+Best after 0.0s: 5 atoms 4 bonds [#6]-[#8]-[#6](:[#6]):[#6]
+Best after 0.0s: 6 atoms 5 bonds [#6]-[#8]-[#6](:[#6]:[#6]):[#6]
+  ....
+Best after 0.2s: 21 atoms 23 bonds [#6]-[#8]-[#6](-[#7]-1-[#6]-[#6]-[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:2-[#6]-1-[#6]:1:[#6]:[#6]:[#6](:[#6]:[#6]:1)-[#8])=[#8]
+Best after 0.2s: 22 atoms 24 bonds [#6]-[#8]-[#6](-[#7]-1-[#6]-[#6]-[#6]:2:[#6]:[#6](:[#6]:[#6]:[#6]:2-[#6]-1-[#6]:1:[#6]:[#6]:[#6](:[#6]:[#6]:1)-[#8])-[#8])=[#8]
+Best after 0.5s: 23 atoms 24 bonds [#6]-[#8]-[#6]:1:[#6]:[#6]:[#6](-[#6](-[#6]:2:[#6]:[#6]:[#6](:[#6]:[#6]:2)-[#8])-[#7](-[#6]-[#6])-[#6](-[#8]-[#6])=[#8]):[#6]:[#6]:1
+1493 tests of 968 unique SMARTS, cache: 206 True 319 False, search: 268 True 700 False (968 substructure tests)
+  465 subgraphs enumerated, 407 processed
+2899 tests of 1819 unique SMARTS, cache: 408 True 672 False, search: 491 True 1328 False (1819 substructure tests)
+   ....
+9012 tests of 5714 unique SMARTS, cache: 1070 True 2228 False, search: 1512 True 4202 False (5714 substructure tests)
+  2573 subgraphs enumerated, 2573 processed
+[#6]-[#8]-[#6]:1:[#6]:[#6]:[#6](-[#6](-[#6]:2:[#6]:[#6]:[#6](:[#6]:[#6]:2)-[#8])-[#7](-[#6]-[#6])-[#6](-[#8]-[#6])=[#8]):[#6]:[#6]:1 23 atoms 24 bonds (complete search)
+Total time 6.20 seconds: load 0.00 fragment 0.00 select 0.00 enumerate 6.19 (MCS found after 0.54)
 
 
-but the MCS was found in only 2.2 seconds. I can tell fmcs to stop
-after 3 seconds, using the --timeout option.
+It writes information to stderr every time a better substructure is
+found, and gives progress statistics every second. I don't know if
+it's generally meaningful to others, but I like the warm fuzzy feeling
+of seeing occasional progress information. If a program gives no no
+output for a long period of time then I wonder if it's frozen for some
+reason.
 
 
-% fmcs tests/lengthy.smi --timeout 3
-[#6]-[#6]-[#7](-[#6](-[#8]-[#6])=[#8])-[#6](-[#6]:1:[#6]:[#6]:[#6](-[#8]-[#6]):[#6]:[#6]:1)-[#6]:1:[#6]:[#6]:[#6](:[#6]:[#6]:1)-[#8] 23 atoms 24 bonds (timed out)
+Isotopes and atom classes in SMILES
+-----------------------------------
 
+If you looked carefully at the --atom-compare options you'll see that
+you can compare atoms by isotope number.
 
-By the way, if you want to see more progress information during
-process - I know I don't like seeing a terminal with no output - you
-can use the "--verbose" flag twice. I added it just before the 1.0
-release and I'm not sure if the output is fully meaninful. For now,
-just think of it as comforting pixels showing that fmcs is working
-hard.
+That's a very strange ability, yes?
 
+It's available because in SMARTS the isotope number is essentially the
+only parameter which doesn't affect the chemistry, and because
+(nearly) every cheminformatics toolkit lets you use an relatively
+arbitrary integer in that field. For example, RDKit easily lets you
+have isotope weights from 1 to 1000. (Note that the isotope value "0"
+has the special meaning of "unspecified/natural isotopic abundance".)
+
+Suppose you want halogens to be considered identical, and all other
+atoms to be compared based on the element number, and you have a
+SMILES file which looks like this:
+
+C1CCC(I)CC1 arom1
+C1C(Cl)CC(F)CC1 arom2
+C1CC(F)C(F)CC1 arom3
+C1CC(Br)C(I)C(C)C1 arom4
+C1CCC(I)C(F)C1 arom5
+C1C(Cl)C(F)C(C)C(I)C1 arom6
+
+(This file is in the distribution under sample_files/smsd_arom.smi )
+
+First you need to set the isotope number. Here's a program to set the
+halogens to class 9 (that is, all halogens have isotope 9) and set
+everything else to an isotope equal to its atomic number. Remember,
+we're treating the molecular graph as a graph, and not a molecule!
+
+
+################# halogen_classes.py
+import sys
+from rdkit import Chem
+
+input_filename = sys.argv[1]
+output_filename = sys.argv[2]
+if not input_filename.endswith(".smi"):
+    raise SystemExit("input filename must be a SMILES file with extension '.smi'")
+if not output_filename.endswith(".smi"):
+    raise SystemExit("output SMILES filename must end with the extension '.smi'")
+
+writer = Chem.SmilesWriter(output_filename, includeHeader=False, isomericSmiles=True)
+for mol in Chem.SmilesMolSupplier(input_filename):
+  for atom in mol.GetAtoms():
+    if atom.GetAtomicNum() in (9, 17, 35, 53, 85):
+      atom.SetMass(9)
+    else:
+      atom.SetMass(atom.GetAtomicNum())
+  writer.write(mol)
+writer.close()
+#################
+
+
+When I run this program I get:
+
+% python halogen_classes.py sample_files/smsd_arom.smi test.smi
+% cat test.smi 
+[9F][6CH]1[6CH2][6CH2][6CH2][6CH]([9Cl])[6CH2]1 arom2
+[9F][6CH]1[6CH2][6CH2][6CH2][6CH2][6CH]1[9F] arom3
+[6CH3][6CH]1[6CH2][6CH2][6CH2][6CH]([9Br])[6CH]1[9I] arom4
+[9F][6CH]1[6CH2][6CH2][6CH2][6CH2][6CH]1[9I] arom5
+[6CH3][6CH]1[6CH]([9F])[6CH]([9Cl])[6CH2][6CH2][6CH]1[9I] arom6
+
+
+And here you see that it made a difference.
+
+% fmcs sample_files/smsd_arom.smi 
+[#6]-1-[#6]-[#6]-[#6]-[#6]-[#6]-1 6 atoms 6 bonds (complete search)
+
+% fmcs --atom-compare isotopes test.smi
+[9*]-[6*]-1-[6*]-[6*]-[6*]-[6*]-[6*]-1 7 atoms 7 bonds (complete search)
+
+And what exactly did it match?
+
+% fmcs --atom-compare isotopes --output-format fragment-smiles --output-all test.smi
+FC1C[CH]CCC1 arom2
+FC1[CH]CCCC1 arom3
+BrC1[CH][CH]CCC1 arom4
+FC1[CH]CCCC1 arom5
+FC1[CH][CH]CC[CH]1 arom6
+
+
+Isotopes and atom classes in SD files
+-------------------------------------
+
+
+The SMILES format is flexible enough to be mistreated this way. The SD
+format is not. Its isotope range is much more limited, which makes
+this option more difficult to use directly.
+
+Instead, specify the atom classes via one of the SD tag data
+fields. I'll use the same SMILES data set as before but save the
+output to an SD file.
+
+
+################# halogen_tag.py
+import sys
+from rdkit import Chem
+
+input_filename = sys.argv[1]
+output_filename = sys.argv[2]
+if not input_filename.endswith(".smi"):
+    raise SystemExit("input filename must be a SMILES file with extension '.smi'")
+if not output_filename.endswith(".sdf"):
+    raise SystemExit("output SD filename must end with the extension '.sdf'")
+
+writer = Chem.SDWriter(output_filename)
+for mol in Chem.SmilesMolSupplier(input_filename):
+  atom_classes = []
+  for atom in mol.GetAtoms():
+    if atom.GetAtomicNum() in (9, 17, 35, 53, 85):
+      atom_classes.append("9")
+    else:
+      atom_classes.append(str(atom.GetAtomicNum()))
+  atom_class_data = " ".join(atom_classes)
+  mol.SetProp("atom_classes", atom_class_data)
+  writer.write(mol)
+writer.close()
+#################
+
+Here's the conversion step.
+
+% python halogen_tag.py sample_files/smsd_arom.smi test.sdf
+
+The tag data is a space separated list of atom classes, as integers
+between 1 and 1000.
+
+% tail test.sdf 
+  6  7  1  0
+  6  8  1  0
+  8  9  1  0
+  8 10  1  0
+ 10  1  1  0
+M  END
+>  <atom_classes>  (5) 
+6 6 9 6 9 6 6 6 9 6
+
+$$$$
+
+You can read the "6 6 9 6 9 6 6 6 9 6" as "carbon, carbon, halogen,
+carbon, halogen, ...". Then run fmcs to get the SMARTS, where the
+isotope numbers match the atom classes.
+
+% fmcs --atom-class-tag atom_classes test.sdf
+[9*]-[6*]-1-[6*]-[6*]-[6*]-[6*]-[6*]-1 7 atoms 7 bonds (complete search)
+
+Since you're using an SD file as input, you probably want an SD file
+as output, rather than this strange isotope SMARTS format. You can get
+the SD file output where the record contains the MCS as a fragment
+(remember to use "--output-all" to get all of the structures, and
+"--output" to specify a filename rather than stdout):
+
+% fmcs --atom-class-tag atom_classes --output-format fragment-sdf test.sdf
+arom2
+     RDKit          
+
+  7  7  0  0  0  0  0  0  0  0999 V2000
+    0.0000    0.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0
+    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
+    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
+    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
+    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
+    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
+    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
+  2  1  1  0
+  3  2  1  0
+  4  3  1  0
+  5  4  1  0
+  6  5  1  0
+  7  6  1  0
+  2  7  1  0
+M  END
+> <atom_classes>
+9 6 6 6 6 6 6
+
+$$$$
+
+
+Notice how the atom class tag ("atom_classes" for this case) was
+modified so it contains the correct list of atom classes for the
+fragment.
+
+Or, use "--output-format complete-sdf" to get the full SD structure. 
+
+
+% fmcs --output-format complete-sdf test.sdf
+usage: fmcs [-h] [--maximize {atoms,bonds}] [--min-num-atoms INT]
+            [--compare {topology,elements,types}]
+            [--atom-compare {any,elements,isotopes}]
+            [--bond-compare {any,bondtypes}] [--atom-class-tag TAG]
+            [--ring-matches-ring-only] [--complete-rings-only]
+            [--select SELECT] [--timeout SECONDS] [--output FILENAME]
+            [--output-format {smarts,fragment-smiles,fragment-sdf,complete-sdf}]
+            [--output-all-structures] [--save-atom-class-tag TAG]
+            [--save-counts-tag TAG] [--save-atom-indices-tag TAG]
+            [--save-smarts-tag TAG] [--save-smiles-tag TAG] [--times] [-v]
+            [--version]
+            filename
+fmcs: error: Using --output-format complete-sdf is useless without at least one of --save-atom-indices-tag, --save-smarts-tag, --save-smiles-tag, or --save-counts-tag
+
+
+Oops! There's no point in doing an MCS search if you don't want to
+know the results of the search. I don't know how you want the results
+stored, and I don't yet know a good default, so I have to require you
+to specify what you want.
+
+I want to know which atoms are in the MCS, as a list of indices, so
+I'll use the "--save-atom-indices-tag":
+
+% fmcs --atom-class-tag atom_classes --output-format complete-sdf --save-atom-indices "mcs indices" test.sdf
+arom2
+     RDKit          3D
+
+  8  8  0  0  0  0  0  0  0  0999 V2000
+    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
+    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
+    0.0000    0.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0
+    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
+    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
+    0.0000    0.0000    0.0000 F   0  0  0  0  0  0  0  0  0  0  0  0
+    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
+    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
+  1  2  1  0
+  2  3  1  0
+  2  4  1  0
+  4  5  1  0
+  5  6  1  0
+  5  7  1  0
+  7  8  1  0
+  8  1  1  0
+M  END
+> <atom_classes>
+6 6 9 6 6 9 6 6
+
+> <mcs indices>
+2 1 0 7 6 4 3
+
+$$$$
+
+
+Let's see, so atom 2 (a halogen .. yep, chlorine) is in the MCS, and
+atoms 1, 0, 7, 6, 4, and 3, all carbons, are also in the MCS. Yep,
+that's what I got before!
+
+Again, remember to use "--output-all" to get all of the structures and
+not just one.
+
+
+Select a subset of the input structures
+---------------------------------------
+
+The --select option lets you which select of the input structure to
+use, instead of using all of the compounds in the data file. For
+example, to use the first 25 records of the benzotriazoles:
+
+% fmcs --select 1-25 sample_files/benzotriazole.sdf
+[#6]-[#6]-[#6]-[#7]-[#6](=[#8])-[#6]-[#7]:1:[#7]:[#7]:[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:2:1 16 atoms 17 bonds (complete search)
+
+This is a lot larger than the MCS of the entire data set; that being 9
+atoms and 10 bonds.
+
+This option was added to help figure out which structure didn't have
+the expected MCS. I did a binary search to exclude ranges until I
+found the one which was wrong.
+
+However, that's a laborious way to do it. I don't think this option
+will continue for the future. Also, a limitation is the implementation
+is that it parses all of the structures, even if the structure is
+going to be ignored. Eg, using the last 10 structures of a file takes
+to parse as using all the structures in a file.