Source

fmcs / README

Andrew Dalke a98ecf6 
dalke d4c7ca5 




dalke 6129f6e 



dalke 2b85365 

dalke d4c7ca5 
dalke 09a2532 


dalke d4c7ca5 
dalke 0d66905 











dalke abd8d0f 
dalke e568f8b 

dalke abd8d0f 
dalke 0d66905 
dalke d4c7ca5 















dalke 756f0d8 
dalke d4c7ca5 
dalke 756f0d8 
dalke af016fd 

Andrew Dalke eb89be6 



dalke af016fd 
dalke 7715506 
dalke af016fd 


dalke 756f0d8 
dalke d4c7ca5 








dalke 756f0d8 

dalke af016fd 






dalke 756f0d8 

dalke af016fd 





dalke 756f0d8 
dalke af016fd 


Andrew Dalke eb89be6 





dalke af016fd 





dalke d4c7ca5 
dalke 756f0d8 







dalke d4c7ca5 
dalke 756f0d8 

dalke d4c7ca5 

dalke 756f0d8 
dalke af016fd 










dalke 7715506 
dalke af016fd 
























dalke 756f0d8 

dalke af016fd 



dalke d4c7ca5 




dalke af016fd 

dalke d4c7ca5 




dalke 756f0d8 



dalke d4c7ca5 
dalke 756f0d8 





dalke d4c7ca5 



dalke 756f0d8 



dalke d4c7ca5 
dalke 756f0d8 


dalke d4c7ca5 




dalke 756f0d8 



dalke d4c7ca5 
dalke 756f0d8 

dalke d4c7ca5 
dalke af016fd 



dalke d4c7ca5 

Andrew Dalke eb89be6 
























dalke af016fd 

dalke 756f0d8 
dalke af016fd 


dalke 756f0d8 
dalke af016fd 

dalke 756f0d8 
dalke af016fd 

dalke 756f0d8 
dalke af016fd 

dalke 756f0d8 
dalke af016fd 



dalke 756f0d8 
dalke af016fd 
dalke d4c7ca5 
dalke af016fd 


dalke d4c7ca5 
dalke af016fd 

dalke d4c7ca5 
dalke af016fd 
























dalke 7715506 
dalke af016fd 

dalke 7715506 



dalke af016fd 

































dalke 7715506 

dalke af016fd 
















































































































dalke d4c7ca5 
dalke 756f0d8 
dalke af016fd 


dalke d4c7ca5 
dalke af016fd 



dalke d4c7ca5 
dalke af016fd 

dalke d4c7ca5 
dalke af016fd 

dalke 756f0d8 
dalke af016fd 


dalke 756f0d8 
dalke af016fd 




dalke 756f0d8 
dalke af016fd 


dalke 756f0d8 

dalke af016fd 

dalke 756f0d8 
dalke af016fd 




dalke 756f0d8 
dalke af016fd 


dalke 756f0d8 
dalke af016fd 



dalke 756f0d8 
dalke af016fd 





dalke 756f0d8 
dalke af016fd 










dalke 7715506 
dalke af016fd 










dalke 756f0d8 
dalke af016fd 
















dalke 756f0d8 

dalke af016fd 





dalke 756f0d8 

dalke af016fd 

dalke 756f0d8 
dalke af016fd 

dalke 756f0d8 
dalke af016fd 
dalke 756f0d8 
dalke af016fd 
















































dalke 7715506 
dalke af016fd 











dalke 7715506 
dalke af016fd 









































































































































































































  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
           FMCS 1.1 - Find Maximum Common Substructure

Find the MCS of a group (or cluster) of chemical structures and report
the result as a SMARTS string. It depends on RDKit cheminformatics
toolkit.

More specifically, the MCS found is a common edge subgraph, and not a
common induced subgraph. Only connected MCSes are found; if you want
support for disconnected subgraphs then feel free to fund me for it.

The project page is https://bitbucket.org/dalke/fmcs/ .

This work was funded by Roche and implemented by Andrew Dalke
Scientific AB. The software copyright is held by Dalke Scientific and
released to the public under the New/2-clause BSD license. See
"COPYING" for details.

ADVERTISING
===========

I do custom software development for computational chemistry and
biology. Contact me at dalke@dalkescientific.com if you want to
advance fmcs development or need help in customizing it for your
needs.

You might also be interested in my chemfp toolkit, which is a set of
cross-platform fingerprint generation tools and a high-performance
similarity search engine. http://code.google.com/p/chem-fingerprints/

For details about my training courses for cheminformatics researchers
who want to be more effective at the programming they need for their
resarch, see http://dalkescientific.com/training/ .


INSTALLATION
============

To install this package:

  sudo python setup.py install


For details on how to use the Python 'setup.py' tool, see

  http://docs.python.org/install/index.html

USAGE:
======


Here is the command-line usage using "fmcs --help"

usage: fmcs [-h] [--maximize {atoms,bonds}] [--min-num-atoms INT]
            [--compare {topology,elements,types}]
            [--atom-compare {any,elements,isotopes}]
            [--bond-compare {any,bondtypes}] [--threshold THRESHOLD]
            [--atom-class-tag TAG] [--ring-matches-ring-only]
            [--complete-rings-only] [--select SELECT] [--timeout SECONDS]
            [--output FILENAME]
            [--output-format {smarts,fragment-smiles,fragment-sdf,complete-sdf}]
            [--output-all] [--save-atom-class-tag TAG]
            [--save-counts-tag TAG] [--save-atom-indices-tag TAG]
            [--save-smarts-tag TAG] [--save-smiles-tag TAG] [--times] [-v]
            [--version]
            filename
Find the maximum common substructure of a set of structures

positional arguments:
  filename              SDF or SMILES file

optional arguments:
  -h, --help            show this help message and exit
  --maximize {atoms,bonds}
                        Maximize the number of 'atoms' or 'bonds' in the MCS.
                        (Default: bonds)
  --min-num-atoms INT   Minimimum number of atoms in the MCS (Default: 2)
  --compare {topology,elements,types}
                        Use 'topology' as a shorthand for '--atom-compare any
                        --bond-compare any', 'elements' is '--atom-compare
                        elements --bond-compare any', and 'types' is '--atom-
                        compare elements --bond-compare bondtypes' (Default:
                        types)
  --atom-compare {any,elements,isotopes}
                        Specify the atom comparison method. With 'any', every
                        atom matches every other atom. With 'elements', atoms
                        match only if they contain the same element. With
                        'isotopes', atoms match only if they have the same
                        isotope number; element information is ignored so [5C]
                        and [5P] are identical. This can be used to implement
                        user-defined atom typing. (Default: elements)
  --bond-compare {any,bondtypes}
                        Specify the bond comparison method. With 'any', every
                        bond matches every other bond. With 'bondtypes', bonds
                        are the same only if their bond types are the same.
                        (Default: bondtypes)
  --threshold THRESHOLD
                        Minimum structure match threshold. A value of 1.0
                        means that the common substructure must be in all of
                        the input structures. A value of 0.8 finds the largest
                        substructure which is common to at least 80% of the
                        input structures. (Default: 1.0)
  --atom-class-tag TAG  Use atom class assignments from the field 'TAG'. The
                        tag data must contain a space separated list of
                        integers in the range 1-10000, one for each atom.
                        Atoms are identical if and only if their corresponding
                        atom classes are the same. Note that '003' and '3' are
                        treated as identical values. (Not used by default)
  --ring-matches-ring-only
                        Modify the bond comparison so that ring bonds only
                        match ring bonds and chain bonds only match chain
                        bonds. (Ring atoms can still match non-ring atoms.)
  --complete-rings-only
                        If a bond is a ring bond in the input structures and a
                        bond is in the MCS then the bond must also be in a
                        ring in the MCS. Selecting this option also enables
                        --ring-matches-ring-only.
  --select SELECT       Select a subset of the input records to process.
                        Example: 1-10,13,20,50- (Default: '1-', which selects
                        all structures)
  --timeout SECONDS     Report the best solution after running for at most
                        'timeout' seconds. Use 'none' for no timeout.
                        (Default: none)
  --output FILENAME, -o FILENAME
                        Write the results to FILENAME (Default: use stdout)
  --output-format {smarts,fragment-smiles,fragment-sdf,complete-sdf}
                        'smarts' writes the SMARTS pattern including the atom
                        and bond criteria. 'fragment-smiles' writes a matching
                        fragment as a SMILES string. 'fragment-sdf' writes a
                        matching fragment as a SD file; see --save-atom-class
                        for details on how atom class information is saved.
                        'complete-sdf' writes the entire SD file with the
                        fragment information stored in the tag specified by
                        --save-fragment-indices-tag. (Default: smarts)
  --output-all          By default the structure output formats only show an
                        MCS for the first input structure. If this option is
                        enabled then an MCS for all of the structures are
                        shown.
  --save-atom-class-tag TAG
                        If atom classes are specified (via --class-tag) and
                        the output format is 'fragment-sdf' then save the
                        substructure atom classes to the tag TAG, in fragment
                        atom order. By default this is the value of --atom-
                        class-tag.
  --save-counts-tag TAG
                        Save the fragment count, atom count, and bond count to
                        the specified SD tag as space separated integers, like
                        '1 9 8'. (The fragment count will not be larger than 1
                        until fmcs supports disconnected MCSes.)
  --save-atom-indices-tag TAG
                        If atom classes are specified and the output format is
                        'complete-sdf' then save the MCS fragment atom indices
                        to the tag TAG, in MCS order. (Default: mcs-atom-
                        indices)
  --save-smarts-tag TAG
                        Save the MCS SMARTS to the specified SD tag. Uses '-'
                        if there is no MCS
  --save-smiles-tag TAG
                        Save the fragment SMILES to the specified SD tag. Uses
                        '-' if there is no MCS
  --times               Print timing information to stderr
  -v, --verbose         Print progress statistics to stderr. Use twice for
                        higher verbosity.
  --version             show program's version number and exit




EXAMPLES
========

Select atom and bond comparison method
--------------------------------------

Find the benzotriazole core of a set of structures using the default
atom typing scheme, where atoms with the same element and aromaticity
match and bonds with the same type match.

% fmcs sample_files/benzotriazole.sdf --verbose
Loaded 3669 structures from sample_files/benzotriazole.sdf     
[#7]:1:[#7]:[#7]:[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:1:2 9 atoms 10 bonds (complete search)
Total time 5.89 seconds: load 2.76 fragment 2.66 select 0.42 enumerate 0.06 (MCS found after 3.13)

I used the "--verbose" flag (or '-v') so you can see how many
structures are in the input data set and how long it took. In this
case it took about as long to load the structures as it did to find
the MCS. This is because the 'fragment' step was enough to identify
the MCS.



Do the same for the ar_clustered_3D_MM_3 data set:

% fmcs sample_files/ar_clustered_3D_MM_3.sdf --verbose
Loaded 33 structures from sample_files/ar_clustered_3D_MM_3.sdf     
[#6]-[#6]-[#6]-[#6]-[#6]-[#6]-[#6]-[#6]-[#6](-[#6]-[#6]-[#6]-[#6])-[#6] 14 atoms 13 bonds (complete search)
Total time 0.12 seconds: load 0.03 fragment 0.02 select 0.00 enumerate 0.07 (MCS found after 0.09)

Here the MCS was about 3 times longer than the load time, and most of
the time was spent in running the enumeration stage of the MCS
algorithm.


Compare the previous result to the MCS found by completely ignoring
atom and bond type information:

% fmcs sample_files/ar_clustered_3D_MM_3.sdf --compare topology --verbose
Loaded 33 structures from sample_files/ar_clustered_3D_MM_3.sdf     
[*]~1~[*]~[*]~[*]~[*]~2~[*]~[*]~[*]~3~[*](~[*]~1~2)~[*]~[*]~[*]~1~[*]~[*]~[*]~[*]~1~3 17 atoms 20 bonds (complete search)
Total time 0.84 seconds: load 0.03 fragment 0.02 select 0.00 enumerate 0.79 (MCS found after 0.74)

This took longer because it's more difficult to prune the MCS search
space when any atom and bond can match any other atom and bond.

You should wonder why there is a difference. It looks like the
benzotriazole data set also contains benzotriazole-like structures,
where some of the bond types are different than I expected that they
would be.


Find the threshold MCS
----------------------

The normal MCS algorithm finds the largest substructure which is
common to all of the input structures. The 'threshold MCS' finds the
largest subgraph which is common to some percentage of the input
structure. This is specified with the --threshold option, which must
be a value between 0.0 and 1.0.

% fmcs sample_files/cox2.sdf
[#6]:1:[#6]:[#6]:[#6](-[#7]:2:[#6]:[#6]:[#7]:[#6]:2-[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:2):[#6]:[#6]:1 17 atoms 19 bonds (complete search)
% fmcs sample_files/cox2.sdf --threshold 0.95
[#6]-[#6]:1:[#6]:[#7](:[#6](-[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:2):[#7]:1)-[#6]:1:[#6]:[#6]:[#6]:[#6]:[#6]:1 18 atoms 20 bonds (complete search)
% fmcs sample_files/cox2.sdf --threshold 0.90
[#6]-[#6]:1:[#6]:[#7](:[#6](-[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:2):[#7]:1)-[#6]:1:[#6]:[#6]:[#6](-[#16](=[#8])=[#8]):[#6]:[#6]:1 21 atoms 23 bonds (complete search)

This is useful when looking at similarity clustering based on
approximate methods like fingerprints. These tend to group structural
neighbors together, but occasionally include compounds which aren't
structurally related. You might use the threshold option to find the
MCS which is in at least 90% of the structures, in hopes of ignoring
the odd entries. I conjecture that you could look at the MCS sizes at
different levels of threshold to determine overall group similarity.


Generating fragment SMILES
--------------------------

The default output is a single line sent to stdout. It will be in one
of three formats, depending on if an MCS was found and if the the MCS
search timed out:

% fmcs sample_files/p38_clustered_3D_MM_9.sdf 
[#6]:[#6](:[#6]:[#6]:[#6])-[#6]:[#6]-[#6]:[#6]:[#6]:[#7]:[#6] 12 atoms 11 bonds (complete search)

% fmcs sample_files/egfr_clustered_3D_MM_2.sdf
No MCS found

% fmcs tests/lengthy.smi --timeout 3
[#6]-[#8]-[#6]:1:[#6]:[#6]:[#6](-[#6](-[#6]:2:[#6]:[#6]:[#6](:[#6]:[#6]:2)-[#8])-[#7](-[#6]-[#6])-[#6](-[#8]-[#6])=[#8]):[#6]:[#6]:1 23 atoms 24 bonds (timed out)

If an MCS is found then the line is the SMARTS followed by the number
of atom, the word "atoms", the number of bonds, and the word
"bonds". If the timeout was reached then the phrase "(timed out)" is
added to the end, otherwise it is a "(complete search)".

That's the output you get with the default "--output-format" of "smarts".

With the "fragment-smiles" output format, the output is the SMILES for
the fragment and the record identifier. (Note: if there is no MCS then
the line will be a blank character followed by the id.)

% fmcs sample_files/p38_clustered_3D_MM_9.sdf --output-format fragment-smiles
cccc(c)-cc-cccnc ZINC03832128


Normally this reports a match based on one record. You can get a
fragment SMILES for all of the input records using "--output-all"

% fmcs sample_files/p38_clustered_3D_MM_9.sdf --output-format fragment-smiles --output-all
cccc(c)-cc-cccnc ZINC03832128
cccc(c)-cc-cccnc ZINC03815736
cccc(c)-cc-cccnc ZINC03832064
cccc(c)-cc-cccnc ZINC03815693
   ...

but that's pretty boring when using the the default comparison. The
"--compare elements" option (which ignores bond types) gives a bigger
MCS, and with a bit of post-processing I'll list the unique fragment
SMILES matched by that MCS.

% fmcs.py sample_files/p38_clustered_3D_MM_9.sdf --compare elements --output-format fragment-smiles \
       --output-all | awk '{print $1}' | sort -u
cc-c(cc-c)c-ccccnc
cc-cc(-cc)ccCCC[NH2+]C
cccc(cc)-cc-cccnc

You can easily see that the elements match but that bond types were ignored.


Separately specify --atom-compare and --bond-compare options
---------------------------------------------------------------

The "--compare" option is a shortcut for common, chemically reasonable
comparisons. You can specify the atom and bond comparison separately
using "--atom-compare" and "--bond-compare", which means you can do
some pretty strange (or interesting!) MCS searches.

Here I explicitly use the default of "--atom-compare elements":

% fmcs --atom-compare elements sample_files/na_clustered_3D_MM_1.sdf 
[#6]-[#6]-[#7]-[#6] 4 atoms 3 bonds (complete search)

Or, here I use "--atom-compare any" to see subgraphs with the same
bond pattern.

% fmcs --atom-compare any sample_files/na_clustered_3D_MM_1.sdf
[*]-[*](-[*]-[*])=[*] 5 atoms 4 bonds (complete search)

With the right output options I can even see what the SMARTS pattern matches:

% fmcs --atom-compare any --output-all --output-format fragment-smiles sample_files/na_clustered_3D_MM_1.sdf 
cNC(=O)C ZINC03581099
cNC(=O)C ZINC03581100
CNC(=O)C ZINC03833958
CNC(=O)O ZINC03581810
CNC(=O)C ZINC04134481
cNC(=O)C ZINC03833955
cNC(=O)C ZINC03833968
cNC(=O)C ZINC03833957
cNC(=O)C ZINC03833960
C=C(C)C[O-] ZINC04134482
CNC(N)=O ZINC03833959
C=C(C)C[O-] ZINC04134483
  ...



Generating SD files and SD tags
-------------------------------

So far I've shown examples of generates SMILES output files. You can
get the result as an SD file in one of two forms, either a "fragment-sdf":

% python fmcs.py sample_files/na_clustered_3D_MM_1.sdf --output-format fragment-sdf
ZINC03581099
     RDKit          

  4  3  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  2  3  1  0
  3  4  1  0
M  END
> <Cluster>
1

> <MODEL.CCRATIO>
1

> <MODEL.SOURCE>
CORINA 3.44 0027  09.01.2008

> <b_mmffld_Minimization_Converged-OPLS_2005>
1

> <id>
ZINC03581099

> <r_mmffld_Potential_Energy-OPLS_2005>
-34.241

> <r_mmffld_RMS_Derivative-OPLS_2005>
6.13019e-05

> <source>
na_clustered_3D_MM.sdf

$$$$


or as a "complete-sdf". On its own, "complete-sdf" is pretty
useless. It's the same structure as the input, except that it's been
processed by RDKit. What you likely want with this option is to put
the match information into an SD tag field. Here I'll omit a lot of
the SD record, and focuse on the new tag fields.

% fmcs sample_files/na_clustered_3D_MM_1.sdf --output-format complete-sdf \
    --save-counts-tag MCS_COUNTS --save-smarts-tag MCS_SMARTS --save-smiles-tag MCS_SMILES \
    --save-atom-indices-tag MCS_ATOMS

 ...
M  END
> <Cluster>
1

> <MCS_ATOMS>
0 1 3 4

> <MCS_COUNTS>
1 4 3

> <MCS_SMARTS>
[#6]-[#6]-[#7]-[#6]

> <MCS_SMILES>
CC(=O)Nc1c(N)cc(C([O-])=O)cc1

> <MODEL.CCRATIO>
1
 ...


The "atom indices" tag is the list of atom indices which made up the
MCS. In this case, the carbon at index 0 is the first atom of the MCS,
the carbon at index 1 is the second, the nitrogen at index 3 is the
third, and the carbon at index 4 is the fourth atom of the MCS.

The "counts" tag contains the number of fragments, number of atoms,
and number of bonds which made up the MCS. With this version of fmcs
the number of fragments is either 0 or 1, but future versions may
support disconnected MCSes.

The "smarts" and "smiles" tags contain the fragment SMARTS and SMILES.


Save to a file instead of stdout
--------------------------------

Use "--output" followed by a filename to save the results to a file,
rather than to stdout.

% fmcs --compare topology --output-all --output mcs_output.sdf \
   --output-format complete-sdf --save-smiles-tag fmcs_smiles sample_files/p38_clustered_3D_MM_9.sdf

% grep --after 1 fmcs_smiles mcs_output.sdf | head -9
> <fmcs_smiles>
c1c(-c2ccccc2)[nH]c(-c2cnccc2)c1-c1ccncc1
--
> <fmcs_smiles>
Fc1ccc(-c2[nH]c(-c3ccccc3)nc2-c2ccncc2)cc1
--
> <fmcs_smiles>
Nc1ccc(-c2cc(-c3ccncc3)c(-c3ccccc3)[nH]2)cc1
--


Modify the bond comparisons
---------------------------

The "--ring-matches-ring-only" option modifies the bond comparison so
that ring bonds only match ring bonds and chain bonds only match chain
bond. (A "chain bond" is one which isn't in a ring.)

Even that might be too generous; if a bond is in a ring in the
original structure, and it's part of the MCS, then you might want that
bond to still be in a ring of the MCS. For that case, use the
--complete-rings-only option. (This implies --ring-matches-ring-only.)

I don't have a good dataset to show why you would want to use one of
these options, in part because most of the data sets contain similar
structures and --ring-matches-ring-only doesn't give a different
result. I can show you the difference --complete-rings-only makes:

% fmcs --output-format fragment-smiles sample_files/pdgfrb_clustered_3D_MM_4.sdf
ccc(c)-c1cc(c)cnc1 ZINC03832246

% fmcs --output-format fragment-smiles --complete-rings-only sample_files/pdgfrb_clustered_3D_MM_4.sdf
c-c1cccnc1 ZINC03832246

You can see how the first includes a partial ring (the "ccc(c)") and a
complete ring, while the second only includes the complete
ring. 

This second output also shows an interesting nuance. The
--ring-matches-ring-only option requires the ring bonds only match
ring bonds, but it says nothing about the atoms. In this case there
are two rings, with a single bond connecting them. The atoms on either
side of the single bond are in rings, but the bond itself is not.

A stricter version of this might require that ring atoms also only
match ring atoms, but then "C1CCCCC1O" and "CO" would not find the MCS
of "CO."


Times and timeouts
------------------

The fmcs algorithm finds progressively larger common structures in its
search to find the MCS. Searches can take a very long time to
complete. Use the "--timeout" option to tell fmcs to stop after a
certain number of seconds and report the largest structure found up to
that point.

% fmcs --output-format fragment-smiles --times tests/lengthy.smi 
CCN(C(=O)OC)C(c1ccc(O)cc1)c1ccc(OC)cc1 CHEMBL311765
Total time 6.18 seconds: load 0.00 fragment 0.00 select 0.00 enumerate 6.18 (MCS found after 0.53)

I used the "--times" option, which displays to stderr how long
different aspects of the program took. I described the format of this
line earlier when I talked about the --verbose flag. The --verbose
flag automatically toggles the --times option.

Here it says that the complete search finished after 6.18 seconds, but
the MCS was found in only 0.53 seconds. This means most of the time is
spent proving that the current best common substructure really is the
maximum common substructure. (And it tells me, the developer, that I
ought to implement some of the other search space pruning methods
which are possible.)

My observation is that for lengthy searches like this, either the MCS
is found quickly, and most of the time is spent verifying that it's
really the MCS, or that the quickly-found common substructure is close
to the MCS is size. If you're optimistic, you can set the --timeout to
stop after a certain time with the current best match.


% fmcs --timeout 1 --output-format fragment-smiles --times tests/lengthy.smi
CCN(C(=O)OC)C(c1ccc(O)cc1)c1ccc(OC)cc1 CHEMBL311765
Total time 1.01 seconds: load 0.00 fragment 0.00 select 0.00 enumerate 1.00


Verbosity
---------

If you want to see more progress information during process - I know I
don't like seeing a terminal with no output - you can use the
"--verbose" flag twice. Since "--verbose --verbose" is rather lengthy,
you might use the single letter version "-v -v", or the even more
opaque "-vv" version.


% fmcs -vv tests/lengthy.smi
Loaded 2 structures from tests/lengthy.smi     
Best after 0.0s: 2 atoms 1 bonds [#6]-[#8]
Best after 0.0s: 3 atoms 2 bonds [#6]-[#8]-[#6]
Best after 0.0s: 4 atoms 3 bonds [#6]-[#8]-[#6]:[#6]
Best after 0.0s: 5 atoms 4 bonds [#6]-[#8]-[#6](:[#6]):[#6]
Best after 0.0s: 6 atoms 5 bonds [#6]-[#8]-[#6](:[#6]:[#6]):[#6]
  ....
Best after 0.2s: 21 atoms 23 bonds [#6]-[#8]-[#6](-[#7]-1-[#6]-[#6]-[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:2-[#6]-1-[#6]:1:[#6]:[#6]:[#6](:[#6]:[#6]:1)-[#8])=[#8]
Best after 0.2s: 22 atoms 24 bonds [#6]-[#8]-[#6](-[#7]-1-[#6]-[#6]-[#6]:2:[#6]:[#6](:[#6]:[#6]:[#6]:2-[#6]-1-[#6]:1:[#6]:[#6]:[#6](:[#6]:[#6]:1)-[#8])-[#8])=[#8]
Best after 0.5s: 23 atoms 24 bonds [#6]-[#8]-[#6]:1:[#6]:[#6]:[#6](-[#6](-[#6]:2:[#6]:[#6]:[#6](:[#6]:[#6]:2)-[#8])-[#7](-[#6]-[#6])-[#6](-[#8]-[#6])=[#8]):[#6]:[#6]:1
1493 tests of 968 unique SMARTS, cache: 206 True 319 False, search: 268 True 700 False (968 substructure tests)
  465 subgraphs enumerated, 407 processed
2899 tests of 1819 unique SMARTS, cache: 408 True 672 False, search: 491 True 1328 False (1819 substructure tests)
   ....
9012 tests of 5714 unique SMARTS, cache: 1070 True 2228 False, search: 1512 True 4202 False (5714 substructure tests)
  2573 subgraphs enumerated, 2573 processed
[#6]-[#8]-[#6]:1:[#6]:[#6]:[#6](-[#6](-[#6]:2:[#6]:[#6]:[#6](:[#6]:[#6]:2)-[#8])-[#7](-[#6]-[#6])-[#6](-[#8]-[#6])=[#8]):[#6]:[#6]:1 23 atoms 24 bonds (complete search)
Total time 6.20 seconds: load 0.00 fragment 0.00 select 0.00 enumerate 6.19 (MCS found after 0.54)


It writes information to stderr every time a better substructure is
found, and gives progress statistics every second. I don't know if
it's generally meaningful to others, but I like the warm fuzzy feeling
of seeing occasional progress information. If a program gives no no
output for a long period of time then I wonder if it's frozen for some
reason.


Isotopes and atom classes in SMILES
-----------------------------------

If you looked carefully at the --atom-compare options you'll see that
you can compare atoms by isotope number.

That's a very strange ability, yes?

It's available because in SMARTS the isotope number is essentially the
only parameter which doesn't affect the chemistry, and because
(nearly) every cheminformatics toolkit lets you use an relatively
arbitrary integer in that field. For example, RDKit easily lets you
have isotope weights from 1 to 1000. (Note that the isotope value "0"
has the special meaning of "unspecified/natural isotopic abundance".)

Suppose you want halogens to be considered identical, and all other
atoms to be compared based on the element number, and you have a
SMILES file which looks like this:

C1CCC(I)CC1 arom1
C1C(Cl)CC(F)CC1 arom2
C1CC(F)C(F)CC1 arom3
C1CC(Br)C(I)C(C)C1 arom4
C1CCC(I)C(F)C1 arom5
C1C(Cl)C(F)C(C)C(I)C1 arom6

(This file is in the distribution under sample_files/smsd_arom.smi )

First you need to set the isotope number. Here's a program to set the
halogens to class 9 (that is, all halogens have isotope 9) and set
everything else to an isotope equal to its atomic number. Remember,
we're treating the molecular graph as a graph, and not a molecule!


################# halogen_classes.py
import sys
from rdkit import Chem

input_filename = sys.argv[1]
output_filename = sys.argv[2]
if not input_filename.endswith(".smi"):
    raise SystemExit("input filename must be a SMILES file with extension '.smi'")
if not output_filename.endswith(".smi"):
    raise SystemExit("output SMILES filename must end with the extension '.smi'")

writer = Chem.SmilesWriter(output_filename, includeHeader=False, isomericSmiles=True)
for mol in Chem.SmilesMolSupplier(input_filename):
  for atom in mol.GetAtoms():
    if atom.GetAtomicNum() in (9, 17, 35, 53, 85):
      atom.SetMass(9)
    else:
      atom.SetMass(atom.GetAtomicNum())
  writer.write(mol)
writer.close()
#################


When I run this program I get (also available as test/smsd_arom_isotopes.smi):

% python halogen_classes.py sample_files/smsd_arom.smi test.smi
% cat test.smi 
[9F][6CH]1[6CH2][6CH2][6CH2][6CH]([9Cl])[6CH2]1 arom2
[9F][6CH]1[6CH2][6CH2][6CH2][6CH2][6CH]1[9F] arom3
[6CH3][6CH]1[6CH2][6CH2][6CH2][6CH]([9Br])[6CH]1[9I] arom4
[9F][6CH]1[6CH2][6CH2][6CH2][6CH2][6CH]1[9I] arom5
[6CH3][6CH]1[6CH]([9F])[6CH]([9Cl])[6CH2][6CH2][6CH]1[9I] arom6


And here you see that it made a difference.

% fmcs sample_files/smsd_arom.smi
[#6]-1-[#6]-[#6]-[#6]-[#6]-[#6]-1 6 atoms 6 bonds (complete search)

% fmcs --atom-compare isotopes test.smi
[9*]-[6*]-1-[6*]-[6*]-[6*]-[6*]-[6*]-1 7 atoms 7 bonds (complete search)

And what exactly did it match?

% fmcs --atom-compare isotopes --output-format fragment-smiles --output-all test.smi
FC1C[CH]CCC1 arom2
FC1[CH]CCCC1 arom3
BrC1[CH][CH]CCC1 arom4
FC1[CH]CCCC1 arom5
FC1[CH][CH]CC[CH]1 arom6


Isotopes and atom classes in SD files
-------------------------------------


The SMILES format is flexible enough to be mistreated this way. The SD
format is not. Its isotope range is much more limited, which makes
this option more difficult to use directly.

Instead, specify the atom classes via one of the SD tag data
fields. I'll use the same SMILES data set as before but save the
output to an SD file.


################# halogen_tag.py
import sys
from rdkit import Chem

input_filename = sys.argv[1]
output_filename = sys.argv[2]
if not input_filename.endswith(".smi"):
    raise SystemExit("input filename must be a SMILES file with extension '.smi'")
if not output_filename.endswith(".sdf"):
    raise SystemExit("output SD filename must end with the extension '.sdf'")

writer = Chem.SDWriter(output_filename)
for mol in Chem.SmilesMolSupplier(input_filename):
  atom_classes = []
  for atom in mol.GetAtoms():
    if atom.GetAtomicNum() in (9, 17, 35, 53, 85):
      atom_classes.append("9")
    else:
      atom_classes.append(str(atom.GetAtomicNum()))
  atom_class_data = " ".join(atom_classes)
  mol.SetProp("atom_classes", atom_class_data)
  writer.write(mol)
writer.close()
#################

Here's the conversion step.

% python halogen_tag.py sample_files/smsd_arom.smi test.sdf

The tag data is a space separated list of atom classes, as integers
between 1 and 1000.

% tail test.sdf 
  6  7  1  0
  6  8  1  0
  8  9  1  0
  8 10  1  0
 10  1  1  0
M  END
>  <atom_classes>  (5) 
6 6 9 6 9 6 6 6 9 6

$$$$

You can read the "6 6 9 6 9 6 6 6 9 6" as "carbon, carbon, halogen,
carbon, halogen, ...". Then run fmcs to get the SMARTS, where the
isotope numbers match the atom classes.

% fmcs --atom-class-tag atom_classes test.sdf
[9*]-[6*]-1-[6*]-[6*]-[6*]-[6*]-[6*]-1 7 atoms 7 bonds (complete search)

Since you're using an SD file as input, you probably want an SD file
as output, rather than this strange isotope SMARTS format. You can get
the SD file output where the record contains the MCS as a fragment
(remember to use "--output-all" to get all of the structures, and
"--output" to specify a filename rather than stdout):

% fmcs --atom-class-tag atom_classes --output-format fragment-sdf test.sdf
arom2
     RDKit          

  7  7  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0
  3  2  1  0
  4  3  1  0
  5  4  1  0
  6  5  1  0
  7  6  1  0
  2  7  1  0
M  END
> <atom_classes>
9 6 6 6 6 6 6

$$$$


Notice how the atom class tag ("atom_classes" for this case) was
modified so it contains the correct list of atom classes for the
fragment.

Or, use "--output-format complete-sdf" to get the full SD structure. 


% fmcs --output-format complete-sdf test.sdf
usage: fmcs [-h] [--maximize {atoms,bonds}] [--min-num-atoms INT]
            [--compare {topology,elements,types}]
            [--atom-compare {any,elements,isotopes}]
            [--bond-compare {any,bondtypes}] [--atom-class-tag TAG]
            [--ring-matches-ring-only] [--complete-rings-only]
            [--select SELECT] [--timeout SECONDS] [--output FILENAME]
            [--output-format {smarts,fragment-smiles,fragment-sdf,complete-sdf}]
            [--output-all-structures] [--save-atom-class-tag TAG]
            [--save-counts-tag TAG] [--save-atom-indices-tag TAG]
            [--save-smarts-tag TAG] [--save-smiles-tag TAG] [--times] [-v]
            [--version]
            filename
fmcs: error: Using --output-format complete-sdf is useless without at least one of --save-atom-indices-tag, --save-smarts-tag, --save-smiles-tag, or --save-counts-tag


Oops! There's no point in doing an MCS search if you don't want to
know the results of the search. I don't know how you want the results
stored, and I don't yet know a good default, so I have to require you
to specify what you want.

I want to know which atoms are in the MCS, as a list of indices, so
I'll use the "--save-atom-indices-tag":

% fmcs --atom-class-tag atom_classes --output-format complete-sdf --save-atom-indices "mcs indices" test.sdf
arom2
     RDKit          3D

  8  8  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 F   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  2  3  1  0
  2  4  1  0
  4  5  1  0
  5  6  1  0
  5  7  1  0
  7  8  1  0
  8  1  1  0
M  END
> <atom_classes>
6 6 9 6 6 9 6 6

> <mcs indices>
2 1 0 7 6 4 3

$$$$


Let's see, so atom 2 (a halogen .. yep, chlorine) is in the MCS, and
atoms 1, 0, 7, 6, 4, and 3, all carbons, are also in the MCS. Yep,
that's what I got before!

Again, remember to use "--output-all" to get all of the structures and
not just one.


Select a subset of the input structures
---------------------------------------

The --select option lets you which select of the input structure to
use, instead of using all of the compounds in the data file. For
example, to use the first 25 records of the benzotriazoles:

% fmcs --select 1-25 sample_files/benzotriazole.sdf
[#6]-[#6]-[#6]-[#7]-[#6](=[#8])-[#6]-[#7]:1:[#7]:[#7]:[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:2:1 16 atoms 17 bonds (complete search)

This is a lot larger than the MCS of the entire data set; that being 9
atoms and 10 bonds.

This option was added to help figure out which structure didn't have
the expected MCS. I did a binary search to exclude ranges until I
found the one which was wrong.

However, that's a laborious way to do it. I don't think this option
will continue for the future. Also, a limitation is the implementation
is that it parses all of the structures, even if the structure is
going to be ignored. Eg, using the last 10 structures of a file takes
to parse as using all the structures in a file.
Tip: Filter by directory path e.g. /media app.js to search for public/media/app.js.
Tip: Use camelCasing e.g. ProjME to search for ProjectModifiedEvent.java.
Tip: Filter by extension type e.g. /repo .js to search for all .js files in the /repo directory.
Tip: Separate your search with spaces e.g. /ssh pom.xml to search for src/ssh/pom.xml.
Tip: Use ↑ and ↓ arrow keys to navigate and return to view the file.
Tip: You can also navigate files with Ctrl+j (next) and Ctrl+k (previous) and view the file with Ctrl+o.
Tip: You can also navigate files with Alt+j (next) and Alt+k (previous) and view the file with Alt+o.