PEP: 258
Title: Docutils Design Specification
Version: $Revision$
Last-Modified: $Date$
Author: David Goodger <goodger@python.org>
Discussions-To: <doc-sig@python.org>
Status: Rejected
Type: Standards Track
Content-Type: text/x-rst
Requires: 256, 257
Created: 31-May-2001
Post-History: 13-Jun-2001


================
Rejection Notice
================

While this may serve as an interesting design document for the
now-independent docutils, it is no longer slated for inclusion in the
standard library.


==========
 Abstract
==========

This PEP documents design issues and implementation details for
Docutils, a Python Docstring Processing System (DPS).  The rationale
and high-level concepts of a DPS are documented in PEP 256, "Docstring
Processing System Framework" [#PEP-256]_.  Also see PEP 256 for a
"Road Map to the Docstring PEPs".

Docutils is being designed modularly so that any of its components can
be replaced easily.  In addition, Docutils is not limited to the
processing of Python docstrings; it processes standalone documents as
well, in several contexts.

No changes to the core Python language are required by this PEP.  Its
deliverables consist of a package for the standard library and its
documentation.


===============
 Specification
===============

Docutils Project Model
======================

Project components and data flow::

                     +---------------------------+
                     |        Docutils:          |
                     | docutils.core.Publisher,  |
                     | docutils.core.publish_*() |
                     +---------------------------+
                      /            |            \
                     /             |             \
            1,3,5   /        6     |              \ 7
           +--------+       +-------------+       +--------+
           | READER | ----> | TRANSFORMER | ====> | WRITER |
           +--------+       +-------------+       +--------+
            /     \\                                  |
           /       \\                                 |
     2    /      4  \\                             8  |
    +-------+   +--------+                        +--------+
    | INPUT |   | PARSER |                        | OUTPUT |
    +-------+   +--------+                        +--------+

The numbers above each component indicate the path a document's data
takes.  Double-width lines between Reader & Parser and between
Transformer & Writer indicate that data sent along these paths should
be standard (pure & unextended) Docutils doc trees.  Single-width
lines signify that internal tree extensions or completely unrelated
representations are possible, but they must be supported at both ends.


Publisher
---------

The ``docutils.core`` module contains a "Publisher" facade class and
several convenience functions: "publish_cmdline()" (for command-line
front ends), "publish_file()" (for programmatic use with file-like
I/O), and "publish_string()" (for programmatic use with string I/O).
The Publisher class encapsulates the high-level logic of a Docutils
system and has overall responsibility for processing, which is
controlled by the ``Publisher.publish()`` method:

1. Set up internal settings (may include config files & command-line
   options) and I/O objects.

2. Call the Reader object to read data from the source Input object
   and parse the data with the Parser object.  A document object is
   returned.

3. Set up and apply transforms via the Transformer object attached to
   the document.

4. Call the Writer object which translates the document to the final
   output format and writes the formatted data to the destination
   Output object.  Depending on the Output object, the output may be
   returned from the Writer, and then from the ``publish()`` method.

Calling one of the ``publish_*`` convenience functions (or
instantiating a "Publisher" object) with component names will result
in default behavior.  For custom behavior (customizing component
settings), create custom component objects first, and pass *them* to
the Publisher or ``publish_*``
convenience functions.
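
The four steps of ``publish()`` can be sketched with toy stand-ins.
This is only an illustration of the control flow: every class below
(``ToyPublisher``, ``ToyReader``, ``ToyParser``, ``ToyWriter`` and the
two I/O classes) is hypothetical, and the dictionary used as a
"document" is an assumption, not the Docutils document tree.

```python
# Toy stand-ins for the components named above; none of these classes
# are the real Docutils implementations.

class StringInput:
    def __init__(self, source):
        self.source = source

    def read(self):
        return self.source


class StringOutput:
    def write(self, data):
        self.destination = data
        return data


class ToyParser:
    def parse(self, text, document):
        # One "paragraph" per non-empty line.
        document["children"] = [
            line.strip() for line in text.splitlines() if line.strip()
        ]


class ToyReader:
    def __init__(self, parser):
        self.parser = parser

    def read(self, source):
        document = {"children": [], "transforms": []}
        self.parser.parse(source.read(), document)
        return document


class ToyWriter:
    def write(self, document, destination):
        output = "\n".join("<p>%s</p>" % child for child in document["children"])
        return destination.write(output)


class ToyPublisher:
    def __init__(self, reader, writer):
        self.reader = reader
        self.writer = writer

    def publish(self, text):
        # 1. Set up I/O objects (settings handling is omitted here).
        source, destination = StringInput(text), StringOutput()
        # 2. Read and parse; a document object is returned.
        document = self.reader.read(source)
        # 3. Apply the transforms attached to the document.
        for transform in document["transforms"]:
            transform(document)
        # 4. Translate and write; the output is returned to the caller.
        return self.writer.write(document, destination)


publisher = ToyPublisher(ToyReader(ToyParser()), ToyWriter())
result = publisher.publish("Hello\n\nworld\n")
print(result)
```

Passing ready-made component objects to the constructor mirrors the
"custom behavior" case described above; looking components up by name
would be a thin layer on top of this.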


Readers
-------

Readers understand the input context (where the data is coming from),
send the whole input or discrete "chunks" to the parser, and provide
the context to bind the chunks together back into a cohesive whole.

Each reader is a module or package exporting a "Reader" class with a
"read" method.  The base "Reader" class can be found in the
``docutils/readers/__init__.py`` module.

Most Readers will have to be told what parser to use.  So far (see the
list of examples below), only the Python Source Reader ("PySource";
still incomplete) will be able to determine the parser on its own.

Responsibilities:

* Get input text from the source I/O.

* Pass the input text to the parser, along with a fresh `document
  tree`_ root.

Examples:

* Standalone (Raw/Plain): Just read a text file and process it.
  The reader needs to be told which parser to use.

  The "Standalone Reader" has been implemented in module
  ``docutils.readers.standalone``.

* Python Source: See `Python Source Reader`_ below.  This Reader is
  currently in development in the Docutils sandbox.

* Email: RFC-822 headers, quoted excerpts, signatures, MIME parts.

* PEP: RFC-822 headers, "PEP xxxx" and "RFC xxxx" conversion to URIs.
  The "PEP Reader" has been implemented in module
  ``docutils.readers.pep``; see PEP 287 and PEP 12.

* Wiki: Global reference lookups of "wiki links" incorporated into
  transforms.  (CamelCase only or unrestricted?)  Lazy
  indentation?

* Web Page: As standalone, but recognize meta fields as meta tags.
  Support for templates of some sort?  (After ``<body>``, before
  ``</body>``?)

* FAQ: Structured "question & answer(s)" constructs.

* Compound document: Merge chapters into a book.  Master manifest
  file?


Parsers
-------

Parsers analyze their input and produce a Docutils `document tree`_.
They don't know or care anything about the source or destination of
the data.

Each input parser is a module or package exporting a "Parser" class
with a "parse" method.  The base "Parser" class can be found in the
``docutils/parsers/__init__.py`` module.

Responsibilities: Given raw input text and a doctree root node,
populate the doctree by parsing the input text.

Example: The only parser implemented so far is for the
reStructuredText markup.  It is implemented in the
``docutils/parsers/rst/`` package.

The development and integration of other parsers is possible and
encouraged.


.. _transforms:

Transformer
-----------

The Transformer class, in ``docutils/transforms/__init__.py``, stores
transforms and applies them to documents.  A transformer object is
attached to every new document tree.  The Publisher_ calls
``Transformer.apply_transforms()`` to apply all stored transforms to
the document tree.  Transforms change the document tree from one form
to another, add to the tree, or prune it.  Transforms resolve
references and footnote numbers, process interpreted text, and do
other context-sensitive processing.

Some transforms are specific to components (Readers, Parsers, Writers,
Input, Output).  Standard component-specific transforms are specified
in the ``default_transforms`` attribute of component classes.  After
the Reader has finished processing, the Publisher_ calls
``Transformer.populate_from_components()`` with a list of components
and all default transforms are stored.

Each transform is a class in a module in the ``docutils/transforms/``
package, a subclass of ``docutils.transforms.Transform``.  Transform
classes each have a ``default_priority`` attribute which is used by
the Transformer to apply transforms in order (low to high).  The
default priority can be overridden when adding transforms to the
Transformer object.
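
Priority ordering, including the per-call override, can be sketched as
follows.  This is a minimal model under stated assumptions: the class
and method names echo the text, but the bodies, the
``add_transform()`` signature, and the two example transforms are
hypothetical, and a plain list stands in for the document tree.

```python
class Transform:
    """Hypothetical stand-in for a transform base class."""
    default_priority = 500

    def apply(self, document):
        raise NotImplementedError


class ResolveReferences(Transform):
    default_priority = 700

    def apply(self, document):
        document.append("references resolved")


class BuildContents(Transform):
    default_priority = 300

    def apply(self, document):
        document.append("contents built")


class Transformer:
    def __init__(self, document):
        self.document = document
        self.transforms = []  # (priority, insertion order, transform class)

    def add_transform(self, transform_class, priority=None):
        # The default priority can be overridden when adding a transform.
        if priority is None:
            priority = transform_class.default_priority
        self.transforms.append((priority, len(self.transforms), transform_class))

    def apply_transforms(self):
        # Apply in order of priority, low to high; ties keep insertion order.
        ordered = sorted(self.transforms, key=lambda item: item[:2])
        for _priority, _order, transform_class in ordered:
            transform_class().apply(self.document)
        self.transforms = []


document = []
transformer = Transformer(document)
transformer.add_transform(ResolveReferences)  # priority 700
transformer.add_transform(BuildContents)      # priority 300
transformer.apply_transforms()
print(document)

overridden = []
transformer = Transformer(overridden)
transformer.add_transform(BuildContents)
transformer.add_transform(ResolveReferences, priority=100)
transformer.apply_transforms()
print(overridden)
```

Storing the insertion order next to the priority keeps the sort stable
and avoids comparing classes when two priorities are equal.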

Transformer responsibilities:

* Apply transforms to the document tree, in priority order.

* Store a mapping of component type name ('reader', 'writer', etc.) to
  component objects.  These are used by certain transforms (such as
  "components.Filter") to determine suitability.

Transform responsibilities:

* Modify a doctree in-place, either purely transforming one structure
  into another, or adding new structures based on the doctree and/or
  external data.

Examples of transforms (in the ``docutils/transforms/`` package):

* frontmatter.DocInfo: Conversion of document metadata (bibliographic
  information).

* references.AnonymousHyperlinks: Resolution of anonymous references
  to corresponding targets.

* parts.Contents: Generates a table of contents for a document.

* document.Merger: Combining multiple populated doctrees into one.
  (Not yet implemented or fully understood.)

* document.Splitter: Splits a document into a tree-structure of
  subdocuments, perhaps by section.  It will have to transform
  references appropriately.  (Neither implemented nor remotely
  understood.)

* components.Filter: Includes or excludes elements which depend on a
  specific Docutils component.


Writers
-------

Writers produce the final output (HTML, XML, TeX, etc.).  Writers
translate the internal `document tree`_ structure into the final data
format, possibly running Writer-specific transforms_ first.

By the time the document gets to the Writer, it should be in final
form.  The Writer's job is simply (and only) to translate from the
Docutils doctree structure to the target format.  Some small
transforms may be required, but they should be local and
format-specific.

Each writer is a module or package exporting a "Writer" class with a
"write" method.  The base "Writer" class can be found in the
``docutils/writers/__init__.py`` module.

Responsibilities:

* Translate doctree(s) into specific output formats.

  - Transform references into format-native forms.

* Write the translated output to the destination I/O.

Examples:

* XML: Various forms, such as:

  - Docutils XML (an expression of the internal document tree,
    implemented as ``docutils.writers.docutils_xml``).

  - DocBook (being implemented in the Docutils sandbox).

* HTML (XHTML implemented as ``docutils.writers.html4css1``).

* PDF (a ReportLab interface is being developed in the Docutils
  sandbox).

* TeX (a LaTeX Writer is being implemented in the sandbox).

* Docutils-native pseudo-XML (implemented as
  ``docutils.writers.pseudoxml``, used for testing).

* Plain text

* reStructuredText?


Input/Output
------------

I/O classes provide a uniform API for low-level input and output.
Subclasses will exist for a variety of input/output mechanisms.
However, they can be considered an implementation detail.  Most
applications should be satisfied using one of the convenience
functions associated with the Publisher_.

I/O classes are currently in the preliminary stages; there's a lot of
work yet to be done.  Issues:

* How to represent multi-file input (files & directories) in the API?

* How to represent multi-file output?  Perhaps "Writer" variants, one
  for each output distribution type?  Or Output objects with
  associated transforms?

Responsibilities:

* Read data from the input source (Input objects) or write data to the
  output destination (Output objects).

Examples of input sources:

* A single file on disk or a stream (implemented as
  ``docutils.io.FileInput``).

* Multiple files on disk (``MultiFileInput``?).

* Python source files: modules and packages.

* Python strings, as received from a client application
  (implemented as ``docutils.io.StringInput``).

Examples of output destinations:

* A single file on disk or a stream (implemented as
  ``docutils.io.FileOutput``).

* A tree of directories and files on disk.

* A Python string, returned to a client application (implemented as
  ``docutils.io.StringOutput``).

* No output; useful for programmatic applications where only a portion
  of the normal output is to be used (implemented as
  ``docutils.io.NullOutput``).

* A single tree-shaped data structure in memory.

* Some other set of data structures in memory.


Docutils Package Structure
==========================

* Package "docutils".

  - Module "__init__.py" contains: class "Component", a base class for
    Docutils components; class "SettingsSpec", a base class for
    specifying runtime settings (used by docutils.frontend); and class
    "TransformSpec", a base class for specifying transforms.

  - Module "docutils.core" contains facade class "Publisher" and
    convenience functions.  See `Publisher`_ above.

  - Module "docutils.frontend" provides runtime settings support, for
    programmatic use and front-end tools (including configuration file
    support, and command-line argument and option processing).

  - Module "docutils.io" provides a uniform API for low-level input
    and output.  See `Input/Output`_ above.

  - Module "docutils.nodes" contains the Docutils document tree
    element class library plus tree-traversal Visitor pattern base
    classes.  See `Document Tree`_ below.

  - Module "docutils.statemachine" contains a finite state machine
    specialized for regular-expression-based text filters and parsers.
    The reStructuredText parser implementation is based on this
    module.

  - Module "docutils.urischemes" contains a mapping of known URI
    schemes ("http", "ftp", "mailto", etc.).

  - Module "docutils.utils" contains utility functions and classes,
    including a logger class ("Reporter"; see `Error Handling`_
    below).

  - Package "docutils.parsers": markup parsers_.

    - Function "get_parser_class(parser_name)" returns a parser class
      by name.  Class "Parser" is the base class of specific parsers.
      (``docutils/parsers/__init__.py``)

    - Package "docutils.parsers.rst": the reStructuredText parser.

    - Alternate markup parsers may be added.

    See `Parsers`_ above.

  - Package "docutils.readers": context-aware input readers.

    - Function "get_reader_class(reader_name)" returns a reader class
      by name or alias.  Class "Reader" is the base class of specific
      readers.  (``docutils/readers/__init__.py``)

    - Module "docutils.readers.standalone" reads independent document
      files.

    - Module "docutils.readers.pep" reads PEPs (Python Enhancement
      Proposals).

    - Readers to be added for: Python source code (structure &
      docstrings), email, FAQ, and perhaps Wiki and others.

    See `Readers`_ above.

  - Package "docutils.writers": output format writers.

    - Function "get_writer_class(writer_name)" returns a writer class
      by name.  Class "Writer" is the base class of specific writers.
      (``docutils/writers/__init__.py``)

    - Module "docutils.writers.html4css1" is a simple HyperText Markup
      Language document tree writer for HTML 4.01 and CSS1.

    - Module "docutils.writers.docutils_xml" writes the internal
      document tree in XML form.

    - Module "docutils.writers.pseudoxml" is a simple internal
      document tree writer; it writes indented pseudo-XML.

    - Writers to be added: HTML 3.2 or 4.01-loose, XML (various forms,
      such as DocBook), PDF, TeX, plaintext, reStructuredText, and
      perhaps others.

    See `Writers`_ above.

  - Package "docutils.transforms": tree transform classes.

    - Class "Transformer" stores transforms and applies them to
      document trees.  (``docutils/transforms/__init__.py``)

    - Class "Transform" is the base class of specific transforms.
      (``docutils/transforms/__init__.py``)

    - Each module contains related transform classes.

    See `Transforms`_ above.

  - Package "docutils.languages": Language modules contain
    language-dependent strings and mappings.  They are named for their
    language identifier (as defined in `Choice of Docstring Format`_
    below), converting dashes to underscores.

    - Function "get_language(language_code)" returns the matching
      language module.  (``docutils/languages/__init__.py``)

    - Modules: en.py (English), de.py (German), fr.py (French), it.py
      (Italian), sk.py (Slovak), sv.py (Swedish).

    - Other languages to be added.

* Third-party modules: "extras" directory.  These modules are
  installed only if they're not already present in the Python
  installation.

  - ``extras/optparse.py`` and ``extras/textwrap.py`` provide
    option parsing and command-line help; from Greg Ward's
    http://optik.sf.net/ project, included for convenience.

  - ``extras/roman.py`` contains Roman numeral conversion routines.


Front-End Tools
===============

The ``tools/`` directory contains several front ends for common
Docutils processing.  See `Docutils Front-End Tools`_ for details.

.. _Docutils Front-End Tools:
   http://docutils.sourceforge.net/docs/user/tools.html


Document Tree
=============

A single intermediate data structure is used internally by Docutils,
in the interfaces between components; it is defined in the
``docutils.nodes`` module.  It is not required that this data
structure be used *internally* by any of the components, just
*between* components as outlined in the diagram in the `Docutils
Project Model`_ above.

Custom node types are allowed, provided that either (a) a transform
converts them to standard Docutils nodes before they reach the Writer
proper, or (b) the custom node is explicitly supported by certain
Writers, and is wrapped in a filtered "pending" node.  An example of
condition (a) is the `Python Source Reader`_ (see below), where a
"stylist" transform converts custom nodes.  The HTML ``<meta>`` tag is
an example of condition (b); it is supported by the HTML Writer but
not by others.  The reStructuredText "meta" directive creates a
"pending" node, which contains knowledge that the embedded "meta" node
can only be handled by HTML-compatible writers.  The "pending" node is
resolved by the ``docutils.transforms.components.Filter`` transform,
which checks that the calling writer supports HTML; if it doesn't, the
"pending" node (and enclosed "meta" node) is removed from the
document.
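
Condition (b) can be illustrated with a toy filter.  The sketch below
is not the ``components.Filter`` transform itself: the ``Node`` and
``Pending`` classes, the ``required_format`` attribute, and the idea
that a supported "pending" node is replaced by the node it wraps are
all assumptions made for illustration.

```python
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


class Pending(Node):
    """Placeholder wrapping a node that only some writers can handle."""

    def __init__(self, wrapped, required_format):
        super().__init__("pending")
        self.wrapped = wrapped
        self.required_format = required_format


def filter_pending(node, supported_formats):
    """Resolve "pending" children in place, recursing into other nodes."""
    kept = []
    for child in node.children:
        if isinstance(child, Pending):
            # Keep the wrapped node only if the writer supports its format;
            # otherwise the pending node and its contents are dropped.
            if child.required_format in supported_formats:
                kept.append(child.wrapped)
        else:
            filter_pending(child, supported_formats)
            kept.append(child)
    node.children = kept


def names(node):
    return [child.name for child in node.children]


def build_document():
    return Node("document", [Pending(Node("meta"), "html"), Node("paragraph")])


html_doc = build_document()
filter_pending(html_doc, supported_formats={"html"})
print(names(html_doc))

latex_doc = build_document()
filter_pending(latex_doc, supported_formats={"latex"})
print(names(latex_doc))
```

The HTML-capable run keeps the "meta" node, while the other run loses
it together with its wrapper, matching the behavior described above.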

The document tree data structure is similar to a DOM tree, but with
specific node names (classes) instead of DOM's generic nodes. The
schema is documented in an XML DTD (eXtensible Markup Language
Document Type Definition), which comes in two parts:

* the Docutils Generic DTD, docutils.dtd_, and

* the OASIS Exchange Table Model, soextbl.dtd_.

The DTD defines a rich set of elements, suitable for many input and
output formats.  The DTD retains all information necessary to
reconstruct the original input text, or a reasonable facsimile
thereof.

See `The Docutils Document Tree`_ for details (incomplete).


Error Handling
==============

When the parser encounters an error in markup, it inserts a system
message (DTD element "system_message").  There are five levels of
system messages:

* Level-0, "DEBUG": an internal reporting issue.  There is no effect
  on the processing.  Level-0 system messages are handled separately
  from the others.

* Level-1, "INFO": a minor issue that can be ignored.  There is little
  or no effect on the processing.  Typically level-1 system messages
  are not reported.

* Level-2, "WARNING": an issue that should be addressed.  If ignored,
  there may be minor problems with the output.  Typically level-2
  system messages are reported but do not halt processing.

* Level-3, "ERROR": a major issue that should be addressed.  If
  ignored, the output will contain unpredictable errors.  Typically
  level-3 system messages are reported but do not halt processing.

* Level-4, "SEVERE": a critical error that must be addressed.
  Typically level-4 system messages are turned into exceptions which
  halt processing.  If ignored, the output will contain severe errors.

Although the initial message levels were devised independently, they
have a strong correspondence to `VMS error condition severity
levels`_; the names in quotes for levels 1 through 4 were borrowed
from VMS.  Error handling has since been influenced by the `log4j
project`_.
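
The typical handling of levels 1 through 4 can be modeled with two
thresholds.  This is a sketch, not the "Reporter" class from
``docutils.utils``: ``ToyReporter``, ``SystemMessageError``, and the
``report_level`` and ``halt_level`` parameter names are assumptions,
and the separate handling of level-0 messages is not modeled.

```python
LEVEL_NAMES = ("DEBUG", "INFO", "WARNING", "ERROR", "SEVERE")


class SystemMessageError(Exception):
    """Raised when a message reaches the halt threshold."""


class ToyReporter:
    def __init__(self, report_level=2, halt_level=4):
        self.report_level = report_level  # lowest level that gets reported
        self.halt_level = halt_level      # lowest level that halts processing
        self.reported = []

    def system_message(self, level, text):
        message = "%s/%d: %s" % (LEVEL_NAMES[level], level, text)
        if level >= self.halt_level:
            # Severe messages become exceptions.
            raise SystemMessageError(message)
        if level >= self.report_level:
            self.reported.append(message)
        return message


reporter = ToyReporter()
reporter.system_message(1, "blank line missing")        # below report_level
reporter.system_message(2, "title underline too short")
reporter.system_message(3, "unknown directive")
try:
    reporter.system_message(4, "unexpected section title")
    halted = False
except SystemMessageError:
    halted = True

print(reporter.reported, halted)
```

Keeping the thresholds as constructor arguments lets a caller treat
warnings as fatal, or silence everything below errors, without
changing the code that emits the messages.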


Python Source Reader
====================

The Python Source Reader ("PySource") is the Docutils component that
reads Python source files, extracts docstrings in context, then
parses, links, and assembles the docstrings into a cohesive whole.  It
is a major and non-trivial component, currently under experimental
development in the Docutils sandbox.  High-level design issues are
presented here.


Processing Model
----------------

This model will evolve over time, incorporating experience and
discoveries.

1. The PySource Reader uses an Input class to read in Python packages
   and modules, into a tree of strings.

2. The Python modules are parsed, converting the tree of strings into
   a tree of abstract syntax trees with docstring nodes.

3. The abstract syntax trees are converted into an internal
   representation of the packages/modules.  Docstrings are extracted,
   as well as code structure details.  See `AST Mining`_ below.
   Namespaces are constructed for lookup in step 6.

4. One at a time, the docstrings are parsed, producing standard
   Docutils doctrees.

5. PySource assembles all the individual docstrings' doctrees into a
   Python-specific custom Docutils tree paralleling the
   package/module/class structure; this is a custom Reader-specific
   internal representation (see the `Docutils Python Source DTD`_).
   Namespaces must be merged: Python identifiers, hyperlink targets.

6. Cross-references from docstrings (interpreted text) to Python
   identifiers are resolved according to the Python namespace lookup
   rules.  See `Identifier Cross-References`_ below.

7. A "Stylist" transform is applied to the custom doctree (by the
   Transformer_), custom nodes are rendered using standard nodes as
   primitives, and a standard document tree is emitted.  See `Stylist
   Transforms`_ below.

8. Other transforms are applied to the standard doctree by the
   Transformer_.

9. The standard doctree is sent to a Writer, which translates the
   document into a concrete format (HTML, PDF, etc.).

10. The Writer uses an Output class to write the resulting data to its
    destination (disk file, directories and files, etc.).


AST Mining
----------

Abstract Syntax Tree mining code will be written (or adapted) that
scans a parsed Python module, and returns an ordered tree containing
the names, docstrings (including attribute and additional docstrings;
see below), and additional info (in parentheses below) of all of the
following objects:

* packages
* modules
* module attributes (+ initial values)
* classes (+ inheritance)
* class attributes (+ initial values)
* instance attributes (+ initial values)
* methods (+ parameters & defaults)
* functions (+ parameters & defaults)

(Extract comments too?  For example, comments at the start of a module
would be a good place for bibliographic field lists.)

In order to evaluate interpreted text cross-references, namespaces for
each of the above will also be required.

See the python-dev/docstring-develop thread "AST mining", started on
2001-08-14.
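
A rough idea of such mining code, written against the standard
library ``ast`` module rather than the tools available when this
design was drafted, is sketched below.  The ``mine()`` and
``describe_parameters()`` helpers and the dictionary layout are
hypothetical; ``ast.unparse()`` needs Python 3.9 or later.  Instance
attributes, packages, and namespaces are left out.

```python
import ast

SOURCE = '''
"""Module docstring."""

LIMIT = 10

class Base:
    pass

class Widget(Base):
    """A widget."""

    def resize(self, width, height=1):
        """Resize the widget."""

def helper(x, scale=2):
    """Scale x."""
'''


def describe_parameters(func):
    """List positional parameters with defaults (other kinds are ignored)."""
    args = func.args.args
    defaults = func.args.defaults
    padded = [None] * (len(args) - len(defaults)) + list(defaults)
    described = []
    for arg, default in zip(args, padded):
        if default is None:
            described.append(arg.arg)
        else:
            described.append("%s=%s" % (arg.arg, ast.unparse(default)))
    return described


def mine(node):
    """Return an ordered tree of the definitions found in node.body."""
    entries = []
    for child in node.body:
        if isinstance(child, ast.ClassDef):
            detail = [ast.unparse(base) for base in child.bases]
            entries.append({"kind": "class", "name": child.name,
                            "detail": detail, "doc": ast.get_docstring(child),
                            "members": mine(child)})
        elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
            entries.append({"kind": "function", "name": child.name,
                            "detail": describe_parameters(child),
                            "doc": ast.get_docstring(child), "members": []})
        elif (isinstance(child, ast.Assign) and len(child.targets) == 1
              and isinstance(child.targets[0], ast.Name)):
            # Attribute with its initial value, kept as source text.
            entries.append({"kind": "attribute", "name": child.targets[0].id,
                            "detail": ast.unparse(child.value),
                            "doc": None, "members": []})
    return entries


tree = ast.parse(SOURCE)
summary = mine(tree)
for entry in summary:
    print(entry["kind"], entry["name"], entry["detail"])
```

Because the module is parsed and never imported, the order of
definitions is preserved and no code from the module is executed.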


Docstring Extraction Rules
--------------------------

1. What to examine:

   a) If the "``__all__``" variable is present in the module being
      documented, only identifiers listed in "``__all__``" are
      examined for docstrings.

   b) In the absence of "``__all__``", all identifiers are examined,
      except those whose names are private (names begin with "_" but
      don't begin and end with "__").

   c) 1a and 1b can be overridden by runtime settings.

2. Where:

   Docstrings are string literal expressions, and are recognized in
   the following places within Python modules:

   a) At the beginning of a module, function definition, class
      definition, or method definition, after any comments.  This is
      the standard for Python ``__doc__`` attributes.

   b) Immediately following a simple assignment at the top level of a
      module, class definition, or ``__init__`` method definition,
      after any comments.  See `Attribute Docstrings`_ below.

   c) Additional string literals found immediately after the
      docstrings in (a) and (b) will be recognized, extracted, and
      concatenated.  See `Additional Docstrings`_ below.

   d) @@@ 2.2-style "properties" with attribute docstrings?  Wait for
      syntax?

3. How:

   Whenever possible, Python modules should be parsed by Docutils, not
   imported.  There are several reasons:

   - Importing untrusted code is inherently insecure.

   - Information from the source is lost when using introspection to
     examine an imported module, such as comments and the order of
     definitions.

   - Docstrings are to be recognized in places where the byte-code
     compiler ignores string literal expressions (2b and 2c above),
     meaning importing the module will lose these docstrings.

   Of course, standard Python parsing tools such as the "parser"
   library module should be used.

   When the Python source code for a module is not available
   (i.e. only the ``.pyc`` file exists) or for C extension modules,
   the module must be imported to access its docstrings, and the
   limitations described above must be accepted.

Since attribute docstrings and additional docstrings are ignored by
the Python byte-code compiler, no namespace pollution or runtime bloat
will result from their use.  They are not assigned to ``__doc__`` or
to any other attribute.  The initial parsing of a module may take a
slight performance hit.
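
Rules 1a and 1b can be sketched with the standard library ``ast``
module.  The ``names_to_examine()`` and ``is_private()`` helpers are
hypothetical names, only top-level definitions and simple assignments
are considered, and the runtime-settings override of rule 1c is not
modeled.

```python
import ast


def is_private(name):
    """True for names that begin with "_" but are not "__dunder__" names."""
    dunder = name.startswith("__") and name.endswith("__")
    return name.startswith("_") and not dunder


def names_to_examine(source):
    """Return the top-level identifiers to examine, in definition order."""
    tree = ast.parse(source)  # parsed, never imported
    defined = []
    exported = None
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.append(node.name)
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if not isinstance(target, ast.Name):
                    continue
                if target.id == "__all__":
                    # Assumes __all__ is a literal list or tuple of strings.
                    exported = list(ast.literal_eval(node.value))
                else:
                    defined.append(target.id)
    if exported is not None:
        # Rule 1a: only identifiers listed in __all__ are examined.
        return [name for name in defined if name in exported]
    # Rule 1b: everything except private names.
    return [name for name in defined if not is_private(name)]


WITH_ALL = (
    '__all__ = ["public", "_special"]\n'
    'def public(): pass\n'
    'def other(): pass\n'
    'def _special(): pass\n'
)
WITHOUT_ALL = (
    'def public(): pass\n'
    'def _hidden(): pass\n'
    'def __dunder__(): pass\n'
    'x = 1\n'
)

print(names_to_examine(WITH_ALL))
print(names_to_examine(WITHOUT_ALL))
```

Note that a private name is still examined when ``__all__`` lists it
explicitly, since rule 1a takes precedence over rule 1b.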


Attribute Docstrings
''''''''''''''''''''

(This is a simplified version of PEP 224 [#PEP-224]_.)

A string literal immediately following an assignment statement is
interpreted by the docstring extraction machinery as the docstring of
the target of the assignment statement, under the following
conditions:

1. The assignment must be in one of the following contexts:

   a) At the top level of a module (i.e., not nested inside a compound
      statement such as a loop or conditional): a module attribute.

   b) At the top level of a class definition: a class attribute.

   c) At the top level of the "``__init__``" method definition of a
      class: an instance attribute.  Instance attributes assigned in
      other methods are assumed to be implementation details.  (@@@
      ``__new__`` methods?)

   d) A function attribute assignment at the top level of a module or
      class definition.

   Since each of the above contexts is at the top level (i.e., in the
   outermost suite of a definition), it may be necessary to place
   dummy assignments for attributes assigned conditionally or in a
   loop.

2. The assignment must be to a single target, not to a list or a tuple
   of targets.

3. The form of the target:

   a) For contexts 1a and 1b above, the target must be a simple
      identifier (not a dotted identifier, a subscripted expression,
      or a sliced expression).

   b) For context 1c above, the target must be of the form
      "``self.attrib``", where "``self``" matches the "``__init__``"
      method's first parameter (the instance parameter) and "attrib"
      is a simple identifier as in 3a.

   c) For context 1d above, the target must be of the form
      "``name.attrib``", where "``name``" matches an already-defined
      function or method name and "attrib" is a simple identifier as
      in 3a.

Blank lines may be used after attribute docstrings to emphasize the
connection between the assignment and the docstring.

Examples::

    g = 'module attribute (module-global variable)'
    """This is g's docstring."""

    class AClass:

        c = 'class attribute'
        """This is AClass.c's docstring."""

        def __init__(self):
            """Method __init__'s docstring."""

            self.i = 'instance attribute'
            """This is self.i's docstring."""

    def f(x):
        """Function f's docstring."""
        return x**2

    f.a = 1
    """Function attribute f.a's docstring."""


Additional Docstrings
'''''''''''''''''''''

(This idea was adapted from PEP 216 [#PEP-216]_.)

Many programmers would like to make extensive use of docstrings for
API documentation.  However, docstrings do take up space in the
running program, so some programmers are reluctant to "bloat up" their
code.  Also, not all API documentation is applicable to interactive
environments, where ``__doc__`` would be displayed.

Docutils' docstring extraction tools will concatenate all string
literal expressions which appear at the beginning of a definition or
after a simple assignment.  Only the first strings in definitions will
be available as ``__doc__``, and can be used for brief usage text
suitable for interactive sessions; subsequent string literals and all
attribute docstrings are ignored by the Python byte-code compiler and
may contain more extensive API information.

Example::

    def function(arg):
        """This is __doc__, function's docstring."""
        """
        This is an additional docstring, ignored by the byte-code
        compiler, but extracted by Docutils.
        """
        pass
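
The difference between what an extraction tool can see and what
remains at run time can be checked with a short sketch.  It uses the
standard ``ast`` module; the helper name ``leading_strings`` is
hypothetical and is not part of Docutils.

```python
import ast

SOURCE = '''
def function(arg):
    """This is __doc__, function's docstring."""
    """
    This is an additional docstring, ignored by the byte-code
    compiler, but extracted by Docutils.
    """
    pass
'''


def leading_strings(body):
    """Return the run of string literal statements that opens a suite."""
    strings = []
    for stmt in body:
        if (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant)
                and isinstance(stmt.value.value, str)):
            strings.append(stmt.value.value)
        else:
            break  # the first non-string statement ends the run
    return strings


tree = ast.parse(SOURCE)
strings = leading_strings(tree.body[0].body)

# Compile and run the same tree to see what survives as __doc__.
namespace = {}
exec(compile(tree, "<example>", "exec"), namespace)
runtime_doc = namespace["function"].__doc__
```

Here ``strings`` holds both literals, while ``runtime_doc`` equals only
the first one.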

.. topic:: Issue: ``from __future__ import``

   This would break "``from __future__ import``" statements introduced
   in Python 2.1 for multiple module docstrings (main docstring plus
   additional docstring(s)).  The Python Reference Manual specifies:
   
       A future statement must appear near the top of the module.  The
       only lines that can appear before a future statement are:
   
       * the module docstring (if any),
       * comments,
       * blank lines, and
       * other future statements.
   
   Resolution?
   
   1. Should we search for docstrings after a ``__future__``
      statement?  Very ugly.

   2. Redefine ``__future__`` statements to allow multiple preceding
      string literals?

   3. Or should we not even worry about this?  There probably
      shouldn't be ``__future__`` statements in production code, after
      all.  Perhaps modules with ``__future__`` statements will simply
      have to put up with the single-docstring limitation.
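
The conflict described in this issue can be reproduced with the
built-in ``compile()`` function.  The sketch below assumes a Python 3
interpreter, where ``division`` is still accepted as a feature name;
``compiles`` is a helper introduced only for this illustration.

```python
ONE_DOCSTRING = (
    '"""Module docstring."""\n'
    'from __future__ import division\n'
)

TWO_DOCSTRINGS = (
    '"""Module docstring."""\n'
    '"""Additional docstring."""\n'
    'from __future__ import division\n'
)


def compiles(source):
    """Return True if the source compiles, False on a SyntaxError."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


single_ok = compiles(ONE_DOCSTRING)       # future statement after one docstring
double_ok = compiles(TWO_DOCSTRINGS)      # future statement after two strings
```

The second string literal counts as an ordinary statement, so the
future statement that follows it is rejected.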


Choice of Docstring Format
--------------------------

Rather than force everyone to use a single docstring format, multiple
input formats are allowed by the processing system.  A special
variable, ``__docformat__``, may appear at the top level of a module
before any function or class definitions.  Over time or through
decree, a standard format or set of formats should emerge.

A module's ``__docformat__`` variable only applies to the objects
defined in the module's file.  In particular, the ``__docformat__``
variable in a package's ``__init__.py`` file does not apply to objects
defined in subpackages and submodules.

The ``__docformat__`` variable is a string containing the name of the
format being used: either a case-insensitive string matching the input
parser's module or package name (i.e., the same name as required to
"import" the module or package) or a registered alias.  If no
``__docformat__`` is specified, the default format is "plaintext" for
now; this may be changed to the standard format if one is ever
established.

The ``__docformat__`` string may contain an optional second field,
separated from the format name (first field) by a single space: a
case-insensitive language identifier as defined in RFC 1766.  A
typical language identifier consists of a 2-letter language code from
`ISO 639`_ (3-letter codes used only if no 2-letter code exists; RFC
1766 is currently being revised to allow 3-letter codes).  If no
language identifier is specified, the default is "en" for English.
The language identifier is passed to the parser and can be used for
language-dependent markup features.
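
A minimal sketch of how a tool might locate and split the
``__docformat__`` value follows.  The function names ``find_docformat``
and ``parse_docformat`` are hypothetical, and two details are
assumptions of this sketch rather than requirements of the text above:
both fields are normalized to lower case, and any run of whitespace is
accepted as the separator.

```python
import ast

DEFAULT_FORMAT = "plaintext"
DEFAULT_LANGUAGE = "en"


def find_docformat(source):
    """Return the module's __docformat__ string, or None if absent.

    Only top-level assignments that appear before the first function
    or class definition are considered.
    """
    for stmt in ast.parse(source).body:
        if isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)):
            break
        if (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)
                and stmt.targets[0].id == "__docformat__"
                and isinstance(stmt.value, ast.Constant)
                and isinstance(stmt.value.value, str)):
            return stmt.value.value
    return None


def parse_docformat(value):
    """Split a __docformat__ value into (format name, language)."""
    if not value or not value.strip():
        return (DEFAULT_FORMAT, DEFAULT_LANGUAGE)
    fields = value.split(None, 1)
    name = fields[0].lower()
    language = fields[1].strip().lower() if len(fields) > 1 else DEFAULT_LANGUAGE
    return (name, language)
```

For example, ``parse_docformat("restructuredtext fr")`` gives
``("restructuredtext", "fr")``, and a module with no ``__docformat__``
falls back to ``("plaintext", "en")``.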


Identifier Cross-References
---------------------------

In Python docstrings, interpreted text is used to classify and mark up
program identifiers, such as the names of variables, functions,
classes, and modules.  If the identifier alone is given, its role is
inferred implicitly according to the Python namespace lookup rules.
For functions and methods (even when dynamically assigned),
parentheses ('()') may be included::

    This function uses `another()` to do its work.

For class, instance and module attributes, dotted identifiers are used
when necessary.  For example (using reStructuredText markup)::

    class Keeper(Storer):

        """
        Extend `Storer`.  Class attribute `instances` keeps track
        of the number of `Keeper` objects instantiated.
        """

        instances = 0
        """How many `Keeper` objects are there?"""

        def __init__(self):
            """
            Extend `Storer.__init__()` to keep track of instances.

            Keep count in `Keeper.instances`, data in `self.data`.
            """
            Storer.__init__(self)
            Keeper.instances += 1

            self.data = []
            """Store data in a list, most recent last."""

        def store_data(self, data):
            """
            Extend `Storer.store_data()`; append new `data` to a
            list (in `self.data`).
            """
            self.data.append(data)

Each of the identifiers quoted with backquotes ("`") will become a
reference to the definition of that identifier.
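
As a rough illustration of the first step, the backquoted identifiers
can be collected from a docstring with a regular expression.  This is
a simplified sketch, not the reStructuredText parser: the pattern and
the name ``find_references`` are assumptions made here, and resolving
each name against the Python namespace is left out.

```python
import re

# A single backquote, a (possibly dotted) identifier, optional "()",
# and a closing single backquote.  Double backquotes (inline literals)
# are excluded by the lookbehind and lookahead.
INTERPRETED = re.compile(r"(?<!`)`([A-Za-z_][\w.]*)(\(\))?`(?!`)")


def find_references(docstring):
    """Return (identifier, is_callable) pairs in order of appearance."""
    return [(name, parens == "()")
            for name, parens in INTERPRETED.findall(docstring)]


DOCSTRING = """
Extend `Storer.store_data()`; append new `data` to a
list (in `self.data`).
"""

references = find_references(DOCSTRING)
```

``references`` then lists ``Storer.store_data`` (marked as callable),
``data`` and ``self.data``.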


Stylist Transforms
------------------

Stylist transforms are specialized transforms specific to the PySource
Reader.  The PySource Reader doesn't have to make any decisions as to
style; it just produces a logically constructed document tree, parsed
and linked, including custom node types.  Stylist transforms
understand the custom nodes created by the Reader and convert them
into standard Docutils nodes.

Multiple Stylist transforms may be implemented and one can be chosen
at runtime (through a "--style" or "--stylist" command-line option).
Each Stylist transform implements a different layout or style; thus
the name.  They decouple the context-understanding part of the Reader
from the layout-generating part of processing, resulting in a more
flexible and robust system.  This also serves to "separate style from
content", the SGML/XML ideal.

By keeping the piece of code that does the styling small and modular,
it becomes much easier for people to roll their own styles.  The
"barrier to entry" is too high with existing tools; extracting the
stylist code will lower the barrier considerably.
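
The separation can be pictured with a toy model.  Everything in the
sketch below is hypothetical: the ``Node`` class, the ``py_*`` tag
names, and the two stylist functions are stand-ins invented for
illustration and do not reflect the actual Docutils node classes or
transform API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A generic tree node: a tag name, optional text, and children."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def _parts(node):
    """Pull the name and docstring text out of a custom function node."""
    name = next(c.text for c in node.children if c.tag == "py_name")
    doc = next(c.text for c in node.children if c.tag == "py_docstring")
    return name, doc


def section_stylist(node):
    """Lay a function out as a titled section."""
    name, doc = _parts(node)
    return Node("section", children=[Node("title", name + "()"),
                                     Node("paragraph", doc)])


def definition_stylist(node):
    """Lay a function out as a definition list item."""
    name, doc = _parts(node)
    return Node("definition_list_item",
                children=[Node("term", name + "()"),
                          Node("definition", doc)])


# The style name would come from a command-line option at run time.
STYLISTS = {"section": section_stylist, "definition": definition_stylist}


def apply_stylist(tree, style):
    """Return a copy of the tree with custom nodes replaced."""
    stylist = STYLISTS[style]

    def convert(node):
        if node.tag == "py_function":
            return stylist(node)
        return Node(node.tag, node.text, [convert(c) for c in node.children])

    return convert(tree)


# What a reader might produce: structure only, no layout decisions.
TREE = Node("document", children=[
    Node("py_function", children=[Node("py_name", "f"),
                                  Node("py_docstring", "Square x.")]),
])
```

``apply_stylist(TREE, "section")`` and ``apply_stylist(TREE,
"definition")`` produce different layouts from the same input tree,
and adding a style means adding one small function to the registry.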


==========================
 References and Footnotes
==========================

.. [#PEP-256] PEP 256, Docstring Processing System Framework, Goodger
   (http://www.python.org/dev/peps/pep-0256/)

.. [#PEP-224] PEP 224, Attribute Docstrings, Lemburg
   (http://www.python.org/dev/peps/pep-0224/)

.. [#PEP-216] PEP 216, Docstring Format, Zadka
   (http://www.python.org/dev/peps/pep-0216/)

.. _docutils.dtd:
   http://docutils.sourceforge.net/docs/ref/docutils.dtd

.. _soextbl.dtd:
   http://docutils.sourceforge.net/docs/ref/soextblx.dtd

.. _The Docutils Document Tree:
   http://docutils.sourceforge.net/docs/ref/doctree.html

.. _VMS error condition severity levels:
   http://www.openvms.compaq.com:8000/73final/5841/841pro_027.html
   #error_cond_severity

.. _log4j project: http://logging.apache.org/log4j/docs/index.html

.. _Docutils Python Source DTD:
   http://docutils.sourceforge.net/docs/dev/pysource.dtd

.. _ISO 639: http://lcweb.loc.gov/standards/iso639-2/englangn.html

.. _Python Doc-SIG: http://www.python.org/sigs/doc-sig/



==================
 Project Web Site
==================

A SourceForge project has been set up for this work at
http://docutils.sourceforge.net/.


===========
 Copyright
===========

This document has been placed in the public domain.


==================
 Acknowledgements
==================

This document borrows ideas from the archives of the `Python
Doc-SIG`_.  Thanks to all members past & present.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   End: