Anonymous committed 44cc21e

rtftok01: add some rtftok documentation

Files changed (1)

writerfilter/source/rtftok/README

+
+                        rtftok: the RTF tokenizer
+
+I. Design
+
+The tokenizer consists of four parts:
+
+1. Lexer (using flex)
+
+This reads from an input source (RTFInputSource interface), and produces
+a low-level stream of tokens via the RTFScannerHandler interface.
+
+The lexer itself uses the flex "reentrant" option, because the "default"
+C flex lexers are infested with global variables, and the "C++" lexers
+have annoying namespace problems (IIRC it never compiled on some platform).
+This means that a newer flex is required (2.5.4 is too old; at least 
+2.5.33 is required).
+
+2. RTFScannerHandlerImpl
+
+This class implements the RTFScannerHandler interface, and handles string
+buffering and the RTF destination/group stacks.
+The incoming events are either handled here, or passed on to the top of
+the group or destination stack.
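Review note: the dispatch described in part 2 (buffer text, then hand each event to whatever sits on top of the stack) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual RTFScannerHandlerImpl: `SketchDestination`, `SketchHandler`, and all of their member functions are hypothetical names invented here, and the real class also maintains a group stack that this sketch leaves out.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an RTFDestination subclass: it only records the
// events it receives, so the dispatch order can be inspected.
struct SketchDestination {
    std::string name;
    std::vector<std::string> seen;
    explicit SketchDestination(std::string n) : name(std::move(n)) {}
    void control(const std::string& word) { seen.push_back("ctrl:" + word); }
    void text(const std::string& chars) { seen.push_back("text:" + chars); }
};

// Hypothetical stand-in for the handler: buffers characters and forwards
// events to the destination on top of the stack.
class SketchHandler {
public:
    SketchHandler() {
        stack_.push_back(std::make_shared<SketchDestination>("body"));
    }

    // Entering a new destination: pending text belongs to the old one.
    void pushDestination(const std::string& name) {
        flush();
        stack_.push_back(std::make_shared<SketchDestination>(name));
    }

    // Leaving a destination: flush its pending text, then remove it.
    std::shared_ptr<SketchDestination> popDestination() {
        assert(stack_.size() > 1 && "the outermost destination is never popped");
        flush();
        auto top = stack_.back();
        stack_.pop_back();
        return top;
    }

    // String buffering: characters are collected until the next event.
    void addChar(char c) { buffer_ += c; }

    // A control word first flushes buffered text, then goes to the top.
    void control(const std::string& word) {
        flush();
        stack_.back()->control(word);
    }

    void flush() {
        if (!buffer_.empty()) {
            stack_.back()->text(buffer_);
            buffer_.clear();
        }
    }

    const SketchDestination& top() const { return *stack_.back(); }

private:
    std::vector<std::shared_ptr<SketchDestination>> stack_;
    std::string buffer_;
};
```

In this sketch, text fed through `addChar` reaches a destination as a single `text` event, and a destination pushed later receives only the events that arrive while it is on top of the stack.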
+
+3. RTFDestination and RTFGroup
+
+RTFDestination is a base class with various subclasses, each of which
+provides handling for everything that can occur in a particular RTF
+destination.
+The RTFDestination is responsible for pushing the received events to
+the domain mapper, which is the generic part of the import that is used
+for all formats (DOC, DOCX, RTF).
+
+RTFGroup is basically a container for all sorts of attributes and SPRMs.
+
+4. (or perhaps 0.) RTFDocumentImpl
+
+This is wrapped around everything and bootstraps the import.
+
+
+II. Build
+
+Building the RTF tokenizer is somewhat complex; a lot of stuff is generated.
+
+1. makemodel.sh: rtfmodel.xml, tokentoelement.xsl
+
+First, the source of the generated stuff is the RTF spec in DOCX format.
+Because this cannot be put into the repository, the result of the first
+generation step is checked in.
+
+The main result is rtfmodel.xml, which contains all of the RTF keywords,
+plus all the EBNF rules from the spec (currently the EBNF isn't used).
+Thus the rtfmodel.xml contains only facts about the RTF format.
+
+Also, the file tokentoelement.xsl is generated in this step; it contains
+a categorization helper function to tell the different kinds of
+RTF controls apart.
+
+If something is missing in the rtfmodel.xml, it is probably sufficient
+to add a simple description of another table at the end of spectobnf.xsl.
+
+2. various other generated files
+
+During the build, various other files are generated from rtfmodel.xml.
+These are not checked into the repository.
+
+* an XML file containing all RTF controls is generated via rtftoken.xsl.
+
+* a list of int tokens for RTF controls is generated via rtftokenheader.xsl.
+
+* for efficient mapping of RTF controls to int tokens, a hash function
+  is generated using gperf, via rtfgperf.xsl.
+
+* a TokenToId function is generated (mapping from int tokens to strings, for
+  debugging etc.), via rtftokentoid.xsl.
+
+* a function that maps RTF tokens to RTFDestination calls is generated
+  from rtfactions.xml via rtfcontrols.xsl.
+
+* Also, of course, the flex lexer is generated from its input file,
+  RTFScanner.lex.
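Review note: to make the relationship between the generated token list, the keyword lookup, and the TokenToId function more concrete, here is a hand-written sketch of what those pieces do together. Everything in it is an assumption made for illustration: the enum values and the names `SketchToken`, `sketchLookup`, and `sketchTokenToId` are hypothetical, and a `std::unordered_map` is swapped in for the gperf-generated hash function, so this is not what the build actually emits.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical int tokens; the real list is generated from rtfmodel.xml.
enum SketchToken {
    TOKEN_UNKNOWN = 0,
    TOKEN_PAR,
    TOKEN_B,
    TOKEN_FONTTBL
};

// RTF control keyword -> int token. A std::unordered_map stands in here for
// the hash function that the build generates with gperf.
inline int sketchLookup(const std::string& keyword) {
    static const std::unordered_map<std::string, int> table = {
        {"par", TOKEN_PAR},
        {"b", TOKEN_B},
        {"fonttbl", TOKEN_FONTTBL},
    };
    const auto it = table.find(keyword);
    return it == table.end() ? TOKEN_UNKNOWN : it->second;
}

// Int token -> string, useful for debugging output.
inline const char* sketchTokenToId(int token) {
    switch (token) {
        case TOKEN_PAR:     return "par";
        case TOKEN_B:       return "b";
        case TOKEN_FONTTBL: return "fonttbl";
        default:            return "(unknown)";
    }
}
```

The two functions are inverses for every known keyword, which is what makes the string form usable in debug traces: a token produced by the lookup can always be printed back as the control word it came from.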
+
+3. the plain C++ files
+
+Of course there are some plain C++ files that are just compiled as well :)
+