Java XML Content Model Validator
What is this repository for?
This component was originally written as a content model validator for a XML Schema parser.
It is conceptually similar to a regular expression matcher, with a few specific features such as accepting an empty element or requiring a deterministic finite state machine.
Building and running the project
The maven project file pom.xml can be opened in Eclipse for compilation and testing. To run on its own, all is required is a recent jvm.
It is a standalone component and doesn't require any configuration.
It doesn't depend on any special libraries.
How to run tests
The project comes with a main function, allowing it to be be run from the command line for testing and it takes several arguments:
- to set the level of verbosity (for example to display information about the input expression, syntax tree or state machine)
- a content model in regular expression like format or RPN (Reverse Polish Notation) format
- any number of strings to be matched against the content model.
The implementation of this module is based on a couple of papers: Unambiguity of Extended Regular Expressions in SGML Grammars and especially Regular Expressions into Finite Automata both written by Anne Brüggemann-Klein.
The XML Content Model Validator (CMV) is a standalone module I wrote several years ago as part of a larger Java based XML framework.
Compared to the original project, I updated the current version to use generics when possible and cleaned up the code here and there, but haven't changed the overall structure or logic. This implementation reflects my level of experience and the language features available when it was originally written, and if I had to write it from scratch today I would implement parts of it differently.
For those new to some of these concepts, here are the very basic facts - DTDs (Document Type Definition) and XML Schemas (XSD) define a number of constraints that an XML or XSD document must follow to be considered valid relative to that DTD or Schema.
An XML or XSD document has a hierarchical structure formed of elements, and each element may contain sub-elements, as well as have a number of attributes. The content model is a description of the valid structure of these elements - what and how many sub-elements or attributes they can have, and in which order these sub-elements must appear (the order is not relevant for attributes). For a detailed description of XML Schema content models see [[http://www.w3.org/TR/xmlschema-0/]] and [[http://www.w3.org/TR/xmlschema-1/]]. What's relevant in this context though is that content models are similar in concept and can be described in a regular expression like language. In fact "content model" and "regular expression" will be used interchangeably throughout this document.
The content model regular expressions have specific characteristics compared to ordinary regular expressions. For example, they must be deterministic (see below for a short explanation of what that means). Also, they do not need all the features that a general purpose regular expression usually provide. This module was developed with these factors in mind, but it could be extended to support more features if needed.
The CMV was written in Java but could be easily ported to other languages, due to its modular design.
The module contains the following main logical sub-components:
- an expression expression parser (accepts input in both conventional and Reverse Polish Notation)
- syntax tree
- state machine
- a syntax tree builder
Here is how it works from a conceptual standpoint
The two inputs are the content model (regular expression) and the actual instance structure that needs to be matched against the content model.
- the regular expression is passed to the CMV
- the CMV parses it and generates a syntax tree which is a hierarchical representation of the regular expression, with nodes being operators and leaves being operands.
- the syntax tree is compiled into a deterministic finite state machine (or deterministic finite automaton - DFA).
- the string representing the instance structure as passed to the state machine which is then run. This step will return a true/false result indicating a match or mismatch between the initial regular expression and instance structure. In the case of a "no match" result, more information can be extracted from the state machine to determine what exactly caused the failure.
If the regular expression is complex, the state machine generation can take longer, but a state machine has to be created just once and can then be cached or even serialized (feature not implemented) for faster processing later on.
The validation part is very efficient, as generally all state machine based algorithms are.
Integrating the module with other projects is relatively simple - see the ContentModel.main() function in ContentModel.java for a detailed example.
To use the module, the content model must be represented by an expression containing several operators:
| - OR , - AND ? - 0 or 1 occurrence * - 0 or more occurrences + - 1 or more occurrences [m,n] - between m and n occurrences <empty> - not an operator, used to represent the empty element () - grouping of subexpressions
Note: general purpose regular expression usually don't have an explicit operator AND - simple concatenation achieves the same result. In our case concatenation is used to represent whole words, and the "," (comma) operator is used as AND operator. Words in the instance string (the one that must be validated) must be separated by "," (comma).
A few examples of valid expressions that can be used to describe unambiguous content models
apple? (apple|pear)+ apple?,pear apple?,pear?,cherry? (apple),(apple*,pear)* (apple,pear)*|(melon,banana)|(cherry,peach) (apple[2,3],pear[2,3])[5,6] ((apple?,pear?,cherry)|(mellon*,banana?)+|peach|(apricot,lettuce) - just making sure you're paying attention
and these are examples of ambiguous regular expressions which will trigger runtime exceptions:
a*|(a,b) (a,b)|(a,c) (a,b)*|(a,c) a,(b,a)*,(b|<empty>) a|(a,b) (a|b)*,a,b,b
Command line usage
The ContentModel.main is the entry point when using the module from the command line. Be sure to set the correct environment (classpath for instance) before running it.
The command line accepts the following arguments:
java com.amichel.contentmodel.ContentModel [-?] [-help] [-s] [-e] [-t] [-r] "regular_expression" "string_to_validate_1" "string_to_validate_2"... -help, -? displays the help message -s dumps the state machine (nodes and transitions) -e dumps the content model expression -t dumps the syntax tree -a dump all (equivalent to -s -e -t) -r uses the reverse polish notation for the regular expression, where character "^" represents the "pop" operation
Here are a few typical commands:
java com.amichel.contentmodel.ContentModel "(aa|bb),cc" "aa,cc" java com.amichel.contentmodel.ContentModel -s "(aa|bb),cc" "aa,cc" java com.amichel.contentmodel.ContentModel -a "(aa|bb),cc" "aa,bb,cc" "aa,cc" aa,dd"
Ambiguous vs. unambiguous regular expressions
By design XML only accepts unambiguous content models, and this implementation will also accept only unambiguous regular expressions as input. An exception will be thrown if a content model is ambiguous. An ambiguous regular expression will generate a state machine with at lest one state with 2 ore more exit transitions triggered by the same input symbol (or word). In that case it is obviously not possible to determine the valid transition (without looking ahead in the string).
General purpose regular expressions accept ambiguous expressions as they usually look ahead in case of ambiguity to determine the transition that matches.
The component is available for download as a zip file for the source code and Visual Slick Edit project, as well as a jar file containing the generated class files.
You can email me at firstname.lastname@example.org if you have any feedback or questions.