tulipac: The TuLiPA compiler

TuLiPA is the Tübingen Parsing Architecture. It is an excellent parser for teaching tree-adjoining grammar (TAG), except that it expects TAG grammars in an XML format that is really complicated to write by hand.

Tulipac, the TuLiPA grammar compiler, fixes this problem by providing a front-end that compiles TAG grammars in a human-readable format to the XML format assumed by TuLiPA.

Tulipac grammar format

A tulipac grammar consists of the following types of elements, which may be interleaved in any order:

  • Declarations of elementary trees, beginning with the tree keyword
  • Declarations of tree families, beginning with the family keyword
  • Declarations of words and lemmas, beginning with the word and lemma keywords
  • Include statements, beginning with the #include keyword

Anything that follows two slashes (//) up to the end of the line is considered a comment, as in Java or C++ programs.

Identifiers and other strings in the grammar must start with an alphabetical ASCII character, i.e. A-Z or a-z. All further characters in the identifier can be A-Z, a-z, or 0-9; the examples on this page also use underscores in unquoted identifiers (e.g. vinf_tv). If you would like to include whitespace, special characters, or non-ASCII characters (e.g. umlauts) in your identifier, enclose it in single quotes ('like this') or double quotes ("like this").
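
As a small illustration of the comment and quoting rules, here is a sketch of two lexicon entries. It borrows the abbreviated word declaration and the vinf_tv family that are introduced further down this page; the word form 'füttert' is only an example and is not taken from the sample grammar.

```
// Everything from the two slashes to the end of the line is ignored.
// Plain ASCII identifier: quotes are optional.
word jagt: <vinf_tv>[tense=present]
// The umlaut is not allowed in a bare identifier, so the word form is quoted.
word 'füttert': <vinf_tv>[tense=present]
```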

Trees

A tree declaration looks like this:

tree trans:
  s {
    np![case=nom][]
    vp {
      v+
      np![case=acc][]
    }
  }

It starts with the tree keyword and the tree name ("trans"), followed by a colon. Then the tree structure is described. Trees consist of nodes; the example tree has five nodes, with syntactic categories s, np, vp, v, and np. Nodes can be of different types:

  • ! = substitution node
  • * = foot node
  • + = lexical anchor
  • all other nodes are standard nodes

You can add a null adjunction constraint to a standard node by adding @NA after the node name; this means that no auxiliary trees may be adjoined at this node.

tree foo:
  s {
    s @NA {
      a+
      s*
    }
  }

Each node may be followed by one or two feature structures of the form [ft1=val1, ft2=val2, ...]. The first feature structure is the top FS of the node; the second feature structure is the bottom FS. If a node is followed by only one FS, this is assumed to be the top FS, and the bottom FS is assumed to be empty. If a node is not annotated with any FSs, both FSs are assumed to be empty.

The values val1, val2, etc. in a feature structure may either be identifiers (such as plural) or variables (such as ?case).
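
The following sketch shows the annotation variants side by side. The tree name fsdemo and the feature num are made up for this example. The sketch also assumes that a feature structure can follow the + marker in the same way it follows ! in the example above, and that using the variable ?n in two places ties the two values together, as is usual in feature-based TAG; the text above only states that variables are allowed as values.

```
tree fsdemo:
  s {
    // top FS with two features; the bottom FS is given explicitly as empty
    np![case=nom, num=?n][]
    vp {
      // only one FS: it is the top FS, and the bottom FS is empty
      v+[num=?n]
      np![case=acc][]
    }
  }
```

The node vp carries no feature structure at all, so both its top and its bottom FS are empty.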

Tree Families

A tree family groups a set of trees together into a logical unit. This is so a lexicon entry can assign a word to a whole tree family at once, avoiding the need to specify all the trees for each word again and again. Tulipac assumes that all trees in the same family have lexical anchors of the same syntactic category.

Define a tree family by simply listing the trees in the family:

family vinf_tv: { vinf_tv, vinf_tv_aux }

If you do not assign an elementary tree to a family, a tree family with the same name as the elementary tree is automatically generated for it.

In the current version of tulipac, a tree can belong only to a single family. The Alto-Tulipac parser does not have this restriction.
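
To illustrate the automatically generated family, here is a sketch with a tree that is not listed in any family declaration. The tree name intrans and the word form are hypothetical; the word declaration uses the abbreviated form described under Lexicon Entries.

```
tree intrans:
  s {
    np![case=nom][]
    vp {
      v+
    }
  }

// No family declaration mentions intrans, so a family named intrans
// is generated for it and can be referenced in angle brackets.
word 'schläft': <intrans>[tense=present]
```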

Lexicon Entries

You can define word forms and assign them to lemmas and elementary trees using the lemma and word keywords, as follows:

lemma 'schnell': aux_adj [foo=bar] {
  word 'schnelle': [case=nom]
  word 'schnellen': [case=acc]
  word demoword
}

This declares a lemma "schnell" for the elementary tree aux_adj, and declares the words "schnelle" and "schnellen" as inflected word forms of this lemma. For all of these word forms, the top feature structure of the lexical anchor node is unified with [foo=bar]. This feature structure is further unified with [case=nom] for "schnelle" and with [case=acc] for "schnellen". Note that word forms and feature structures are separated by a colon. The word form "demoword" does not add a feature structure and is therefore not followed by a colon.

Note that if your lemmas or word forms contain umlauts or other non-ASCII characters, you must enclose them with (single or double) quotes because tulipac does not accept non-ASCII characters in identifiers.

If you wish to simply use the word form itself as the lemma, you can use the following abbreviated word declaration (at top level, i.e. not as part of a 'lemma' declaration):

word 'jagt': <vinf_tv>[tense=present]

This declares the word form "jagt" for the lemma "jagt". Note the <vinf_tv> in angle brackets: this declares lexicon entries for "jagt" for all elementary trees in the tree family vinf_tv. You can also use tree families in this way in lemma declarations.
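
A lemma declaration that refers to a tree family might look like the following sketch. It assumes that the family reference in angle brackets simply takes the place of the tree name; the lemma, the word forms, and the feature mood are made up for illustration.

```
lemma 'jagen': <vinf_tv> [mood=ind] {
  word 'jagt': [tense=present]
  word 'jagte': [tense=past]
}
```

Under that assumption, both word forms receive lexicon entries for every tree in vinf_tv, each with [mood=ind] unified with its own tense feature.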

Include Statements

It is sometimes convenient to split your grammar into several files, e.g. one file for the elementary trees and one file for the lexicon. You can use the #include keyword to combine these files:

#include "lexicon.tag"

Tulipac will treat this as if you had copied the contents of "lexicon.tag" into your grammar file.
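
For example, a main grammar file could consist of nothing but include statements. The file names grammar.tag and trees.tag are hypothetical, and this sketch assumes that all three files sit in the same directory.

```
// grammar.tag: main file that only pulls in the other parts
#include "trees.tag"
#include "lexicon.tag"
```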

Putting it all together

You can find a simple complete grammar to get you started in the source code repository. Try parsing the sentences "der hund jagt den schnellen hasen" (which is grammatical) and "der hund jagt der schnelle hase" (which fails because of case agreement).

Usage with Alto-Tulipac

The Alto parser now has native support for TAG grammars in tulipac format. Download a current version of Alto, and then run it as follows:

java -cp <alto.jar> de.up.ling.irtg.script.TulipacParser <grammarname>

where <alto.jar> is the filename of your Alto jarfile, and <grammarname> is the filename of your tulipac grammar.

One limitation of Alto-Tulipac is that the start symbol (at the root of the entire derived tree) must be S (uppercase). If you get unexpected parsing errors, check whether you have used lowercase s instead.
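
For instance, the transitive tree from the Trees section would need an uppercase root to serve as a start tree in Alto-Tulipac. This is a sketch of that one change; everything else is as in the earlier example.

```
tree trans:
  S {
    np![case=nom][]
    vp {
      v+
      np![case=acc][]
    }
  }
```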

Usage with TuLiPA

Tulipac is distributed as a Jar file, which you can obtain from the Downloads page. Assuming that you have a recent version of Java installed on your computer, you can run tulipac as follows:

  java -jar tulipac-1.1.jar <inputfile>

Here <inputfile> is the name of the file that contains your TAG grammar. Tulipac then writes several TuLiPA grammar files to the same directory that contains your input file. If your grammar file was called "foo.tag", then these output files will be called as follows:

  • foo-g.xml - for the "Grammar" field in TuLiPA
  • foo-l.xml - for the "Lemmas" field in TuLiPA
  • foo-m.xml - for the "Morphological entries" field in TuLiPA

You can load them into TuLiPA, enter the start symbol of your TAG grammar under "Axiom" and the sentence you wish to parse under "Sentence", and hit the "Parse" button.

Further steps

I hope you will find tulipac helpful. If you have any further questions regarding tulipac, please do not hesitate to get in touch.
