feat-morph (LAW)
feat-morph is a tool for manual morphological annotation of corpora.
- Main Window (Word List) - for navigating through words of the document (browsing, filtering, searching, sorting, etc.)
- Da Panel - for displaying and disambiguating morphological information (lemmas, tags) of a word. The panel consists of two windows - a filter box (2a) and a list of items (2b). The list of items displays all the lemma-tag items associated with the current word(s) (selected in the main window). The filter box makes it possible to restrict the items to a particular group, e.g., items with a particular lemma, pos or gender.
- Context View - displays the text of the document with current word(s) highlighted.
Note: Our plan is to eventually incorporate this tool into feat as a plugin for morphological annotation.
License
The code is published under the MIT License. License text
See this page for a synopsis. In short: You can do whatever you want as long as you include the original copyright. Also, we are not responsible for anything.
We use several libraries with their own licenses:
- commons-io - Apache 2 license
- glazedlists - LGPL License and MPL License
- guava - Apache 2 license
- jdom - Apache style license
Installation
- Make sure you have Java Runtime Environment version 8 (aka 1.8) installed. You can use this online test to check.
- Download the latest feat version from the Downloads section of this site.
- Unpack the file to a directory of your choice.
- Start the application:
  - If using MS Windows: Run feat_vert/feat.bat (you can right-click the file and send it as a shortcut to the desktop).
  - If using Linux: Run feat_vert/bin/feat_vert
- After installation, check for updates (Help > Check for Updates). Note: The updates are unsigned; you can ignore the warnings about that.
Input format
Important: The file's extension has to be "vert" and the encoding is hardcoded to be UTF-8. (Yes, it should be user-configurable; see org.purl.jh.law.data.io.VertReader.processLines).
The native format of this tool is the so-called vertikala, an SGML format.
Each token is on a separate line, followed by its lemmas; each lemma is followed by its tags. Each lemma is preceded by a tab and each tag is preceded by a space. Each token line must be within a sentence (see below).
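To make the token-line layout concrete, here is a minimal Python sketch that splits one line into its token, lemmas and tags. It is an illustration only, not the tool's own reader (that is the Java class mentioned above); the names `Reading` and `parse_token_line` and the tags `TAG1`, `TAG2`, `TAG3` are made up for this example.

```python
from typing import List, NamedTuple, Tuple


class Reading(NamedTuple):
    """One lemma together with the tags listed after it (hypothetical name)."""
    lemma: str
    tags: List[str]


def parse_token_line(line: str) -> Tuple[str, List[Reading]]:
    """Split 'token<TAB>lemma tag tag<TAB>lemma tag' into its parts."""
    # The token comes first; every later tab-separated field is one lemma.
    token, *lemma_fields = line.rstrip("\n").split("\t")
    readings = []
    for field in lemma_fields:
        # Within a lemma field, the lemma comes first and each tag
        # is preceded by a single space.
        lemma, *tags = field.split(" ")
        readings.append(Reading(lemma, tags))
    return token, readings


# Placeholder tags, not real tagset values.
token, readings = parse_token_line("saw\tsee TAG1\tsaw TAG2 TAG3")
```

Here `token` is `"saw"` and `readings` holds two items: lemma `see` with one tag and lemma `saw` with two tags.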
Besides token lines, the file can contain SGML tags. As in XML, each opening tag should be paired with a closing tag. Unlike in XML, tags do not have to be properly nested, so a sequence of overlapping tags is OK. Currently, we give a special meaning to the following tags:
- <p> - paragraph
- <s> - sentence (each sentence must be within a paragraph)
- ignoring tags (whatever is within these tags is not offered for disambiguation):
  - <h> - meta information
  - <str> - page info
  - <e> - foreign text
  - <o> - corrected text
If a paragraph or sentence is missing, it is automatically inserted.
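The effect of the ignoring tags can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the tool's actual logic: it assumes every SGML tag sits on its own line, it does not model the automatic insertion of missing paragraphs or sentences, and the function name `tokens_for_disambiguation`, the sample lines and the tags `TAG1`, `TAG2` are invented for the example.

```python
import re

# Tags whose content is not offered for disambiguation (from the list above).
IGNORING_TAGS = {"h", "str", "e", "o"}

# Assumption: a tag line holds exactly one opening or closing tag.
TAG_RE = re.compile(r"^<(/?)([A-Za-z]+)[^>]*>$")


def tokens_for_disambiguation(lines):
    """Return the tokens that lie outside all ignoring tags."""
    open_ignoring = 0  # number of ignoring tags currently open
    result = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = TAG_RE.match(line.strip())
        if match:
            is_closing = match.group(1) == "/"
            name = match.group(2)
            if name in IGNORING_TAGS:
                # A counter (not a stack) is enough, because tags
                # do not have to be properly nested.
                open_ignoring += -1 if is_closing else 1
                open_ignoring = max(open_ignoring, 0)
            continue
        if line and open_ignoring == 0:
            # Token lines start with the token, followed by tab-separated lemmas.
            result.append(line.split("\t", 1)[0])
    return result


sample = [
    "<p>",
    "<s>",
    "<h>",
    "Title\tTitle TAG1",
    "</h>",
    "saw\tsee TAG1\tsaw TAG2",
    "</s>",
    "</p>",
]
tokens = tokens_for_disambiguation(sample)
```

With this sample, `tokens` contains only `saw`, because `Title` lies inside `<h>`.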
TODO add a sample
Support
Development of this application has been supported by:
- Grant LM2011023 (Czech National Corpus) from the Czech Ministry of Education
The first version was based on feat, a tool for layered error annotation of learner corpora.