Expert knowledge about spelling and subcorpora is critical for users

Issue #17 new
Promme Bosken created an issue

Users of the FLC need an informal document instructing them on how to interpret search results for subcorpora. For example, a lemma search for Modern Frisian is useless since Modern Frisian has not been lemmatised, so users will not find all the spelling variants of a lemma. Users must be made aware of this.

They must also know that such a lemma search is useful for Middle Frisian. They must be made aware of this difference between subcorpora regarding the interpretation of search results.

For corpora which have not been lemmatised (19th century Frisian, Modern Frisian) it is useful to have an overview of the spelling changes in Frisian. Then a search for all spelling variants can be performed by individual searches for each variant.
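The variant-by-variant search described above could look roughly like the sketch below. It is only an illustration and not part of the FLC: the in-memory corpus, the functions `search_token` and `search_all_variants`, and the word forms are all hypothetical placeholders. A real variant list would have to come from the spelling overview.

```python
# Hypothetical stand-in for a time slice that has not been lemmatised:
# each hit is (document id, token position, token). The word forms are
# placeholders, not real Frisian spellings.
Hit = tuple[str, int, str]


def search_token(corpus: list[Hit], form: str) -> list[Hit]:
    """Return every hit whose token equals `form`, ignoring case."""
    wanted = form.lower()
    return [hit for hit in corpus if hit[2].lower() == wanted]


def search_all_variants(corpus: list[Hit], variants: list[str]) -> list[Hit]:
    """Run one search per spelling variant and merge the hits.

    Hits are returned variant by variant; a hit that was already found
    (for example because the variant list contains a duplicate) is
    reported only once.
    """
    merged: list[Hit] = []
    seen: set[Hit] = set()
    for form in variants:
        for hit in search_token(corpus, form):
            if hit not in seen:
                seen.add(hit)
                merged.append(hit)
    return merged


corpus = [
    ("doc1", 0, "spelling_a"),
    ("doc1", 1, "other"),
    ("doc2", 0, "spelling_b"),
    ("doc3", 0, "Spelling_A"),
]
hits = search_all_variants(corpus, ["spelling_a", "spelling_b"])
```

Here `hits` contains the three occurrences of the two placeholder spellings and leaves out the unrelated token. Without the list of variants, a single search would have found only some of them, which is exactly the risk users need to be warned about.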

So it would be nice if the main entry page of the FLC featured a button leading to a text file, visible to all users and modifiable by a few of FA’s linguists. This text file could contain some informal tricks and warnings on how to use the subcorpora, given their present shape. The text file can be updated every two years. It can be a file of just a couple of pages. It need not be exhaustive, but it must make users aware of both the possibilities and the limitations of each of the subcorpora.

The old KWIC of the Modern Frisian Corpus contained part of this information: it included an overview of the three main spelling conventions of Frisian between roughly 1800 and 2000.

Note that I use the term subcorpora, but I should really say: results from various time slices. So results from 1600-1800 are from Middle Frisian, and they receive a different interpretation from results from 1800-2000, because of the difference with respect to lemmatisation. We now simulate a unity which is simply not there, because of vast differences in annotation and structure between ‘subcorpora’.

(Eric Hoekstra)

Comments (1)

  1. Fryske Akademy repo owner

    In addition to explaining, queries could fail with a message such as “queried material not (entirely) lemmatized”. Alternatively, depending on which material is selected, query terms could be disabled.

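The two options from the comment above (failing with a message, or disabling the lemma query term) could be sketched as follows. This is a minimal illustration under stated assumptions, not the FLC implementation: the table `LEMMATISED`, the exception `NotLemmatisedError` and both function names are hypothetical. Only the lemmatisation status of each time slice and the wording of the message are taken from this issue.

```python
# Lemmatisation status per time slice, as described in this issue.
# The table and all names in this sketch are hypothetical.
LEMMATISED = {
    "Middle Frisian": True,
    "19th century Frisian": False,
    "Modern Frisian": False,
}


class NotLemmatisedError(ValueError):
    """Raised when a lemma query covers material that is not lemmatised."""


def check_lemma_query(selected: list[str]) -> None:
    """Option 1: let the query fail with an explanatory message."""
    # Unknown time slices are treated as not lemmatised, to be safe.
    missing = [name for name in selected if not LEMMATISED.get(name, False)]
    if missing:
        raise NotLemmatisedError(
            "queried material not (entirely) lemmatized: " + ", ".join(missing)
        )


def lemma_search_enabled(selected: list[str]) -> bool:
    """Option 2: report whether the lemma query term should be offered."""
    return bool(selected) and all(LEMMATISED.get(name, False) for name in selected)


try:
    check_lemma_query(["Middle Frisian", "Modern Frisian"])
    message = ""
except NotLemmatisedError as error:
    message = str(error)
```

With this selection, `message` names Modern Frisian as the material that is not lemmatised. A user interface could call `lemma_search_enabled` instead and grey out the lemma field whenever it returns `False`.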