MARMARA TURKISH COREFERENCE RESOLUTION CORPUS
This repository contains the "Marmara Turkish Coreference Resolution Corpus".
The corpus is an annotation layer on top of the "METU-Sabanci Turkish Treebank". For licensing reasons, only the coreference layer is published in this repository.
The description of the corpus and a simple Machine Learning baseline for Turkish coreference resolution can be found in the following publication:
Peter Schüller, Kübra Cingilli, Ferit Tunçer, Baris Gün Sürmeli, Aysegül Pekel, Ayse Hande Karatay, and Hacer Ezgi Karakas.
Marmara Turkish Coreference Corpus and Coreference Resolution Baseline.
Technical Report, Marmara University & TU Wien, 2018, Version 2. arXiv:1706.01863
Transforms the 1960 XML files in tb_corrected.zip from the original METU-Sabanci Turkish Treebank distribution into 34 well-formed, UTF-8-encoded XML files containing uniquely addressable sentences and words. (The original XML files in the Treebank distribution are partially not well-formed and are encoded in windows-1254.)
The script takes as input the original files in the directory tb_corrected/ of the zip file tb_corrected.zip from the METU-Sabanci Turkish Treebank distribution.
See comments in the script for further instructions.
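The re-encoding part of this conversion can be sketched as follows. This is a minimal illustration and not the repository's script: the function name reencode_to_utf8 and the sample text are assumptions made for the example, and the sketch does not repair XML that is not well-formed.

```python
def reencode_to_utf8(raw: bytes) -> bytes:
    """Decode windows-1254 bytes and re-encode them as UTF-8.

    Hypothetical helper for illustration only; it handles the character
    encoding and nothing else (no XML repair, no merging of files).
    """
    # Python exposes the windows-1254 (Turkish) code page as the "cp1254" codec.
    text = raw.decode("cp1254")
    return text.encode("utf-8")


# Made-up sample containing Turkish-specific letters.
sample = "Ayşegül ağaç ışık".encode("cp1254")
converted = reencode_to_utf8(sample)
print(converted.decode("utf-8"))
```

When whole XML files are re-encoded this way, any encoding name in the XML declaration would also have to be updated to match.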
Converts a document XML file and a coreference XML file, together with a document name, into a CoNLL file that can be used by the reference scorer.
Converts a CoNLL file in the format understood by the reference scorer into a coreference XML file.
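To illustrate what such a conversion has to produce, the sketch below builds a coreference column in the bracket notation of the CoNLL-2012 shared task. It is an assumption that the reference scorer expects exactly this notation; the function name coref_column and the (chain_id, start, end) mention tuples are hypothetical and do not reflect the repository's XML schema.

```python
from collections import defaultdict


def coref_column(num_tokens, mentions):
    """Build a CoNLL-2012 style coreference column for one sentence.

    mentions: list of (chain_id, start, end) with inclusive token indices.
    Hypothetical input format, chosen for illustration only.
    """
    cells = defaultdict(list)
    for chain_id, start, end in mentions:
        if start == end:
            # A single-token mention opens and closes on the same token.
            cells[start].append(f"({chain_id})")
        else:
            cells[start].append(f"({chain_id}")
            cells[end].append(f"{chain_id})")
    # Tokens without any mention boundary are marked with "-".
    return ["|".join(cells[i]) if i in cells else "-" for i in range(num_tokens)]


# Chain 0 spans tokens 0..1, chain 1 is the single token 3.
print(coref_column(4, [(0, 0, 1), (1, 3, 3)]))
```

Several mention boundaries on the same token are joined with "|", which is how nested or adjacent mentions are written in that notation.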
For the baseline scorers to work, you will need to initialize and update the submodule of this repository: git submodule init followed by git submodule update.
Mention detection baseline: reads an XML document from the Turkish Treebank and produces an XML document with mentions. It can create dummy chains so that the scorer provides a mention detection score.
Runs mention detection with predictmentions.py on all documents and then runs the scorer.
Takes a list of K pairs, each consisting of a document XML file and a gold mention/coreference chain XML file, a directory name for storing output, and a Python string specifying the machine learning method.
Performs K-fold cross-validation over the K given documents, scores each fold, and stores the model for each fold.
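With K folds over K documents, one natural reading is that each fold holds out a single document and trains on the remaining ones. The sketch below shows that split under this assumption; the function name leave_one_document_out and the file names are hypothetical, and the actual script may split, train, and score differently.

```python
def leave_one_document_out(documents):
    """Yield (training_documents, held_out_document) for each fold.

    With K documents this produces K folds, each holding out one document.
    Illustrative only; not taken from the repository's code.
    """
    for i, held_out in enumerate(documents):
        training = documents[:i] + documents[i + 1:]
        yield training, held_out


# Made-up document names for the example.
docs = ["doc_a.xml", "doc_b.xml", "doc_c.xml"]
for fold, (train, test) in enumerate(leave_one_document_out(docs)):
    print(fold, train, test)
```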
Takes a document XML file, a mention/coreference XML file, and a model as generated by the cross-validation step described above.
Predicts coreference chains over the given mentions and stores the result in an output XML file.
Runs crossvalidate_coref.py with appropriate arguments, then runs predictcoreference.py to demonstrate usage of that tool.
By default, this script runs on the two smallest documents and tests coreference prediction on one of the files. The resulting scores are not meaningful, but the run is fast and demonstrates how to use the script. Running on all documents can take several hours and more than 40 GB of RAM, depending on configuration. (The SVC method on gold mentions requires less than 10 GB.)
- METU Sabanci Turkish Treebank: http://ii.metu.edu.tr/corpus
- Peter Schüller firstname.lastname@example.org
- Marmara University, Istanbul
- KnowLP Research Group http://www.knowlp.com/
- Kübra Cıngıllı (2016)
- Ferit Tunçer (2015,2016)
- Hacer Ezgi Karakaş (2016)
- Barış Gün Sürmeli (2015)
- Ayşegül Pekel (2015)
This project has been supported by The Scientific and Technological Research Council of Turkey (TUBITAK) under grant agreement 114E430.