This tool converts the MMAX-annotated C12-2103-dataset into a form usable by BART-2.0. It does the following things:
- Flatten the folder structure from:

    annotation/
      C/
        CNN-NNNN/
          Basedata/*
          Markables/*
          CNN-NNNN.mmax
      D/
        DNN-NNNN/
          Basedata/*
          Markables/*
          DNN-NNNN.mmax
      P/
        PNN-NNNN/
          Basedata/*
          Markables/*
          PNN-NNNN.mmax

  to something BART can read:

    train/
      Basedata/
        DNN-NNNN.*
        PNN-NNNN.*
      Markables/
        DNN-NNNN.*
        PNN-NNNN.*
      DNN-NNNN.mmax
      PNN-NNNN.mmax
    test/
      Basedata/
        CNN-NNNN.*
      Markables/
        CNN-NNNN.*
      CNN-NNNN.mmax
- Rename XML attributes so that BART can use them:
- Lots of attributes are missing from C12-2103:
min_words="automotive hall of fame"
- Many of these we can't do anything about, unfortunately. Ideally we would fill
FlattenFiles takes two arguments, --sources and --destination. --sources may list multiple directories, separated by commas, but --destination must be a single directory. The corpus currently contains three sets of annotations, C, D, and P, all located within the annotation/ folder. The idea is to copy two of these to a "train" corpus and one to a "test" corpus for BART to work with. Thus, to get training:
sbt 'run-main org.tempura.c122103dataset.FlattenFiles --sources /path/to/C12-2103-dataset/annotation/D,/path/to/C12-2103-dataset/annotation/P --destination /path/to/BART-2.0/C12-2103-dataset/english/train'
And to get test:
sbt 'run-main org.tempura.c122103dataset.FlattenFiles --sources /path/to/C12-2103-dataset/annotation/C --destination /path/to/BART-2.0/C12-2103-dataset/english/test'
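For readers who want to see what the flattening amounts to, here is a minimal shell sketch of the layout change described above. It is not the FlattenFiles implementation (which is Scala); the function name `flatten_docs` is made up for illustration, and the sketch assumes file names are already unique per document, as the target layout suggests.

```shell
# Illustrative sketch only: mimics the layout change described above,
# not the actual FlattenFiles code. flatten_docs is a hypothetical name.
# Usage: flatten_docs DEST SRC [SRC...]
flatten_docs() {
    dest=$1
    shift
    mkdir -p "$dest/Basedata" "$dest/Markables"
    for src in "$@"; do
        # Each source (e.g. annotation/D) holds one folder per document.
        for doc in "$src"/*/; do
            [ -d "$doc" ] || continue
            # Pool every document's files into the shared folders.
            for sub in Basedata Markables; do
                if [ -d "$doc$sub" ]; then
                    cp -R "$doc$sub/." "$dest/$sub/"
                fi
            done
            # The .mmax project files go directly under the destination.
            for f in "$doc"*.mmax; do
                if [ -f "$f" ]; then
                    cp "$f" "$dest/"
                fi
            done
        done
    done
}
```

Called as `flatten_docs train annotation/D annotation/P`, this would produce the same kind of train/ tree shown earlier. Unlike FlattenFiles, it takes the sources as separate arguments rather than a comma-separated list.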
ReplaceAttributes will recursively scan a directory structure, so if test and train are both in the same directory, you can just point it to that directory and let it go:
sbt 'run-main org.tempura.c122103dataset.ReplaceAttributes path/to/flattened/dataset'
Because it overwrites files, it asks you to confirm that you want to do this; hit "y" to let it loose.
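Since the files are overwritten, you may want a copy of the flattened dataset before confirming. The following is a small optional sketch, not part of the tool: `backup_dataset` is a hypothetical helper name, and the `.bak` suffix is an arbitrary choice.

```shell
# Optional helper, not part of the tool: copy the dataset aside before
# running ReplaceAttributes. backup_dataset is a hypothetical name.
# Usage: backup_dataset ROOT   (creates ROOT.bak, prints its file count)
backup_dataset() {
    root=${1%/}
    # Refuse to clobber an earlier backup.
    if [ -e "$root.bak" ]; then
        echo "backup already exists: $root.bak" >&2
        return 1
    fi
    cp -R "$root" "$root.bak" || return 1
    # Report how many files were copied.
    find "$root.bak" -type f | wc -l | tr -d ' '
}
```

For example, `backup_dataset path/to/flattened/dataset` would leave an untouched copy in `path/to/flattened/dataset.bak`.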