
C12-2103-dataset-converter

Description

This tool converts data from the MMAX-annotated C12-2103-dataset into a form usable by BART-2.0. It does the following:

  • Flatten the folder structure from:

    annotation/
        C/
            CNN-NNNN/
                Basedata/*
                Markables/*
                CNN-NNNN.mmax
        D/
            DNN-NNNN/
                Basedata/*
                Markables/*
                DNN-NNNN.mmax
        P/
            PNN-NNNN/
                Basedata/*
                Markables/*
                PNN-NNNN.mmax
    

to something BART can read:

    train/
        Basedata/
            DNN-NNNN.*
            PNN-NNNN.*
        Markables/
            DNN-NNNN.*
            PNN-NNNN.*
        DNN-NNNN.mmax
        PNN-NNNN.mmax
    test/
        Basedata/
            CNN-NNNN.*
        Markables/
            CNN-NNNN.*
        CNN-NNNN.mmax
  • Rename XML attributes so that BART can use them:
    • coref_class => coref_set
    • Many attributes that BART uses are missing from C12-2103, for example:
      • generic="generic-no"
      • person="per3"
      • related_object="no"
      • gram_fnc="adjunct"
      • number="sing"
      • reference="new"
      • category="space"
      • mmax_level="coref"
      • gender="neut"
      • min_words="automotive hall of fame"
      • min_ids="word_14..word_17"
      • coref_set="set_51"
    • Unfortunately, we can't do anything about many of these. Ideally we would fill in min_ids and min_words.
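
The coref_class => coref_set rename listed above can be illustrated with a short Python sketch. This is not the tool's own code: the function name rename_attribute and the sample markable are hypothetical, and the sketch assumes the attribute sits on ordinary XML elements in a markables file.

```python
import xml.etree.ElementTree as ET


def rename_attribute(xml_text, old="coref_class", new="coref_set"):
    """Return xml_text with every `old` attribute renamed to `new`.

    Hypothetical helper for illustration only. Elements that already
    carry `new` are left untouched so no existing value is overwritten.
    """
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if old in element.attrib and new not in element.attrib:
            # Move the value across under the new attribute name.
            element.set(new, element.attrib.pop(old))
    return ET.tostring(root, encoding="unicode")


# Made-up markable, loosely modelled on the attribute values listed above.
SAMPLE = (
    '<markables>'
    '<markable id="markable_1" span="word_14..word_17" coref_class="set_51"/>'
    '</markables>'
)

if __name__ == "__main__":
    print(rename_attribute(SAMPLE))
```

Note that the standard-library serializer may change incidental formatting such as whitespace or namespace prefixes, so a real converter might prefer a more conservative rewrite.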

FlattenFiles usage

FlattenFiles takes two arguments, --sources and --destination. --sources accepts multiple directories (comma-separated, as in the examples below), but --destination must be a single directory. The corpus currently contains three sets of annotations, C, D, and P, all located within the annotation/ folder. The idea is to copy two of these to a "train" corpus and one to a "test" corpus for BART to work with. To build the training corpus:

    sbt 'run-main org.tempura.c122103dataset.FlattenFiles
       --sources /path/to/C12-2103-dataset/annotation/D,/path/to/C12-2103-dataset/annotation/P
       --destination /path/to/BART-2.0/C12-2103-dataset/english/train'

And to build the test corpus:

    sbt 'run-main org.tempura.c122103dataset.FlattenFiles
       --sources /path/to/C12-2103-dataset/annotation/C
       --destination /path/to/BART-2.0/C12-2103-dataset/english/test'
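
To make the flattening step concrete, here is a minimal Python sketch of the directory transformation described in the Description section. It is not the FlattenFiles implementation: the function name flatten is hypothetical, and the sketch assumes that the files inside Basedata/ and Markables/ already carry their document name as a prefix, so they can be copied without renaming.

```python
import shutil
from pathlib import Path


def flatten(sources, destination):
    """Copy every document under each source directory into one flat corpus.

    Hypothetical helper for illustration only. Each source is expected to
    contain one sub-directory per document, holding Basedata/, Markables/
    and a .mmax project file.
    """
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in map(Path, sources):
        for doc_dir in sorted(p for p in source.iterdir() if p.is_dir()):
            # Merge the per-document sub-folders into shared ones.
            for sub in ("Basedata", "Markables"):
                sub_dir = doc_dir / sub
                if not sub_dir.is_dir():
                    continue
                target_dir = destination / sub
                target_dir.mkdir(exist_ok=True)
                for src in sorted(f for f in sub_dir.iterdir() if f.is_file()):
                    target = target_dir / src.name
                    shutil.copy2(src, target)
                    copied.append(target)
            # The .mmax project files go directly into the destination.
            for src in sorted(doc_dir.glob("*.mmax")):
                target = destination / src.name
                shutil.copy2(src, target)
                copied.append(target)
    return copied
```

Calling flatten with the D and P annotation directories and a train destination would correspond to the first command above. Files with the same name in different documents would overwrite each other in this sketch.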

ReplaceAttributes usage

ReplaceAttributes will recursively scan a directory structure, so if test and train are both in the same directory, you can just point it to that directory and let it go:

    sbt 'run-main org.tempura.c122103dataset.ReplaceAttributes path/to/flattened/dataset'

Because it overwrites files, it asks you to confirm first; enter "y" to proceed.
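
The scan-then-confirm behaviour can be sketched in Python as follows. This is an illustration under stated assumptions, not ReplaceAttributes itself: the names find_xml_files and rewrite_tree are hypothetical, the sketch assumes only *.xml files are rewritten, and the transform is passed in by the caller.

```python
from pathlib import Path


def find_xml_files(root):
    """Recursively collect *.xml files under root (assumed file pattern)."""
    return sorted(p for p in Path(root).rglob("*.xml") if p.is_file())


def rewrite_tree(root, transform, ask=input):
    """Apply transform to every XML file under root after confirmation.

    Hypothetical helper for illustration only. `ask` defaults to the
    built-in input() and can be swapped out for non-interactive use.
    Returns the list of files whose contents actually changed.
    """
    files = find_xml_files(root)
    answer = ask("Overwrite %d file(s) under %s? [y/N] " % (len(files), root))
    if answer.strip().lower() != "y":
        return []  # anything other than "y" leaves the files untouched
    changed = []
    for path in files:
        original = path.read_text(encoding="utf-8")
        updated = transform(original)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

Keeping the prompt separate from the transform makes it possible to point the same scan at a directory holding both train and test, as described above.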