Application of Annotator Corrections to CHILDES Corpora
This codebase applies annotator corrections to a subset of CHILDES transcripts. The subset was hand-checked by trained annotators at LARC. The following files were checked:
- Valian / 01a.cha
- Valian / 01b.cha
- Valian / 02a.cha
- Valian / 02b.cha
- Valian / 03a.cha
- Valian / 03b.cha
- Valian / 04a.cha
- Valian / 04b.cha
- Valian / 05a.cha
- Valian / 06a.cha
- Valian / 06b.cha
- Valian / 07a.cha
- Valian / 08a.cha
- Valian / 08b.cha
- Bloom70 / Peter / peter01.cha
TODO: which files were done without tagchecker?
Annotators corrected files using tagchecker, a custom Python script authored by Linda Liu while at LARC. tagchecker reads CHILDES XML-formatted transcripts, lets an annotator step through the file one utterance at a time, and records any errors the annotator observes in a comma-separated value (csv) text file. These files have the following format, with one line per observed error:
Utterance ID, Speaker ID, Main Tier, MOR Tier, Observed Error, Correction, Notes
"150", "*MOT:", "no , what ?", "qn|no , pro:wh|what ?", "qn|no", "co|no",
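As an illustration of the format above, here is a minimal sketch of reading a correction file with Python's standard csv module. The `Correction` type and the `read_corrections` function are hypothetical names introduced for this example, not part of tagchecker. The sketch assumes that fields are double-quoted and separated by a comma followed by a space, as in the sample row, and that a header row may or may not be present.

```python
import csv
import io
from typing import NamedTuple


class Correction(NamedTuple):
    """One row of a correction csv (hypothetical container for this sketch)."""
    utterance_id: int
    speaker: str
    main_tier: str
    mor_tier: str
    error: str
    correction: str
    notes: str


def read_corrections(handle):
    """Yield one Correction per data row of a tagchecker-style csv."""
    # skipinitialspace lets the reader accept the space after each comma.
    for row in csv.reader(handle, skipinitialspace=True):
        # Skip blank lines and a header row, if one is present.
        if not row or not row[0].strip().isdigit():
            continue
        # Pad short rows so a missing Notes field becomes an empty string.
        fields = (row + [""] * 7)[:7]
        yield Correction(int(fields[0]), *(f.strip() for f in fields[1:]))


sample = '"150", "*MOT:", "no , what ?", "qn|no , pro:wh|what ?", "qn|no", "co|no",\n'
corrections = list(read_corrections(io.StringIO(sample)))
```

The commas inside the Main Tier and MOR Tier fields survive because those fields are quoted, which is why a csv parser is safer here than splitting each line on commas.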
Correction File Column Definitions
TODO: Does numbering start at 0 in tagchecker?
- Utterance ID
- A sequential ID for each utterance in a file. Numbering starts at 0 in each file.
- Speaker ID
- The three-letter speaker identifier used in CHAT transcription, for example MOT, CHI, and INV for mother, child, and investigator, respectively.
- Main Tier
- The verbatim transcript of the utterance.
- MOR Tier (called the CHA tier elsewhere in this document)
- The lemmatized and POS-annotated version of the Main Tier, produced by running CLAN's MOR and POST tools in sequence.
- Observed Error
- The portion of the CHA tier that is considered erroneous. This can span more than one word.
- Correction
- The correct annotation that the Observed Error should be replaced with.
- Notes
- A free-form text field allowing the annotator to add any additional notes.
The tagchecker script reads an XML file as input but writes fragments of plaintext CHA tiers to its output. In order to perform our corrections, we need the transcripts and corrections to be in the same format. I preferred not to convert plaintext CHA fragments to XML, as this is not a well-defined process and seems error-prone. I therefore conceived of the correction process as a supervised find-and-replace procedure. The correction-application script would read two files as input: the plaintext CHA transcript and the correction csv targeting that transcript. Ideally this would simply be the following process:
1. Read utterance N from CHA file.
2. Read error and correction for utterance N from correction file.
3. Locate the error in the utterance.
4. Ask the user if this error should be replaced with the correction.
5. Apply correction.
6. Repeat from step 1 with N+1.
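The idealized procedure can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the actual correction-application script: `apply_corrections` and its parameters are hypothetical names, the tiers are assumed to be already loaded into a list indexed by utterance ID (numbering from 0), and the user prompt is passed in as a `confirm` callable so the logic can be exercised without a terminal.

```python
def apply_corrections(mor_tiers, corrections, confirm):
    """Return a copy of mor_tiers with the confirmed corrections applied.

    mor_tiers:   list of tier strings, indexed by utterance ID
    corrections: iterable of (utterance_id, error, correction) tuples
    confirm:     callable(tier, error, correction) returning True or False
    """
    result = list(mor_tiers)
    for utt_id, error, correction in corrections:
        tier = result[utt_id]                    # steps 1 and 2
        start = tier.find(error)                 # step 3: locate the error
        if start == -1:
            continue                             # not found: leave untouched
        if confirm(tier, error, correction):     # step 4: ask the user
            end = start + len(error)
            result[utt_id] = tier[:start] + correction + tier[end:]  # step 5
    return result


tiers = ["pro|I v|go .", "qn|no , pro:wh|what ?"]
fixes = [(1, "qn|no", "co|no")]
accepted = apply_corrections(tiers, fixes, lambda tier, err, corr: True)
rejected = apply_corrections(tiers, fixes, lambda tier, err, corr: False)
```

In interactive use, `confirm` could wrap a call to `input()` that shows the tier, the error, and the proposed correction and returns whether the annotator answered yes.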
Complications in Correction Process
In reality, step 3 is complicated by the following scenarios:
- The error specified occurs more than once in the utterance. In this case the target of the correction is ambiguous.
- There is a typo in the correction record (for example, in the Observed Error field). In this case the target for the correction cannot be found in the utterance.
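Both scenarios can be detected before any replacement is attempted by counting how often the error occurs in the tier. The sketch below is one possible approach, not the method this codebase necessarily uses: `find_token_spans` and `classify` are hypothetical names, and matching is done on whitespace-separated tokens (an assumption) so that an error such as `n|cat` does not match inside a longer token.

```python
def find_token_spans(tier, error):
    """Return the token index of every place the error's tokens occur in the tier."""
    tier_tokens = tier.split()
    error_tokens = error.split()
    n = len(error_tokens)
    if n == 0:
        return []
    return [i for i in range(len(tier_tokens) - n + 1)
            if tier_tokens[i:i + n] == error_tokens]


def classify(tier, error):
    """Label a correction target as 'missing', 'unique', or 'ambiguous'."""
    hits = find_token_spans(tier, error)
    if not hits:
        return "missing"      # e.g. a typo in the recorded error
    return "unique" if len(hits) == 1 else "ambiguous"


status_ok = classify("qn|no , pro:wh|what ?", "qn|no")
status_dup = classify("co|no co|no .", "co|no")
status_typo = classify("qn|no , pro:wh|what ?", "qn|on")
```

Only the 'unique' case can be applied after a simple yes/no confirmation. The other two cases need the annotator to pick the intended occurrence or to fix the correction record by hand.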
Step 5 is complicated by the fact that a correction that either collapses two words into a compound or splits a compound apart requires the change to be reflected on the speaker tier as well. This is actually a transcription error rather than a tagging error, so while it's good to correct it, we should exclude these occurrences from our analysis.
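One rough way to flag such corrections for exclusion is to compare the number of words in the error with the number in the correction. This is a heuristic sketch only: `changes_word_count` is a hypothetical helper, and the compound notation in the example is illustrative rather than taken from the corrected files.

```python
def changes_word_count(error, correction):
    """True if the correction merges or splits words relative to the error."""
    return len(error.split()) != len(correction.split())


# Two separate words collapsed into a single compound: flagged.
merged = changes_word_count("n|ice n|cream", "n|+n|ice+n|cream")
# A plain retagging of one word: not flagged.
retagged = changes_word_count("qn|no", "co|no")
```

Flagged corrections also need a matching edit on the speaker tier, so they are good candidates for manual handling.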
Complications in Reading Plaintext CHA Transcripts
TODO: Verify the claims in the following paragraph.
The downloadable database of CHILDES transcripts is periodically updated and retagged when CHAT conventions and the tagger's internals change. There doesn't appear to be an archive of snapshots of the state of the CHILDES corpus. Even if there were, it wouldn't help us much, as we didn't record the date we retrieved our transcript data. Thus our (dated) XML source for the errors is out of sync with the (current) plaintext CHA files we want to apply corrections to.
Luckily, CHILDES provides a Java-based tool called Chatter which can convert bidirectionally between plaintext CHA and XML. However, this process is still complicated by the out-of-sync data: the changes in notational conventions mean that our (outdated) XML file is no longer considered a well-formed CHAT XML file by Chatter. This means some preprocessing of our XML files is necessary before Chatter will successfully convert them.
Reading the XML files is more straightforward than the plaintext CHA files. There's a formal schema for CHA XML which exhaustively describes all annotations in the document. The document itself is a tree structure where each node is labeled according to its function.
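To illustrate why the tree structure makes reading easy, here is a minimal sketch using Python's standard xml.etree.ElementTree. The sample document is a hypothetical, heavily simplified stand-in for a real transcript. The namespace URI and the element and attribute names (`u`, `w`, `who`, `uID`) are assumptions made for this example and should be checked against the schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a CHILDES XML transcript.
SAMPLE = """<CHAT xmlns="http://www.talkbank.org/ns/talkbank">
  <u who="MOT" uID="u0"><w>no</w><w>what</w></u>
  <u who="CHI" uID="u1"><w>ball</w></u>
</CHAT>"""

# Namespace prefix map used in the path expressions below (URI is assumed).
NS = {"tb": "http://www.talkbank.org/ns/talkbank"}


def utterances(xml_text):
    """Yield (utterance ID, speaker, words) for each utterance element."""
    root = ET.fromstring(xml_text)
    for u in root.iterfind("tb:u", NS):
        words = [w.text for w in u.iterfind("tb:w", NS)]
        yield u.get("uID"), u.get("who"), words


rows = list(utterances(SAMPLE))
```

Every piece of information is addressed by name, so no custom tokenizing is needed.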
In contrast, the plaintext CHA files use terse syntax for annotations and do not exhibit a tree structure, so reading them requires a custom parser. I know of no formal document that details the constructs used in these files. CHILDES' CLAN Transcription Manual covers the annotations used in CHA files, but it's somewhat informal and it's not clear that it's kept up-to-date with the available transcripts. Additionally, plaintext CHA files have their main tier (verbatim transcription) and CHA tier (lemmatized words with POS annotations) on separate lines. These often don't line up perfectly due to the complex annotations, which further complicates a plaintext CHA parser.
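As a starting point for such a parser, the sketch below groups each main tier line with the dependent tiers that follow it. It is a simplified illustration, not a full CHAT parser: `read_utterances` is a hypothetical name, and the sketch assumes that header lines start with `@`, main tiers with `*`, dependent tiers with `%`, and continuation lines with a tab. It makes no attempt to align individual words across tiers.

```python
def read_utterances(lines):
    """Group each main tier line with the dependent tiers that follow it."""
    utterances = []
    last = None  # (record, key) of the tier most recently read
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("\t"):
            # Continuation line: append it to the tier it belongs to.
            if last is not None:
                record, key = last
                record[key] += " " + line.strip()
            continue
        if line.startswith("@"):
            last = None  # header line, ignored in this sketch
            continue
        # Split on the first colon only, so tags like pro:wh stay intact.
        label, _, text = line.partition(":")
        text = text.strip()
        if line.startswith("*"):
            record = {"speaker": label[1:], "main": text}
            utterances.append(record)
            last = (record, "main")
        elif line.startswith("%") and utterances:
            record = utterances[-1]
            record[label] = text
            last = (record, label)
    return utterances


sample = [
    "@Begin",
    "*MOT:\tno , what ?",
    "%mor:\tqn|no , pro:wh|what ?",
    "*CHI:\tI want the big",
    "\tball .",
    "%mor:\tpro|I v|want det|the adj|big",
    "\tn|ball .",
    "@End",
]
parsed = read_utterances(sample)
```

The position of each record in the returned list can serve as a sequential utterance ID, provided the numbering matches the one tagchecker used.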