GenX Referring Expression Corpus
Author: Nicholas FitzGerald (nfitz@cs.washington.edu)

This repository contains a dataset of referring expressions. The dataset
consists of images of coloured blocks, some of which are circled. For each
image, 20 users of Amazon Mechanical Turk were asked to fill in the blank in
the sentence "Please pick up _________" in such a way as to instruct a partner
to pick up the circled blocks.

This data was collected for the following paper (please cite this paper if
using this data for your own research):

Learning Distributions over Logical Forms for Referring Expression Generation
Nicholas FitzGerald, Yoav Artzi, Luke Zettlemoyer
Empirical Methods in Natural Language Processing (EMNLP 2013)

----------------------------------------------------------------------

Here is a directory of the files in this repository:

<images>
    Contains the .png images shown to the subjects on Mechanical Turk.

<state>

<labelling>
    ALL.txt
        Contains all the referring expressions collected from Mechanical Turk.
        These have been preprocessed by converting to lowercase and normalizing
        punctuation.

    ALL_SPELLCHECKED.txt
        All the referring expressions after spellchecking, with "the" added to
        the front of any expression lacking a determiner (e.g. "brown blocks"
        -> "the brown blocks").

    <all> - contains the data split and labels used for the full task (see
            the EMNLP paper above)

    <single> - contains only scenes whose target set is a single object, for
            the single-object subtask.

        <all> and <single> contain the following:

            init - the initialization set (manually labeled), used to train
                    the semantic parser that automatically labels the bulk of
                    the training data
            devtest - the development test set (manually labeled)
            devtrain - unlabeled training data which was labeled with the
                    semantic parser
            heldout - the heldout test data (manually labeled)
            LABELED_TRAINING - devtrain data labeled by the trained semantic
                    parser, concatenated with init and devtest


            For devtest, heldout and init, the files labeled "NOBAD" omit
            expressions that could not be assigned a meaning representation,
            for one of the following reasons:

                - The referring expression was incorrect (did not pick out the
                  right set of objects)
                - The expression used a concept that we do not model (e.g. a
                  spatial relation, size, or material type)
                - The expression was ungrammatical in a way that could not be
                  easily resolved (i.e. was just a list of attributes, not a
                  proper noun phrase).
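
----------------------------------------------------------------------

The preprocessing described for ALL.txt and ALL_SPELLCHECKED.txt can be
sketched roughly as follows. This is only an illustration, not the script used
to build the corpus. The function names, the determiner list, and the choice
to treat "normalizing punctuation" as replacing punctuation with spaces are
all assumptions. Spellchecking is not shown.

```python
import re

# Hypothetical determiner list; the rule actually used for the corpus is not
# documented in this README.
DETERMINERS = {"the", "a", "an", "all", "both", "every", "each",
               "this", "that", "these", "those"}


def normalize(expression: str) -> str:
    """Lowercase the expression and normalize punctuation.

    Assumption: punctuation (except apostrophes) is replaced by spaces and
    runs of whitespace are collapsed to a single space.
    """
    text = expression.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())


def add_determiner(expression: str) -> str:
    """Prepend "the" when the expression does not start with a determiner."""
    words = expression.split()
    if words and words[0] not in DETERMINERS:
        return "the " + expression
    return expression
```

For example, add_determiner(normalize("Brown Blocks!")) gives
"the brown blocks", while an expression that already starts with "the" is
returned unchanged.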