codeswitch
==========

The script embeddings.py extracts character-level string embeddings from tweet spans.

The script expects the elman executable to be installed in PATH. Elman can be obtained from https://bitbucket.org/gchrupala/elman.

Usage:

    usage: embeddings.py [-h] [--type TYPE] [--prefix_size PREFIX_SIZE]
                         [--suffix_size SUFFIX_SIZE]
                         input model

    Generate elman embeddings for code-switching data

    positional arguments:
      input                 path to tweet file
      model                 path to elman model

    optional arguments:
      -h, --help            show this help message and exit
      --type TYPE           type of features (discrete|raw)
      --prefix_size PREFIX_SIZE
                            size of prefix for discrete features
      --suffix_size SUFFIX_SIZE
                            size of suffix for discrete features

An example input file is in data/ne-en-sample-blanks-plain.csv.

The output has the same format as the input, except that a comma-separated list of features is appended as the last column for each span. Features have the form INDEX:VALUE, with zero-valued features omitted.

Two types of features can be extracted:

- raw: The 400-dimensional vectors for each byte are concatenated. The feature indices are computed as follows: (SPAN_LENGTH - BYTE_INDEX) * 1000 + DIMENSION_INDEX. So for example feature 2201 means dimension 201 for the second byte from the end.

- discrete: The span is reduced to the concatenation of the initial PREFIX_SIZE bytes and the final SUFFIX_SIZE bytes. For each 400-dimensional vector corresponding to a byte, the 10 largest dimensions are kept, and their value is set to [VALUE > 0.5], that is, 1 if the value exceeds 0.5 and 0 otherwise. See the source code for how the feature indices are computed.

The directory data contains a pre-trained elman model, data/twitter.big.elman.4.414000000. This model is trained on a sample of Twitter data and is the model used in Chrupała (2014). The tweets in the sample were not filtered in any way and could be in any language. This model should work for languages commonly found on Twitter, such as English, Spanish, Chinese or Indonesian.

Grzegorz Chrupała. 2014.
Normalizing tweets with edit scripts and recurrent neural embeddings. ACL.
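
The output format and the raw feature indexing described above can be illustrated with a minimal sketch. The helper names below (`parse_features`, `raw_feature_index`, `decode_raw_index`) are hypothetical and are not part of embeddings.py. The sketch assumes a position stride of 1000, which is what the "feature 2201" example implies, and a 0-based BYTE_INDEX; check the source code before relying on either assumption.

```python
# Hypothetical helpers for reading the feature column written by embeddings.py.
# They are an illustration only and do not come from the script itself.

# Assumption: positions are spaced 1000 apart, as implied by the example
# "feature 2201 means dimension 201 for the second byte from the end".
RAW_POSITION_STRIDE = 1000


def parse_features(column):
    """Parse 'INDEX:VALUE,INDEX:VALUE,...' into a sparse dict {index: value}.

    Zero-valued features are omitted from the output file, so any index
    missing from the returned dict should be read as 0.
    """
    features = {}
    for item in column.split(","):
        item = item.strip()
        if not item:
            continue
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return features


def raw_feature_index(span_length, byte_index, dimension_index):
    """Compute a raw feature index (assumes byte_index counts from 0)."""
    return (span_length - byte_index) * RAW_POSITION_STRIDE + dimension_index


def decode_raw_index(feature_index):
    """Split a raw feature index into (position from the end, dimension)."""
    return divmod(feature_index, RAW_POSITION_STRIDE)


if __name__ == "__main__":
    # A made-up feature column, not taken from real output.
    sparse = parse_features("1003:0.25,2201:0.9")
    for index, value in sorted(sparse.items()):
        from_end, dimension = decode_raw_index(index)
        print(f"byte {from_end} from the end, dimension {dimension}: {value}")
```

Counting positions from the end of the span means that features for the final bytes of spans of different lengths share the same indices.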