codeswitch
==========

The script embeddings.py extracts character-level string embeddings from tweet spans.

The script expects the elman executable to be installed in PATH. Elman can be
obtained from https://bitbucket.org/gchrupala/elman.
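
You can verify that elman is visible to the script with, for example
(assuming a POSIX shell):

    which elman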

Usage:

usage: embeddings.py [-h] [--type TYPE] [--prefix_size PREFIX_SIZE]
                     [--suffix_size SUFFIX_SIZE]
                     input model

Generate elman embeddings for code-switching data

positional arguments:
  input                 path to tweet file
  model                 path to elman model

optional arguments:
  -h, --help            show this help message and exit
  --type TYPE           type of features (discrete|raw)
  --prefix_size PREFIX_SIZE
                        size of prefix for discrete features
  --suffix_size SUFFIX_SIZE
                        size of suffix for discrete features

An example input file is provided in data/ne-en-sample-blanks-plain.csv
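
For example, the script could be run on the sample file with the
pre-trained model described below (the feature type shown here is
illustrative; both types are described next):

    embeddings.py --type raw data/ne-en-sample-blanks-plain.csv \
        data/twitter.big.elman.4.414000000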

The output has the same format as the input, except that a
comma-separated list of features is appended as the last column for
each span. Features have the form INDEX:VALUE, with zero-valued
features omitted. Two types of features can be extracted:

- raw: The 400-dimensional vectors for each byte are concatenated. The
  feature indices are computed as follows:
  (SPAN_LENGTH - BYTE_INDEX) * 1000 + DIMENSION_INDEX.
  So for example feature 2201 means dimension 201 for the second byte
  from the end (see the sketch after this list).

- discrete: The span is reduced to the concatenation of the initial
  PREFIX_SIZE bytes and the final SUFFIX_SIZE bytes. For the
  400-dimensional vector corresponding to each byte, the 10 largest
  dimensions are kept, and their values are set to [VALUE > 0.5],
  i.e. 1 if the value exceeds 0.5 and 0 otherwise. See the
  source code for how the feature indices are computed.
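
As a minimal sketch, here is how the raw feature indices could be
encoded, decoded, and parsed. This code is illustrative and not part of
embeddings.py; it assumes the formula above, with BYTE_INDEX counted
0-based from the start of the span:

    # Illustrative helpers; not part of embeddings.py.

    def raw_feature_index(span_length, byte_index, dimension):
        # byte_index is 0-based from the start of the span;
        # dimension indexes into the byte's 400-dimensional vector.
        return (span_length - byte_index) * 1000 + dimension

    def decode_raw_feature(index):
        # Inverse mapping: byte position counted from the end of the
        # span, plus the dimension within that byte's vector.
        position_from_end, dimension = divmod(index, 1000)
        return position_from_end, dimension

    def parse_features(column):
        # Parse the appended INDEX:VALUE list (zero-valued features
        # are omitted from the output).
        return {int(i): float(v)
                for i, v in (item.split(':', 1)
                             for item in column.split(','))}

    # Feature 2201 is dimension 201 for the second byte from the end.
    assert decode_raw_feature(2201) == (2, 201)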

The data directory contains a pre-trained elman model,
data/twitter.big.elman.4.414000000. This model was trained on a sample
of Twitter data and is the model used in [1]. The tweets in the sample
were not filtered in any way and could be in any language, so the model
should work for languages commonly found on Twitter, such as English,
Spanish, Chinese or Indonesian.

[1] Grzegorz Chrupała. 2014. Normalizing tweets with edit scripts and
recurrent neural embeddings. In Proceedings of ACL.