HNMT evaluation
Preliminary conclusions
Ensembling
Ensembling helps in the following ways; to summarize, use method 2 (savepoint parameter averaging) for speed or method 3 (independent models) for accuracy. A sketch of the two ensembling styles follows this list.
- A proper ensemble of the last 3 savepoints (1 hour interval) gives about 1 BLEU/chrF3 point.
- An averaging "ensemble" (just average model parameters) of the last 3 savepoints gives roughly the same result as a proper ensemble.
- A proper ensemble of 3 independently initialized and trained models is about 2 BLEU/chrF3 points above the baseline, and 1 point above the savepoint ensembles above.
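For concreteness, a minimal sketch of the two ensembling styles, under the assumption that a savepoint is a dict mapping parameter names to numpy arrays (HNMT's actual savepoint format and decoder interface may differ): a proper ensemble averages the models' output distributions at every decoding step, while the averaging variant collapses the savepoints into a single model up front.

```python
import numpy as np

def average_parameters(savepoints):
    """Method 2: average model parameters across savepoints.
    `savepoints` is a list of dicts mapping parameter names to
    numpy arrays (a hypothetical savepoint format); the result
    is a single model with the same parameter shapes."""
    return {name: np.mean([sp[name] for sp in savepoints], axis=0)
            for name in savepoints[0]}

def ensemble_step_probs(models, state):
    """Methods 1 and 3 ('proper' ensemble): average the per-step
    output distributions of several models during decoding.
    `models` and `output_distribution` are hypothetical stand-ins
    for HNMT's decoder interface, not its actual API."""
    return np.mean([m.output_distribution(state) for m in models], axis=0)
```

The practical difference: parameter averaging pays the ensembling cost once and decodes at single-model speed, while a proper ensemble runs every member at each decoding step.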
Various notes
- Word-based decoders are pretty bad in both directions. Why?
- Layer Normalization seems to hurt.
- Dropout seems to hurt (Austin says this is particularly true for the decoder side).
- Attention loss at most helps a little bit, perhaps not at all.
- A large source vocabulary helps even with a hybrid encoder (Luong & Manning found the same thing). A sketch of the hybrid word/character lookup follows this list.
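For concreteness, a minimal sketch of the hybrid lookup: in-vocabulary source words get a word embedding, and everything else backs off to a character-level encoding. All names here (`word_vocab`, `char_encoder`, ...) are hypothetical, not HNMT's actual API.

```python
def embed_source_token(token, word_vocab, word_embeddings, char_encoder):
    """Hybrid source embedding: word lookup with character backoff.
    `word_vocab` maps in-vocabulary words to row indices of the
    `word_embeddings` matrix; `char_encoder` composes a vector from
    the token's characters (hypothetical names)."""
    if token in word_vocab:
        return word_embeddings[word_vocab[token]]
    return char_encoder(token)  # OOV: back off to characters
```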
newstest2015-enfi
Results are sorted by chrF3, the measure that correlates best with human judgement for English-Finnish according to the WMT shared task on evaluation metrics (a simplified sketch of the metric follows the table).
Configuration | BLEU | chrF3 |
---|---|---|
Online-B (WMT15) | ... | 49.45 |
HNMT-run21+22+24 | 15.08 | 48.77 |
Google Translate (2016-11-06) | 13.65 | 48.76 |
Abumatran (unconstrained WMT15) | 16.0 | 46.89 |
UU (unconstrained WMT15) | 14.8 | 45.82 |
HNMT-run16+17+18 | 12.94 | 45.44 |
Abumatran (constrained WMT15) | 13.0 | 45.26
HNMT-run16-average3 | 11.61 | 44.57
HNMT-run16-ensemble3 | 11.56 | 44.47
HNMT-run16 | 10.71 | 43.40
HNMT-run8-average-ensemble4 | 10.75 | 41.83
HNMT-run8-ensemble4 | 11.21 | 41.81
HNMT-run8-char-long-align1.0:0.9999 | 10.41 | 41.64 |
HNMT-run9-char-long-align1.0:0.999 | 10.75 | 40.96 |
HNMT-run10-char-long-align1.0:0.999 | 9.23 | 39.50 |
HNMT-run5-char-long-ln | 10.33 | 39.42 |
HNMT-run1-char-short-ln-svoc1k | 9.00 | 38.01 |
HNMT-run7-word-ln | 4.85 | 23.86 |
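For reference, chrF3 is the character n-gram F-score with recall weighted three times as heavily as precision (beta = 3). The sketch below is simplified: it pools counts over all n-gram orders for a single sentence pair, whereas the official metric averages per-order scores over the whole test set.

```python
from collections import Counter

def chrf3(hypothesis, reference, max_n=6, beta=3.0):
    """Simplified chrF3: character n-gram F-score with beta=3
    (recall weighted 3x over precision), pooled over n-gram
    orders 1..max_n for one sentence pair."""
    def ngrams(s, n):
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))

    overlap = hyp_total = ref_total = 0
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        overlap += sum((hyp & ref).values())  # clipped n-gram matches
        hyp_total += sum(hyp.values())
        ref_total += sum(ref.values())
    if hyp_total == 0 or ref_total == 0 or overlap == 0:
        return 0.0
    p, r = overlap / hyp_total, overlap / ref_total
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

print(chrf3("tämä on testi", "tämä oli testi"))
```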
newstest2016-enfi
Configuration | BLEU | chrF3 |
---|---|---|
Abumatran (constrained WMT16) | 17.5 | 50.55 |
HNMT-run21+22+24 | 16.07 | 50.00 |
UH-OPUS (WMT16) | 16.97 | 49.96 |
UH-factored (constrained WMT16) | 13.53 | 47.29 |
HNMT-run16+17+18 | 14.03 | 46.93 |
HNMT-run16-average3 | 12.78 | 46.05 |
HNMT-run16-ensemble3 | 12.86 | 46.03 |
HNMT-run16 | 11.91 | 45.02 |
HNMT-run8-ensemble4 | 12.52 | 43.26 |
HNMT-run8-char-long-align1.0:0.9999 | 11.55 | 42.91 |
newstest2015-fien
Results are sorted by BLEU.
Configuration | BLEU | chrF3 |
---|---|---|
HNMT-run12-char-long | 12.38 | 36.70 |
HNMT-run13-word-long | 10.47 | 31.95 |
HNMT-run14-word-long-dropout | 9.92 | 30.90 |
Details
The HNMT variants used above are:
- run21+22+24: ensemble of 3 independently trained models; these use Europarl + 1M of Turku's parallel sentences (1011 version, hopefully the final one) + 1M tokens of backtranslated news.
- run16+17+18: ensemble of 3 independently trained models; these use Europarl + 1M of Turku's parallel sentences (0811 version, not the final one). From this point on, cased and untokenized corpora are used.
- ensemble4: 4 latest savepoints (3 hour intervals) ensembled
- average-ensemble4: 4 latest savepoints (3 hour intervals) model parameters averaged
- char: character-based decoder
- word: word-based decoder (no UNK replacement or anything; target vocabulary is always 50k)
- long: sentence length limit of 60 words/360 chars (longer sentences are removed entirely)
- short: 30 words/180 chars
- ln: layer normalization is used
- dropout: dropout factor 0.2 is used (default is no dropout)
- svocXXX: size of source vocabulary (always with character backoff, default is 10k)
- alignXXX:YYY: attention loss is used, initial value XXX with exponential decay factor YYY per batch
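The alignXXX:YYY notation above is plain exponential decay of the attention-loss weight; a one-line sketch of the schedule:

```python
def attention_loss_weight(initial, decay, batch):
    """Weight of the attention (alignment) loss term after `batch`
    updates, per the alignXXX:YYY notation (initial value XXX,
    decay factor YYY per batch)."""
    return initial * decay ** batch

# e.g. align1.0:0.9999 after 10,000 batches:
print(attention_loss_weight(1.0, 0.9999, 10000))  # ~0.368
```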
Other parameter values:
- lowercase yes (on both sides, trained on lowercased + tokenized data)
- 512 dim output LSTM for character-based models, 256 for word-based ones
- 256 dim input LSTM
- 256 dim attention hidden layer (see the attention sketch below)
- batch size 128
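For the record, a minimal numpy sketch of what the attention hidden layer computes, assuming Bahdanau-style additive attention over a bidirectional 256-dim input LSTM (hence 512-dim encoder states) and the 512-dim character-model decoder state; the weight names are hypothetical:

```python
import numpy as np

def attention_weights(enc_states, dec_state, W_enc, W_dec, v):
    """Additive attention sketch. Assumed shapes: enc_states (T, 512),
    dec_state (512,), W_enc (256, 512), W_dec (256, 512), v (256,).
    The tanh projection below is the 256-dim attention hidden layer."""
    hidden = np.tanh(enc_states @ W_enc.T + W_dec @ dec_state)  # (T, 256)
    scores = hidden @ v                                         # (T,)
    e = np.exp(scores - scores.max())                           # softmax
    return e / e.sum()
```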
Example SLURM script (from run16)
```bash
#!/bin/bash -l
#SBATCH -J hnmt
#SBATCH -o hnmt.stdout.%j
#SBATCH -e hnmt.stderr.%j
#SBATCH -t 72:00:00
#SBATCH -N 1
#SBATCH -p gpu
#SBATCH --mem=16384
#SBATCH --gres=gpu:1
#SBATCH --constraint=k80

module purge
module load python-env/3.4.1
module load cuda/8.0
module list

cd ${SLURM_SUBMIT_DIR:-.}
pwd

echo "Starting at `date`"

SOURCE="en"
TARGET="fi"
MODEL=/wrk/rostling/models/hnmt/run16-$SOURCE-$TARGET-70h

THEANO_FLAGS=optimizer=fast_run,device=gpu,floatX=float32 python3 \
    hnmt.py \
    --save-model "$MODEL".model \
    --log-file "$MODEL".log \
    --source /wrk/rostling/wmt16/ep-turku1m."$SOURCE" \
    --target /wrk/rostling/wmt16/ep-turku1m."$TARGET" \
    --beam-size 4 \
    --source-tokenizer word \
    --target-tokenizer char \
    --max-source-length 100 \
    --max-target-length 600 \
    --source-lowercase no \
    --target-lowercase no \
    --dropout 0 \
    --word-embedding-dims 256 \
    --char-embedding-dims 64 \
    --encoder-state-dims 256 \
    --decoder-state-dims 512 \
    --attention-dims 256 \
    --source-vocabulary 25000 \
    --min-char-count 2 \
    --batch-size 64 \
    --save-every 2500 \
    --test-every 50 \
    --translate-every 1000 \
    --training-time 71

echo "Finishing at `date`"
```