New publications, developments and events
- CNN Is All You Need (Qiming Chen, Ren Wu)
Incredible improvement in BLEU scores - is this for real? Check discussion and see the reason ...
- Attention Is All You Need (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin)
The Transformer model by Google, built on attention alone, with neither convolutions nor recurrent layers.
- Convolutional Sequence to Sequence Learning (Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin)
Facebook's convolutional NMT system; translation accuracy comparable to Google's system, but much faster.
- Google’s Neural Machine Translation System (Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi)
Various details about Google's NMT model
- SYSTRAN's Pure NMT
A system based on Torch and the Harvard NMT implementation
- Context Gates for Neural Machine Translation (Zhaopeng Tu, Yang Liu, Zhengdong Lu, Xiaohua Liu, Hang Li)
Context gates that control the influence of source and target context when generating words. Intuition: Content words should rely more on source language context whereas function words should look more at target language context. (code available here)
- Modeling Coverage for Neural Machine Translation (Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, Hang Li)
Add a coverage vector to keep track of the attention history to avoid under- and over-translation. (code available here and the older version here)
- Neural Machine Translation with Reconstruction (Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, Hang Li)
Add a reconstruction layer to improve adequacy of the model. The system needs to reconstruct the source sentence after decoding.
- Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation (Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, Jeffrey Dean)
Multilingual translation by simply adding a language selection token to the training data, and sharing all other parameters.
- Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism (Orhan Firat, Kyunghyun Cho, Yoshua Bengio)
They go beyond Dong et al. (2015) below, using many-to-many translation. While the number of parameters is linear in the number of languages, as far as I can tell the computational complexity is still quadratic, so it would be challenging with Europarl and out of the question with the Bible corpus.
- Multi-task Sequence to Sequence Learning (Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, Lukasz Kaiser)
- Multi-Task Learning for Multiple Language Translation (Daxiang Dong, Hua Wu, Wei He, Dianhai Yu and Haifeng Wang, ACL 2015)
One-to-many translation using a simple sequence-to-sequence model with attention.
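The language selection token from the Google multilingual entry above is simple enough to sketch in a few lines. This is only an illustration of the data preparation step, not the authors' code: the function names are hypothetical, and the `<2es>`-style token spelling is an assumption based on the example given in the paper.

```python
def add_target_token(source_sentence: str, target_lang: str) -> str:
    """Prepend an artificial token naming the desired target language."""
    # Assumed token format: "<2xx>" where xx is a language code.
    return f"<2{target_lang}> {source_sentence}"


def build_training_pairs(corpus):
    """corpus: iterable of (source, target, target_lang) triples."""
    return [(add_target_token(src, lang), tgt) for src, tgt, lang in corpus]


# Toy data; the same source sentence is paired with two target languages.
corpus = [
    ("Hello, how are you?", "Hola, ¿cómo estás?", "es"),
    ("Hello, how are you?", "Hallo, wie geht es dir?", "de"),
]
pairs = build_training_pairs(corpus)
for source, target in pairs:
    print(source, "->", target)
```

Everything else (encoder, decoder, attention, vocabulary) is shared across language pairs, so the token is the only signal telling the model which language to produce.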
Subword and character based methods
- Fully Character-Level Neural Machine Translation without Explicit Segmentation (Jason Lee, Kyunghyun Cho, Thomas Hofmann)
Character-to-character model with convolutions on the source side to reduce sequence length. They also train with multiple languages by just mixing source sentences from different languages into each minibatch, so the network implicitly learns to identify the language.
- An Efficient Character-Level Neural Machine Translation (Shenjian Zhao, Zhihua Zhang)
Sample source sequence before encoding, resample after decoding. Implemented in Blocks
- Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models (Minh-Thang Luong and Christopher D. Manning, ACL 2016)
They use a limited vocabulary but use character LSTMs instead of <UNK> tokens, both for the encoder and decoder. This works better than a purely word-based approach with heuristics for <UNK> substitution. They also find that purely character-based models work, but are slow (about three months) to train.
- A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation (Junyoung Chung, Kyunghyun Cho and Yoshua Bengio, ACL 2016)
Uses a character-level decoder with a subword (byte-pair encoding) encoder, which also works well. Chung et al. also propose a special recurrent network to capture short- and long-range dependencies, but its advantage seems pretty limited, although it might help for very long sentences.
- Character-based Neural Machine Translation (Marta R. Costa-jussà and José A. R. Fonollosa, ACL 2016)
Similar idea to Luong and Manning (2016), but using convolution rather than LSTM and applied only on the source side. The target side seems to be completely word-based.
- Neural Machine Translation of Rare Words with Subword Units (Rico Sennrich and Barry Haddow and Alexandra Birch, ACL 2016)
Using byte-pair encoding to create subword units, which can be used out-of-the-box with standard NMT models to reduce data sparsity.
- Hierarchical Multiscale Recurrent Neural Networks (Junyoung Chung, Sungjin Ahn and Yoshua Bengio)
RNN working at multiple layers of segmentation, which are learned in an unsupervised way. Related to their ACL 2016 paper, but with more empirical results (the ACL paper is interesting but does not show very convincing NMT results in my opinion).
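For the byte-pair encoding entry (Sennrich et al.) above, the merge-learning loop can be sketched as follows. This is a minimal sketch in the spirit of the algorithm described in that paper, not the released implementation: the function names, the toy vocabulary, and the tie-breaking rule (first pair seen wins) are assumptions made for illustration.

```python
import re
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with one merged symbol."""
    # Lookarounds make sure we match whole symbols, not symbol fragments.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda _: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Greedily merge the most frequent pair, num_merges times."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Words are space-separated symbols; "</w>" marks the end of a word.
vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges, segmented = learn_bpe(vocab, 3)
print(merges)
print(segmented)
```

The learned merge list is then applied in order to segment new text, so rare words fall apart into known subword units instead of becoming `<UNK>`.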
Discourse-level NMT / wider context
- Evaluating Discourse Phenomena in Neural Machine Translation (Rachel Bawden, Rico Sennrich, Alexandra Birch, Barry Haddow), test suites for evaluating discourse phenomena in translation
Unsupervised / semi-supervised models
- Unsupervised Neural Machine Translation (Mikel Artetxe, Gorka Labaka, Eneko Agirre, Kyunghyun Cho), NMT training without parallel data
- Unsupervised Machine Translation Using Monolingual Corpora Only (Guillaume Lample, Ludovic Denoyer, Marc'Aurelio Ranzato), monolingual sentence representations mapped through latent space
Improved alignment models
- Implicit Distortion and Fertility Models for Attention-based Encoder-Decoder NMT Model (Shi Feng, Shujie Liu, Mu Li and Ming Zhou, 2016)
Extensions to the basic attention mechanism that use a recurrent attention state, so that alignment links are no longer assumed to be independent of each other (the assumption that IBM Model 1 makes).
- Incorporating Structural Alignment Biases into an Attentional Neural Translation Model (Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vymolova, Kaisheng Yao, Chris Dyer and Gholamreza Haffari, NAACL 2016)
Another approach that borrows ideas from the higher IBM models for attention models in NMT. After skimming through, it seems like they are simply feeding the kind of statistics IBM models use (jump lengths, fertility, etc.) directly into the attention subnetwork.
- Agreement-Based Joint Training for Bidirectional Attention-Based Neural Machine Translation (Yong Cheng, Shiqi Shen, Zhongjun He, Wei He, Hua Wu, Maosong Sun, Yang Liu)
Agreement between attention-based alignment in different directions
Hybrid models (in whatever sense)
- Pre-Translation for Neural Machine Translation (Jan Niehues, Eunah Cho, Thanh-Le Ha and Alex Waibel)
Translate training data using PB-SMT and feed this into a neural system: The system learns to use phrase-based translation as additional information when translating source language sentences.
- Neural Machine Translation with External Phrase Memory (Yaohua Tang, Fandong Meng, Zhengdong Lu, Hang Li, Philip L.H. Yu)
Neural machine translation with a phrase memory. Incorporates phrase pairs in symbolic form, mined from corpus or specified by human experts
- Syntactically Guided Neural Machine Translation (Felix Stahlberg, Eva Hasler, Aurelien Waite and Bill Byrne)
Combines hierarchical SMT with NMT, which leads to improvements over the individual systems (NMT and hierarchical SMT).
- Incorporating Discrete Translation Lexicons into Neural Machine Translation (Philip Arthur, Graham Neubig, Satoshi Nakamura)
Use discrete translation lexicons in neural MT
Supervision at different layers
- Deep multi-task learning with low level tasks supervised at lower layers (Anders Søgaard and Yoav Goldberg, ACL 2016)
Supervising multi-task learning with some kind of hierarchical structure at multiple layers works well.
Optimization and regularization methods
- A Theoretically Grounded Application of Dropout in Recurrent Neural Networks (Yarin Gal, 2015)
Dropout has been very successful for regularization of different types of networks, but it has been difficult to apply to RNNs. Gal presents a method that actually works, has a theoretical foundation in variational Bayesian methods (so it is sometimes referred to as "variational dropout"), and has been adopted by several people already. It drastically reduces overfitting, but comes at the cost of somewhat slower convergence. Implemented in BNAS.
- Layer Normalization (Jimmy Lei Ba, Jamie Ryan Kiros and Geoffrey E. Hinton, 2016)
Similar to Batch Normalization, but normalizing over the nodes in a layer rather than over the same node in a minibatch. Easy to apply to recurrent networks, and our experiments show that their first LSTM variant (equations 20--22) works better than the second one (equations 29--31), although there are issues with numerical stability.
- Multi-Domain Neural Machine Translation through Unsupervised Adaptation (M. Amin Farajian, Marco Turchi, Matteo Negri, Marcello Federico, 2017)
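To make the Layer Normalization entry above concrete, here is a minimal sketch of the difference between the two normalization axes. It is plain Python for readability, not the paper's LSTM variants: the function names, the `eps` stabilizer, and the toy batch are assumptions, and the batch-norm version shown is the bare training-time statistic without running averages or learned parameters.

```python
import math


def standardize(values, eps=1e-5):
    """Shift to zero mean and scale to roughly unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # eps is an assumed small constant guarding against division by zero.
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


def layer_norm(batch, gain=None, bias=None, eps=1e-5):
    """Normalize each example over the units of the layer (row-wise)."""
    width = len(batch[0])
    gain = gain or [1.0] * width  # learned per-unit scale in a real model
    bias = bias or [0.0] * width  # learned per-unit shift in a real model
    return [
        [g * x + b for x, g, b in zip(standardize(row, eps), gain, bias)]
        for row in batch
    ]


def batch_norm(batch, eps=1e-5):
    """For contrast: normalize each unit over the minibatch (column-wise)."""
    columns = [standardize(list(col), eps) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


# Two examples (rows), three units (columns).
batch = [[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]]
ln = layer_norm(batch)
bn = batch_norm(batch)
print(ln)
print(bn)
```

Because `layer_norm` never looks across rows, it gives the same result for a batch of one, which is part of why it is easy to apply at every time step of a recurrent network.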
Cool stuff, possibly useful
- Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders (Antonio Valerio Miceli Barone, ACL 2016 representation workshop)
Multilingual word vectors without parallel or even comparable corpora, simply trying to enforce similar distributions between the vector spaces of different languages. Seems to work so-so under these very restricted unsupervised conditions, but could it be used to improve low-resource word vectors?