Speech to Translation Alignment

by Antonis Anastasopoulos

This is the implementation of our EMNLP 2016 paper, An Unsupervised Probability Model for Speech-to-Translation Alignment of Low-Resource Languages.

If you use this code, please cite the paper:

  author    = {Anastasopoulos, Antonios  and  Chiang, David  and  Duong, Long},
  title     = {An Unsupervised Probability Model for Speech-to-Translation Alignment of Low-Resource Languages},
  booktitle = {Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing},
  month     = {November},
  year      = {2016},
  address   = {Austin, Texas},
  publisher = {Association for Computational Linguistics},
  pages     = {1255--1263},
  url       = {}

Data preprocessing

The code requires several inputs:

  1. candidate phone boundaries

  2. Voice Activity Detection spans (which are actually silence spans)

  3. Feature representations of the audio signal (MFCC or PLP)

The bash script takes the directory of the audio files as input and produces the necessary outputs. We provide the code for silence detection (utils/) as well as our implementation of the unsupervised phone boundary detection method of Khanagha et al. (utils/). For PLP feature extraction, we use the feacalc library from the ICSI set of tools.


A toy example Griko-Italian dataset is available under data. The whole Griko-Italian corpus is available here.

The bash script sets a number of parameters and runs our code. The main script is a Python script that takes the following parameters:

  • -m The maximum length of the translation sentences to which training is limited.

  • -k The value of the \lambda parameter of the fast-align parameterization. The higher this value, the more sharply the distribution is peaked along the diagonal. As this value goes to 0, the distribution approaches that of IBM Model 1.

  • -a The directory to write intermediate results for some of the hidden variables of our model.

  • -o The directory to store the alignment output in every iteration.

  • -t The number of EM iterations to run.

  • -l A list of the files that we will use for training.

  • -d The translations for the files that are listed under the -l parameter. The two files should therefore be parallel.

  • -f The directory of the speech features.

  • -b The directory with the candidate boundaries.

  • -j The directory with the silence detection spans.

  • -p The parameter for add-p smoothing for the translation subcomponent.

  • -r The sampling rate at which the audio files were recorded (needed to convert the phone boundaries to the 10 ms space of the speech frames).

  • -n Normalize the speech frames to have zero mean and unit variance. [optional]
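To make the -k and -r parameters more concrete, here is a minimal sketch of the two computations they control. It is not taken from our scripts: the function names are hypothetical, the prior follows the general fast-align form exp(-\lambda |i/m - j/n|) without a null alignment, and the boundaries are assumed to be given as sample indices.

```python
import math


def diagonal_prior(j, n, m, lam):
    """Distribution over positions i = 1..m for position j of n.

    Each position is scored with exp(-lam * |i/m - j/n|) and the scores
    are normalized. Hypothetical helper, for illustration only.
    """
    scores = [math.exp(-lam * abs(i / m - j / n)) for i in range(1, m + 1)]
    total = sum(scores)
    return [s / total for s in scores]


def samples_to_frame(boundary_sample, sample_rate, frame_shift_s=0.010):
    """Map a boundary given as a sample index to a 10 ms frame index.

    Assumes the candidate boundaries are stored as sample indices.
    """
    return int(round(boundary_sample / sample_rate / frame_shift_s))


if __name__ == "__main__":
    # With lam = 0 every position is equally likely, as in IBM Model 1.
    print(diagonal_prior(j=2, n=4, m=4, lam=0.0))
    # A larger lam concentrates the mass near the diagonal position.
    print(diagonal_prior(j=2, n=4, m=4, lam=10.0))
    # One second of audio at 16 kHz corresponds to 100 frames of 10 ms.
    print(samples_to_frame(16000, 16000))
```

With lam set to 0 all scores are equal, so the prior is uniform; increasing lam moves probability mass toward the position whose relative offset matches j/n.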


The script utils/ implements our evaluation metric, which computes precision, recall, and F-score over the alignment links between source speech frames and target translation words. Its parameters are:

  • -i The directory with the alignment outputs

  • -g The directory with the gold alignments

  • -l A list of the files to evaluate on.

  • -m The maximum length of the translation sentences to evaluate on.

  • -o Output file with utterance-level results as well as overall results.
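The link-level metric can be sketched as follows. This is an illustration rather than the evaluation script itself: the function name is hypothetical, and each alignment is assumed to be a collection of (frame index, word index) pairs.

```python
def link_prf(predicted_links, gold_links):
    """Precision, recall, and F-score over frame-to-word alignment links.

    Both arguments are iterables of (frame_index, word_index) pairs.
    This input format is an assumption made for the sketch.
    """
    predicted = set(predicted_links)
    gold = set(gold_links)
    correct = len(predicted & gold)

    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Four frames aligned to two words; the hypothesis misplaces frame 2.
    predicted = [(0, 0), (1, 0), (2, 1), (3, 1)]
    gold = [(0, 0), (1, 0), (2, 0), (3, 1)]
    print(link_prf(predicted, gold))
```

Overall results could be obtained either by pooling the links of all utterances before calling the function or by averaging per-utterance scores; which of the two the script uses is not specified here.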