
Hiero

This is the original implementation of Hiero, now open-source. Parts will become available as they are cleaned up. One part, however, will likely never be released: xrsdb, which stores grammar rules on disk and was shared with the ISI syntax-based MT system.

This code is being released under the MIT license. See LICENSE.md for more information.

David Chiang. Hierarchical phrase-based translation. 2007. Computational Linguistics 33(2):201-228.

David Chiang. A hierarchical phrase-based model for statistical machine translation. 2005. In Proc. ACL, 263-270.

Building

Requirements:

In the cython/ subdirectory, run python setup.py build. Place the generated .so files somewhere on your PYTHONPATH (putting them in the top-level directory works).


The following instructions are dated 22 Sep 2008 and may no longer be accurate.

Building

Requirements:

  * Python 2.5 or later (http://www.python.org)
  * SRI-LM toolkit (http://www.speech.sri.com/projects/srilm/); under Linux, the libraries must be built with the -fPIC flag (add it to GCC_FLAGS in common/Makefile.machine.<arch>)
  * Boost 1.35.0 or later (http://www.boost.org)
  * Pyrex 0.9.5.1a (= 0.9.5.1.1) (http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/); using a later version may give errors
  * An implementation of MPI

  1. Edit pathnames in:
     * boost-build.jam: location of the Boost distribution
     * Jamfile: locations of the Boost, SRI-LM, and biglm distributions (biglm is internal to ISI/Language Weaver)

  2. Get a working bjam (in tools/jam subdirectory of Boost)

  3. Edit your site-config.jam or user-config.jam to configure Python and MPI

  4. Run bjam install-modules. This should install all Python extension modules and their related libraries in the lib/ subdirectory.

  5. Edit setup.sh to point to the lib/ subdirectory.

Running the MIRA trainer (tuning feature weights)

  1. Assuming you are using a Bourne shell, do . setup.sh

  2. trainer.py decoder.ini -w <weights>

  -p           run on multiple nodes using MPI
  -w <weights> initial weights or weight file; the format is: feature=weight feature=weight etc.
  -x <corpus>  name of corpus to run on
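As an illustration, the feature=weight format above can be read with a few lines of Python. This is a sketch; parse_weights is a hypothetical helper, not part of the toolkit:

```python
# Hypothetical helper (not part of the toolkit) illustrating the
# "feature=weight feature=weight ..." format described above.
def parse_weights(s):
    """Parse a whitespace-separated feature=weight string into a dict of floats."""
    return dict((name, float(value))
                for name, value in (pair.split("=", 1) for pair in s.split()))

weights = parse_weights("lm=0.5 phrase-penalty=-1.0")
```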

The supplied .ini file expects to find the grammar for sentence <n> (starting from 0) in file sentgrammars.<corpus>/grammar.line<n>.
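The naming convention can be expressed in one line of Python; grammar_path here is a hypothetical helper for illustration, not part of the toolkit:

```python
# Hypothetical helper showing the naming convention described above: the
# grammar for sentence n (0-based) of corpus <corpus> lives in
# sentgrammars.<corpus>/grammar.line<n>.
def grammar_path(corpus, n):
    return "sentgrammars.%s/grammar.line%d" % (corpus, n)

print(grammar_path("dev", 0))  # sentgrammars.dev/grammar.line0
```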

To use the parallelized version, you will need to use mpirun and mpi4py (MPI-enabled Python). Set the number of processors to the number of physical processors plus one.
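For example, a hypothetical invocation on a machine with four physical processors might look like this (the weight values and the exact mpirun/interpreter setup are assumptions, not taken from the original documentation):

```shell
# Assumed example: 4 physical processors, so 5 MPI processes
# (one extra for the master). Weight values are placeholders.
mpirun -np 5 trainer.py decoder.ini -w 'lm=0.5 phrase-penalty=-1.0' -p
```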

For the SBMT decoder, use sbmt_decoder.py instead of decoder.ini. It expects an additional argument in place of -x:

  -g <dir>     directory where gars are to be found

Running the decoder

  1. Assuming you are using a Bourne shell, do . setup.sh

  2. decoder.py decoder.ini -w <weights>

  -p           run on multiple nodes using MPI
  -w <weights> initial weights or weight file; the format is: feature=weight feature=weight etc.
  -x <corpus>  name of corpus to run on

The supplied .ini file expects to find the grammar for sentence <n> (starting from 0) in file sentgrammars.<corpus>/grammar.line<n>.

To use the parallelized version, you will need to use mpirun and mpi4py (MPI-enabled Python). Set the number of processors to the number of physical processors plus one.

Some notes on the implementation

Training:

  alignment.py   manipulation of word alignments
  refiner.py     alignment refinement (symmetrization)
  lexweights.py  lexical weights
  extractor.py   extraction into intermediate grammar
  scorer.py      filtering and scoring to produce the final grammar
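To give a flavor of the extraction step, the standard consistency check for a phrase pair against a word alignment looks roughly like this. This is a sketch of the textbook criterion, not the actual extractor.py code:

```python
# Sketch of the standard phrase-pair consistency criterion (not the
# actual extractor.py implementation): a source span [i1, i2) and target
# span [j1, j2) are consistent if no alignment link crosses either
# boundary and at least one link falls inside the pair.
def consistent(alignment, i1, i2, j1, j2):
    inside = False
    for (i, j) in alignment:
        src_in = i1 <= i < i2
        tgt_in = j1 <= j < j2
        if src_in != tgt_in:   # link crosses the phrase boundary
            return False
        if src_in and tgt_in:
            inside = True
    return inside
```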

Decoding:

  decoder.py    main decoder module
  model.py      base class for model components, plus some basic models
  lm.pyx        language model
  srilm.pyx     SRI-LM wrapper, plus some intermediate code
  srilmwrap.cc  C wrapper for C++ code
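As a toy illustration of what the language model component computes (this is not the lm.pyx code, and the probability table is made up):

```python
import math

# Toy bigram language model scoring, for illustration only (not lm.pyx).
# probs maps (history, word) pairs to conditional probabilities.
def bigram_logprob(probs, words):
    """Sum log P(w_i | w_{i-1}) over the sentence, with <s> and </s> markers."""
    padded = ["<s>"] + words + ["</s>"]
    return sum(math.log(probs[(h, w)]) for h, w in zip(padded, padded[1:]))

probs = {("<s>", "a"): 0.5, ("a", "b"): 0.25, ("b", "</s>"): 1.0}
```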

Common:

  rule.pyx    grammar rule objects
  grammar.py  just has some random functions now and should be retired
  lex.py      word-to-number mapping
  sym.py      functions for dealing with nonterminal symbols
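The word-to-number mapping is conceptually just an interning table; here is a minimal sketch in that spirit (the class and method names are assumptions, not the actual lex.py API):

```python
# Minimal interning table in the spirit of lex.py; the names here are
# illustrative, not the actual API.
class Lexicon(object):
    def __init__(self):
        self.words = []      # number -> word
        self.numbers = {}    # word -> number

    def number(self, word):
        """Return the number for word, assigning a fresh one if unseen."""
        if word not in self.numbers:
            self.numbers[word] = len(self.words)
            self.words.append(word)
        return self.numbers[word]

    def word(self, number):
        return self.words[number]
```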

Utilities:

  sgml.py     SGML handling
  log.py      logging
  monitor.py  CPU/memory monitoring