TM Server

A server-side application that uses the TM Engine.

Lucene/Solr could be a good starting point for a scalable infrastructure.

Segment vs Corpus-based and a Possible Hybrid Architecture

There are two main TM types currently used in the industry: segment-based and corpus-based. Segment-based TMs store and retrieve segments out of context, although some segment-based architectures store limited context. Corpus-based TMs store entire documents, and retrieved segments are shown in the full context of the original source and its translations. Corpus-based TMs rely on fast alignment algorithms to match up source and target paragraphs.

Both TM architectures have their advantages. Localization shops whose bread and butter is cost-per-word pricing rely on segment-based TMs to provide estimates to clients and to calculate payment to translators. Free translations (articles, books, etc.) and projects with little segment reuse are better suited to corpus-based architectures.

Using multiple indexes, it would be possible to have the best of both worlds. A hybrid segment/corpus TM would store and index entire (segmented) documents and also index individual segments. The trick would be to avoid duplicating storage by using some type of pointer to each segment. Monolingual Okapi resource files could be the standard document unit.
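As a rough illustration of the pointer idea, here is a minimal in-memory sketch in Python. It is not Okapi or Lucene/Solr code: `HybridStore`, `SegmentRef` and all method names are hypothetical, and a hash of the lowercased segment text stands in for a real segment index. Each document is stored once, and the segment index holds only (document id, position) pointers.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class SegmentRef:
    """Pointer to one segment: a document id plus a position, not a copy of the text."""
    doc_id: str
    index: int


def segment_key(text: str) -> str:
    # Hypothetical normalization: trim, lowercase, then hash so the index holds no text.
    return hashlib.sha1(text.strip().lower().encode("utf-8")).hexdigest()


@dataclass
class HybridStore:
    # Document storage: every document is kept once, as an ordered list of segments.
    documents: Dict[str, List[str]] = field(default_factory=dict)
    # Segment index: maps a segment key to pointers into `documents`.
    segment_index: Dict[str, List[SegmentRef]] = field(default_factory=dict)

    def add_document(self, doc_id: str, segments: List[str]) -> None:
        self.documents[doc_id] = list(segments)
        for i, text in enumerate(segments):
            self.segment_index.setdefault(segment_key(text), []).append(SegmentRef(doc_id, i))

    def lookup(self, text: str) -> List[SegmentRef]:
        """Segment-based view: find every stored occurrence of a segment."""
        return self.segment_index.get(segment_key(text), [])

    def resolve(self, ref: SegmentRef) -> str:
        """Follow a pointer back to the stored segment text."""
        return self.documents[ref.doc_id][ref.index]

    def context(self, ref: SegmentRef, window: int = 1) -> List[str]:
        """Corpus-based view: the segment together with its neighbours in the document."""
        segments = self.documents[ref.doc_id]
        start = max(0, ref.index - window)
        return segments[start:ref.index + window + 1]
```

Because a pointer resolves back into the stored document, the same entry can serve a segment lookup (`lookup`, `resolve`) and an in-context display (`context`) without storing the text twice.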

Many issues remain to be worked out, including the database schema and bilingual vs. multilingual models.

Finding the Best Matches with a Hybrid TM

  • For each input document, query the TM for the best document matches (using document-to-document similarity, much like Google). This should find the documents most similar to the query document. Cosine similarity could be used as the simplest baseline measure.

  • Once the best document matches are found, begin leveraging at the segment level, using only segments from the preferred document matches.

  • Once all the preferred segment matches are leveraged, search the entire segment index for any remaining matches. These would be considered out of context.

  • Metadata and TM groups could be used to further constrain the search.
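The steps above can be sketched in a few lines of Python. This is only a baseline under stated assumptions, not an Okapi or Lucene/Solr implementation: `hybrid_search`, `tokenize` and the `allowed_docs` filter (standing in for metadata or TM group constraints) are hypothetical names, document similarity uses raw term-frequency vectors, and segment matching is exact rather than fuzzy.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real TM would use proper text analysis.
    return text.lower().split()


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_search(query_segments, tm_documents, top_docs=2, allowed_docs=None):
    """tm_documents maps doc_id -> list of (source, target) segment pairs."""
    # Optional constraint, e.g. derived from metadata or a TM group.
    candidates = {
        doc_id: pairs for doc_id, pairs in tm_documents.items()
        if allowed_docs is None or doc_id in allowed_docs
    }

    # Step 1: rank stored documents by similarity to the whole query document.
    query_vec = Counter(tokenize(" ".join(query_segments)))
    scored = []
    for doc_id, pairs in candidates.items():
        doc_vec = Counter(tokenize(" ".join(src for src, _ in pairs)))
        scored.append((cosine_similarity(query_vec, doc_vec), doc_id))
    scored.sort(reverse=True)
    preferred = [doc_id for score, doc_id in scored[:top_docs] if score > 0]

    results = {}
    # Step 2: leverage segments from the preferred documents first (in context).
    for doc_id in preferred:
        for src, tgt in candidates[doc_id]:
            if src in query_segments and src not in results:
                results[src] = (tgt, doc_id, "in-context")

    # Step 3: search all remaining segments for anything still unmatched.
    for doc_id, pairs in candidates.items():
        for src, tgt in pairs:
            if src in query_segments and src not in results:
                results[src] = (tgt, doc_id, "out-of-context")
    return results
```

Each result records which document supplied the match and whether it came from a preferred document, so a caller could rank or penalize out-of-context matches separately.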

Besides the hybrid search method above, pure segment-based and corpus-based searches could also be performed.
