TM API

GlobalSight APIs

About as simple as it can get:

* LeverageMatches leverageFuzzySegments(Locales, segments, LeverageOptions)
* LeverageMatches leverageExactSegments(Locales, segments, LeverageOptions)

Of course the real meat is in the options and the filtering that happens after the query.

Requirements

Here are some ideas for requirements for the TM server:

  1. The caller should be able to specify a source and a target language.
  2. The caller should be able to filter the results based on a set of conditions: attributes and values.
  3. Each result must have an associated score value between 0 and 100.
  4. A score of 100 must correspond to exactly the same source in the query and the result.
  5. A score below 100 must be meaningful within the result set (e.g. a result with a score of 50 should be twice as good as a result with a score of 25, or something similar).
  6. There should be an option to retrieve (or not) results that are exact matches from the text viewpoint, but have a difference in attributes.
  7. The server should be able to export its collection of entries with their metadata (probably in TMX).
  8. The server should be able to import entries with their metadata (probably from TMX).
  9. There should be provisions for each entry to have an indicator about the entry that, in the original document, comes just before, and the one that comes just after (e.g. a hash-code).
  10. There should be a way for the caller to query the server with those indicators.
  11. There should be provisions for each entry to store a group ID and a sequence ID that make it possible to group together all entries that belong to the same group in the original document (such as all the sentences of a paragraph).
  12. There should be a way to add/insert new entries in the repository one by one, along with their metadata.
  13. There should be a way to remove entries from the repository, along with their metadata.
  14. There should be a way to set entries aside from the list of entries queried while keeping them in the TM (i.e. as 'inactive' entries that could be restored). (Not a very important requirement.)
  15. There should be a way to update an existing entry (its translation, its metadata).
  16. The server should allow (maybe optionally) multiple translations and/or metadata for the same source.
  17. The results sent back after a query should be ordered by score and, for results with the same score, by attributes, previous/next indicators, etc. as provided by the caller.
  18. The way results are ordered should be clearly documented.
  19. There should be a way for the caller to set a score threshold below which results are not returned.
  20. The source text of the query and the translation results should be in a format that allows them to be taken from or integrated into a TextUnit object (or at least a TextFragment) with little or no conversion.
  21. The query should be able to handle inline codes. And the inline codes should be treated abstractly. That is: two sources with the same abstract inline codes but different corresponding real codes should be treated as equal.
  22. The result should provide easy access (possibly with one additional call) to all the metadata of the entry.
  23. There should be provisions to store the following information for each entry: creation date, modification date, user who created the entry, last user who modified it, (and probably others).
  24. There should be provisions to store annotations and properties with each entry.
  25. There should be a way for the caller to set a maximum number of results to be returned.
  26. Matching options should (possibly) include case-sensitivity, whitespace-sensitivity, code-sensitivity, and more.
  27. The TM engine should be able to adjust a stored translation in cases where doing so yields a better match. For example, if the only difference between the TM source and the query source is leading or trailing spaces, it should try to adjust the TM source and target accordingly so that it can return a better match.
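
Requirements 3, 4, 21 and 26 can be illustrated with a small sketch. It is not the Okapi or GlobalSight implementation: the `MatchOptions` class, the `score` function, the regex that treats anything in angle brackets as an inline code, and the use of Python's `difflib` ratio as the similarity measure are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher

# Assumption: inline codes look like <...>. A real engine would work on the
# code objects of a TextFragment rather than on a regex.
INLINE_CODE = re.compile(r"<[^>]*>")
PLACEHOLDER = "\ue000"  # private-use character standing in for any code


@dataclass
class MatchOptions:
    """Hypothetical matching options (requirement 26)."""
    case_sensitive: bool = True
    whitespace_sensitive: bool = True
    code_sensitive: bool = False  # False: codes are compared abstractly


def normalize(text: str, opts: MatchOptions) -> str:
    """Reduce a segment to the form that the options say should be compared."""
    if not opts.code_sensitive:
        # Every real code becomes the same abstract placeholder (requirement 21).
        text = INLINE_CODE.sub(PLACEHOLDER, text)
    if not opts.whitespace_sensitive:
        # Trim the ends and collapse runs of whitespace to one space.
        text = " ".join(text.split())
    if not opts.case_sensitive:
        text = text.casefold()
    return text


def score(query: str, source: str, opts: MatchOptions) -> int:
    """Return 0..100; 100 only when the normalized forms are identical."""
    q, s = normalize(query, opts), normalize(source, opts)
    if q == s:
        return 100
    ratio = SequenceMatcher(None, q, s).ratio()
    return min(99, int(ratio * 100))
```

With the default options, `score("Click <b>OK</b>", "Click <i>OK</i>", MatchOptions())` treats both sources as equal because only the real codes differ. Whether a `difflib` ratio satisfies requirement 5 (scores that are proportionally meaningful) is an open question that this sketch does not settle.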

Technical Requirements for Okapi TM

Results of discussions with Jim and Dan.

Overview

See http://code.google.com/p/okapi/wiki/TMServer

The Okapi Translation Memory Engine will combine the best features of segment-based and corpus-based TMs. The TM will store documents as Okapi XML resources. Both documents and segments will be versioned, with all versions being preserved. It should be possible to quickly determine differences (at the segment level) between any versions of the same document. The TM engine provides queries that return exact and fuzzy segment matches, but also allows direct document to document queries. Document to document queries provide a ranking of documents based on similarity score. These "similar" documents will be used in segment matching to prefer segments in documents that have an overall context similar to the document being leveraged.
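
The segment-level difference between two versions of a document could be computed along the following lines. This is a sketch under assumptions: a document version is modelled as a plain list of segment strings, `segment_diff` is a hypothetical helper, and `difflib.SequenceMatcher` stands in for whatever comparison the engine would really use.

```python
from difflib import SequenceMatcher


def segment_diff(old, new):
    """Compare two versions of a document segment by segment.

    Returns (operation, old_segments, new_segments) tuples, where operation
    is 'equal', 'replace', 'delete' or 'insert'.
    """
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


# Two hypothetical versions of the same document.
v1 = ["Open the file.", "Click Save.", "Close the window."]
v2 = ["Open the file.", "Click Save As.", "Close the window.", "Restart."]

# Keep only the segments that changed between the versions.
changes = [op for op in segment_diff(v1, v2) if op[0] != "equal"]
```

Segments reported as `equal` are the candidates for the full-document-context matches described below, since both their text and their position in the sequence are unchanged.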

The TM will store documents and segments in such a way that segment queries are as efficient as possible. Various indexes will need to be created on the same data that provide the exact, fuzzy, boolean and document to document queries. Mutable operations such as search and replace and other kinds of updates should also be fast.

Any type of metadata can be added at the document, TU, TUV or phrase levels. The TM should not impose any restrictions on the type or number of metadata items added. Metadata can be added at any time, either during document, TU or TUV creation or after the fact. All metadata is indexed and can be queried.

Automated TM Segment Queries and Match Types

These are matches returned mostly from automated queries such as a leveraging step. Matches are in rank order from best to worst match.

  • ID Match - uses a globally unique ID to match segments. An ID match trumps all other match types.

  • Full Document Context Exact Match - An exact match at the segment level that is guaranteed to be translated exactly as the previous version of the same document. For example, version 1 of a document is stored in TM. A version 2 now needs processing. It should be possible to tell exactly what was translated before based on the full context of the document. Full context in this case includes the ordering of the segments as well.

  • Local Context Exact Match - An exact match that has the same local context as the query. A local context is defined as an exact match for n segments before and after the query segment. The default is usually one segment before and after.

  • Structural Exact Match - Same as an exact match, but the structural properties of the segments are also the same. Structural properties include titles, paragraphs, tables, table cells, lists, menus etc. A match with the same structural properties as the query is preferred over one that is different. Structural matches have higher rank than exact matches.

  • Exact Match - the segment matches everything including formatting, whitespace and case.

  • Fuzzy Match - a fuzzy match can differ from the query segment in a number of ways: (1) white space (2) formatting (3) case (4) affixes (5) words.

  • Phrase Match - a phrase match is a type of fuzzy match that returns sub-phrases from segments found in the TM. For example, a sentence with a sub-phrase "I hate cats" could appear in several TM segments but normally only give a very low fuzzy match when the segments are compared directly. Phrase matches pull out all of these useful phrases.
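
The rank order of the match types, and the difference between a local context exact match and a plain exact match, can be sketched as follows. The `MatchType` enum, the `classify_exact` function, and the treatment of document boundaries (a missing neighbour only matches another missing neighbour) are assumptions for illustration, and documents are modelled as plain lists of segment strings.

```python
from enum import IntEnum


class MatchType(IntEnum):
    """Match types in rank order; a lower value is a better match."""
    ID = 1
    FULL_DOCUMENT_CONTEXT_EXACT = 2
    LOCAL_CONTEXT_EXACT = 3
    STRUCTURAL_EXACT = 4
    EXACT = 5
    FUZZY = 6
    PHRASE = 7


def context(doc, index, n):
    """Return the segment at index with n neighbours on each side.

    Positions outside the document are represented by None.
    """
    return [
        doc[k] if 0 <= k < len(doc) else None
        for k in range(index - n, index + n + 1)
    ]


def classify_exact(query_doc, tm_doc, qi, ti, n=1):
    """Classify query_doc[qi] against tm_doc[ti].

    Returns LOCAL_CONTEXT_EXACT, EXACT, or None when the text differs.
    """
    if query_doc[qi] != tm_doc[ti]:
        return None
    if context(query_doc, qi, n) == context(tm_doc, ti, n):
        return MatchType.LOCAL_CONTEXT_EXACT
    return MatchType.EXACT


query_doc = ["A", "B", "C"]
tm_doc = ["A", "B", "D"]
```

Here `classify_exact(query_doc, tm_doc, 1, 1)` yields a plain exact match, because the segment after "B" differs between the two documents, while index 0 keeps its full local context.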

Manual TM Queries and Match Types

These queries are usually performed manually to check translation consistency or for TM management tasks.

  • Wildcard - Search for segments and phrases using typical wildcard notation such as * and ?.

  • Boolean - Typical boolean information retrieval query such as "(cat AND dog) OR (feline AND canine)"

  • Proximity - queries that search for strings within a certain distance from each other. For example, "(dog AND cat)~3" = find all segments with dog and cat, but only if they are within 3 words of each other.

  • Concordance or Phrase - Search for fuzzy sub-strings within the segments of the TM.

  • Consistency - specialized queries that ensure consistent translation between segments or phrases across a corpus. (ask Dan for more detail)

  • SQL-like - similar to boolean but can filter on various metadata such as author, date, domain, project id etc. All queries can make use of SQL-like queries to filter irrelevant matches.

  • annotation - a special query which searches for segment comments or annotations rather than the segments themselves.
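
The wildcard and proximity queries above might behave roughly like this sketch. The function names are hypothetical. "Within 3 words" is interpreted here as a difference in word position of at most 3, which is an assumption. Python's `fnmatch` is used for the wildcard syntax; it also gives `[...]` a special meaning, which the notes above do not ask for.

```python
import re
from fnmatch import fnmatchcase


def wildcard_search(segments, pattern):
    """Return the segments that match a pattern using * and ? wildcards."""
    return [s for s in segments if fnmatchcase(s, pattern)]


def within_distance(segment, first, second, max_distance):
    """True if both words occur at most max_distance word positions apart."""
    words = re.findall(r"\w+", segment.lower())
    first_pos = [i for i, w in enumerate(words) if w == first.lower()]
    second_pos = [i for i, w in enumerate(words) if w == second.lower()]
    return any(
        abs(i - j) <= max_distance for i in first_pos for j in second_pos
    )


# Hypothetical TM content.
segments = [
    "The dog chased the cat.",
    "The dog slept while the neighbours fed the cat.",
    "A bird sang.",
]

near = [s for s in segments if within_distance(s, "dog", "cat", 3)]
```

Both of the first two segments contain "dog" and "cat", but only the first ends up in `near`, because the words are seven positions apart in the second.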

Query Match Filtering

All query results can be filtered to remove unneeded matches. Most filtering happens at the metadata level.

  • Duplicate Filter - A corpus-based TM will store many duplicates (source and target translations are the same). These duplicates can be removed if needed.

  • Metadata Filter - using an SQL-like query (but simplified for users), filter on all metadata types such as domain, project id, filename, creation date etc.

  • MAX Limit Filter - The number of returned hits can be filtered on a maximum, user configurable, number.

  • Threshold Filter - matches can be filtered based on fuzzy threshold with a user definable default of 75%.

  • Most Recent Filter - Older matches may be filtered.

  • Match Type Filter - only certain match types may be allowed (i.e., EXACT only)
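
Several of these filters can be combined in a single pass over the results, as in the sketch below. The `Match` record and the `filter_matches` function are hypothetical, and the sketch assumes the matches arrive already ordered from best to worst, so that the maximum limit keeps the best ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """Hypothetical query result."""
    source: str
    target: str
    score: int        # 0 to 100
    match_type: str   # e.g. "EXACT" or "FUZZY"


def filter_matches(matches, threshold=75, max_hits=None,
                   allowed_types=None, drop_duplicates=True):
    """Apply the threshold, match type, duplicate and MAX limit filters."""
    kept = []
    seen = set()
    for m in matches:
        if m.score < threshold:
            continue  # threshold filter
        if allowed_types is not None and m.match_type not in allowed_types:
            continue  # match type filter
        key = (m.source, m.target)
        if drop_duplicates and key in seen:
            continue  # duplicate filter: same source and same target
        seen.add(key)
        kept.append(m)
    # MAX limit filter is applied last, after everything else is removed.
    return kept if max_hits is None else kept[:max_hits]


matches = [
    Match("a", "x", 100, "EXACT"),
    Match("a", "x", 100, "EXACT"),
    Match("b", "y", 80, "FUZZY"),
    Match("c", "z", 60, "FUZZY"),
]
```

Applying the limit last is a design choice: limiting first could return fewer hits than requested even though enough acceptable matches exist.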

Query Match Ranking and Sorting

There is an implied notion of "better" matches. Better matches are preferred over worse matches and are always sorted to the top. "Better" usually corresponds to match type or other user defined criteria such as date and domain keywords. There may be primary, secondary or even tertiary sorts.

Below is a listing of some types of sorting that will be needed. How this all works together needs to be fleshed out.

  • Sort by Match Type - match types have a defined order from better to worse.

  • Document Order Sort - sort by the order in the original document.

  • Sort/Group by Metadata - some metadata may be preferred over others.

  • Fuzzy Score Sort - sort based on returned fuzzy score.

  • DATE Sort - newer dates are preferred over older dates.

  • Contextual Sort - matches from the same document are preferred over the same matches in different documents.

  • Hit Frequency Sort - matches with higher usage counts are preferred over those with lower counts.

  • Doc to Doc Score Sort - matches that come from documents with higher doc to doc scores are preferred over those with lower scores.

  • User defined Sort - the user defines a set of metadata that should be preferred over another.

  • Server vs Local TM Sort - a preference could be given to matches that come from the server (or local TM).

  • Completeness Sort - Matches are preferred if they have completed more workflow steps than another. For example, a match that has undergone review has a higher priority than one that has only recently been translated.

  • Version Sort - a certain version number may be preferred over another. Usually newer versions are preferred, but not always.
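
One possible way to combine primary, secondary and tertiary sorts is a composite sort key. In this sketch the `Hit` record, the `MATCH_TYPE_RANK` table (a subset of the match types) and the chosen order of criteria (match type, then fuzzy score, then date) are assumptions; how the sorts should really work together is still open.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical subset of match types; a lower rank is a better match.
MATCH_TYPE_RANK = {"ID": 0, "EXACT": 1, "FUZZY": 2}


@dataclass
class Hit:
    """Hypothetical query result with the fields used for sorting."""
    text: str
    match_type: str
    score: int
    modified: date


def rank(hits):
    """Sort by match type, then by higher score, then by newer date."""
    return sorted(
        hits,
        key=lambda h: (
            MATCH_TYPE_RANK[h.match_type],  # primary: match type
            -h.score,                       # secondary: fuzzy score
            -h.modified.toordinal(),        # tertiary: newest first
        ),
    )


hits = [
    Hit("old fuzzy", "FUZZY", 90, date(2009, 1, 1)),
    Hit("new fuzzy", "FUZZY", 90, date(2010, 1, 1)),
    Hit("exact", "EXACT", 100, date(2008, 1, 1)),
    Hit("weak fuzzy", "FUZZY", 80, date(2010, 6, 1)),
]

ordered = [h.text for h in rank(hits)]
```

A user-defined sort could be supported by letting the caller supply the list of key functions instead of hard-coding the tuple.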

TM Metadata

There is no limit to the kinds of metadata the TM can have. The list here is only a sampling.

All metadata must be typed (integer, float, date, string etc.) so that queries can do the proper type checking.

  • Project ID - assigned to each document or TU/TUV - a globally unique id which represents the project.

  • File Path - having a file path can sometimes be useful in context based searches.

  • ALL TMX Metadata - original translator, creation date, change date (see TMX spec).

  • Customer ID - an ID which helps organize the TM based on customers.

  • Domain Keywords - Keywords that apply to the domain of any document, TU or TUV. For example, "web", "plumbing", "legal" etc.

  • annotations - general Okapi annotations of any type.
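
Typed metadata without a fixed schema could work as in the sketch below: a field takes the type of the first value stored under its name, and later values are checked against it. The `MetadataRegistry` class and the field names are hypothetical, and fixing the type on first use is only one possible policy.

```python
from datetime import date


class MetadataRegistry:
    """Remember the type of each metadata field and enforce it afterwards."""

    def __init__(self):
        self._types = {}  # field name -> Python type

    def validate(self, name, value):
        """Return value if its type fits the field, else raise TypeError."""
        # An unknown field is registered with the type of this first value.
        expected = self._types.setdefault(name, type(value))
        if type(value) is not expected:
            raise TypeError(
                f"{name!r} expects {expected.__name__}, "
                f"got {type(value).__name__}"
            )
        return value

    def type_of(self, name):
        """Return the registered type of a field, or None if unknown."""
        return self._types.get(name)


registry = MetadataRegistry()
registry.validate("creation_date", date(2010, 1, 1))
registry.validate("project_id", "P-42")
```

After this, `registry.validate("creation_date", "2010-01-01")` raises a `TypeError`, which is the kind of check a query would need before comparing a date field with a string.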
