Okapi / ProposedArchitectureBasedOnXQueryXUpdate

Introduction

Localization tools have many kinds of users. At one extreme are freelancers who may have limited computing resources; at the other are large companies that run L10N workflows on large servers and automate much of their work. In between are the LSPs, which may have access to more hardware but not necessarily a beefed-up server.

Yves has pointed out that the tools must be stream-based so that large documents are never loaded fully into memory. For some components this strategy will work fine, but other components (including the workbench?) need either a local context or the full context (in-memory resources).

How do we design tools that fit all of these use cases?

I believe the answer is an embeddable native XML database driven by open standards such as XQuery (W3C) and XUpdate (XML:DB Initiative).

XML gives us greater flexibility than POJOs, as Asgeir pointed out. With a rich XML-based resource format we can easily add any needed extensions. The XML can be processed rapidly in low-memory environments by using XQuery to iterate over chunks of the resource.
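To make the chunk-by-chunk idea concrete, here is a minimal sketch in Java. It assumes a hypothetical resource layout (a `<resource>` element holding small `<unit>` chunks, each with a `<source>` child); the proposal does not define the actual format. Because the JDK ships no XQuery engine, the sketch uses the JDK's StAX pull parser as a stand-in for what an XQuery expression such as `for $u in //unit return $u/source` would express against an XML database. Only one unit is held in memory at a time.

```java
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ChunkIterationSketch {

    // Hypothetical resource format: the element and attribute names are
    // assumptions made for this example only.
    static final String SAMPLE =
        "<resource>"
        + "<unit id='1'><source>Hello</source></unit>"
        + "<unit id='2'><source>World</source></unit>"
        + "</resource>";

    // Hands the text of each <source> element to the handler as soon as it
    // is read, and returns the number of elements seen. The document is
    // never materialized as a tree.
    static int forEachSource(String xml, Consumer<String> handler) {
        int count = 0;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "source".equals(reader.getLocalName())) {
                        // getElementText() consumes up to the matching end tag.
                        handler.accept(reader.getElementText());
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed resource XML", e);
        }
        return count;
    }

    public static void main(String[] args) {
        int n = forEachSource(SAMPLE, text -> System.out.println("source: " + text));
        System.out.println(n + " units processed");
    }
}
```

With a real XML database the loop body would stay the same; only the source of the units (a query result cursor instead of a parser) would change.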

What XML DB and XQuery can do for us

Ways in which we can use the XML Database:

  • Filters: Filters produce a stream of atomic units (IExtractionItem). These units are stored directly in the XML database as separate XML chunks (with all the metadata needed to rebuild a complete resource). Components that need full context can then read the resource from the XML DB. An interface, IResourceReader, could be used to wrap the details of the XQuery code. Components that don't need full context can (1) read the atomic units directly from the filter or (2) read the units from the XML DB. Different implementations of IResourceBuilder could be used to hide the details of the XQuery from the components downstream.

  • Workbench: The commercial oXygen XML editor allows loading, editing and saving XML documents directly from an XML DB. Testing with medium-sized files and Berkeley DB XML showed that this process is fast, as fast as direct disk access. The translation workbench could work with atomic units (again IExtractionItem) in the XML DB. Updates and loading are fast, and files of any size could be handled.

  • Output Formats and t-kit creation: With an XML-based resource in the XML DB we could write simple XQuery scripts to directly produce XLIFF, TMX, Wordfast, PO or any other output format we need. XQuery is ideally suited to this kind of processing and conversion.

  • Context Preview: XQuery scripts can be written to produce a full or partial context preview in whatever formats we prefer (HTML, DOCX, RTF, etc.).

  • TM: We could store our TM using the same resource format produced by the filters. A Lucene index would store not only a document ID but also XPath expressions to pull out the needed segments and larger contextual units. The TM is now the same as our filtered resource repository! We will be able to reuse large amounts of code to handle our TM updates, search, etc. The XML repository becomes more than a TM - it is now a full document database that gives corpus-based features (e.g., perfect TM match, full-context concordance, etc.).

  • Term Database: Terms can be stored as TBX++ units (if indeed we need an enhanced TBX). XQuery is used to assemble glossaries or export to full TBX. The same Lucene indexing techniques used for the TM can be used for the terminology. Links to the original resources can be added to terms so that example passages can be easily retrieved along with the typical term metadata.

  • Storing other artifacts: Most XML databases can store not only XML files but also binary blobs. We can store our original documents in the database along with the extracted "resource" version. T-kits could be stored as well. All blobs can have attached metadata and can be searched just like the XML.

  • Porting to Server XML databases: By writing much of our resource, TM, t-kit and term handling logic in XQuery, we give customers the choice of using their own company's XML database in order to scale. XQuery will probably outlive Python, Java and other specialized languages (just look at SQL).

  • Avoid resource-expensive ORMs: With everything as XML we don't have to worry about using heavy object-relational mapping tools such as Hibernate. Nor do we have to worry about ugly and verbose SQL+JDBC. We stick to the standards: XQuery, XUpdate, XSLT, XPath, etc.
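The Filters item above could be sketched roughly as follows. Everything here is an assumption made for illustration: the proposal names IResourceReader but does not define its methods, so `readUnit` and `readResource` are hypothetical, as is the `<unit>`/`<source>` chunk format. An in-memory map stands in for the XML database, and the JDK's XPath API stands in for XQuery, since the JDK has no XQuery engine.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical shape for the interface named in the proposal.
interface IResourceReader {
    String readUnit(String unitId); // local context: one chunk
    String readResource();          // full context: all chunks reassembled
}

// Stand-in for an embedded XML database: each unit is kept as its own chunk.
class InMemoryChunkStore implements IResourceReader {
    private final Map<String, String> chunks = new LinkedHashMap<>();
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    // Called by a filter for every atomic unit it emits.
    void store(String unitXml) {
        chunks.put(evaluate("/unit/@id", unitXml), unitXml);
    }

    @Override
    public String readUnit(String unitId) {
        return chunks.get(unitId);
    }

    @Override
    public String readResource() {
        StringBuilder sb = new StringBuilder("<resource>");
        for (String chunk : chunks.values()) {
            sb.append(chunk);
        }
        return sb.append("</resource>").toString();
    }

    // Callers never see the query language; it stays behind this method.
    String sourceText(String unitId) {
        return evaluate("/unit/source", chunks.get(unitId));
    }

    private String evaluate(String expression, String xml) {
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Query failed: " + expression, e);
        }
    }
}

class ResourceReaderSketch {
    public static void main(String[] args) {
        InMemoryChunkStore store = new InMemoryChunkStore();
        store.store("<unit id='u1'><source>Hello</source></unit>");
        store.store("<unit id='u2'><source>World</source></unit>");
        System.out.println(store.sourceText("u2"));
        System.out.println(store.readResource());
    }
}
```

A component that needs full context would call `readResource()`, while one that works unit by unit would call `readUnit()`; swapping the map for a real XML database should not change either caller.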

There are probably many more items we could add to the list. I don't currently see a downside to this - but that is why I'm posting :-)
