Tikal and Rainbow when used in command line produce escaped characters in the converted XLF

Issue #1355 invalid
Matija Kovač created an issue

Hi,
I’m building a Python app around the Okapi Framework in order to be able to implement it in my project.
I first set up a script that uses Tikal to convert files to XLF, only to find that the converted output escapes the “<” character on some nested elements, but not all.
I thought it might just be an issue with Tikal, so I added Rainbow to my app too, but the output is the same (in hindsight, of course it is: they both use the same OpenXML filter utility).

My test .docx file contains some inline formatting to exercise the conversion process: some of the words are bold, italic, etc.
As expected, this results in embedded subelements of the <source> element in the translation units. All good so far.

However, it looks like the <run> element is always returned as “&lt;run1>”, with the left “<” escaped while the right “>” is not.
I can of course fix this in Python when storing the file, but then the file refuses to merge back.

I also tried with HTML files, where the escaped tags appear in the <sup> element, since there is no <run> element.
<source xml:lang="en">This text is <bpt id="1">&lt;b></bpt>bold<ept id="1">&lt;/b></ept>.</source>

This inconsistent handling of embedded subelements, or rather of their tag representation, causes problems down the line, for example when positioning tags through the MT process.

Is this expected behavior? If yes, may I know why, and how I am expected to deal with it?
If not, how can it be fixed?

Here’s an example of my code, using Rainbow with the TranslationKitCreation:

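import os
import subprocess
import uuid

from flask import current_app, jsonify, request

# is_extension_supported() is a helper defined elsewhere in the app (not shown here).
def rainbow_convert_to_xlf():
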
    if 'file' not in request.files:
        return jsonify({'error': 'No file part in the request'}), 400

    file = request.files['file']
    if file.filename == '':
        return jsonify({'error': 'No file selected'}), 400

    if not is_extension_supported(file.filename):
        return jsonify({'error': 'Unsupported file extension'}), 400

    source_lang = request.form.get('source_lang', 'en')  # Default to English if not provided
    target_lang = request.form.get('target_lang', 'en')  # Default to English if not provided

    folder_name = str(uuid.uuid4())
    upload_folder_path = os.path.join(current_app.config['UPLOAD_FOLDER'], folder_name)
    os.makedirs(upload_folder_path, exist_ok=True)
    input_file_path = os.path.join(upload_folder_path, file.filename)
    file.save(input_file_path)

    xlf_filename = file.filename + '.xlf'
    output_file_path = os.path.join(upload_folder_path, xlf_filename)

    rainbow_command = [
        './rainbow/rainbow.sh',
        '-x', 'TranslationKitCreation',
        '-sl', source_lang,
        '-tl', target_lang,
        input_file_path,
        '-o', output_file_path
    ]

    try:
        result = subprocess.run(rainbow_command, check=True, capture_output=True, text=True)
        print("Rainbow Command Output:", result.stdout)  # Debug output stdout
        print("Rainbow Command Error:", result.stderr)  # Debug output stderr
        return jsonify({
            'message': 'File converted to XLF successfully',
            'xlf_file_url': f'http://{request.host}/get/{folder_name}/{xlf_filename}'
        }), 200
    except subprocess.CalledProcessError as e:
        current_app.logger.error(f"Rainbow CLI failed: {e.stderr}")
        return jsonify({
            'error': 'Failed to convert file',
            'message': e.stderr
        }), 500

Many thanks for building all these amazing tools and for helping me out with this issue.

Comments (2)

  1. jhargrave-straker

    The XML produced is valid XML. It’s not necessary to escape “>”, so this is correct behavior. If you read the content with an XML parser before you send it to MT, it should give you the correct text. Okapi uses an internal non-XML representation (TextFragment); XLIFF is just one possible output. You can take a look at the Okapi MT connectors and how they use TextFragment to get some ideas on how to post-process the XLIFF.
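
To illustrate the point about reading the content with an XML parser, here is a minimal sketch using Python's standard `xml.etree.ElementTree` on the `<source>` line quoted in the report. The helper names `inline_codes` and `plain_text` are made up for this example, and a full XLIFF file would also need its namespace handled, which this fragment does not have.

```python
import xml.etree.ElementTree as ET

# The <source> line from the report: only "<" is escaped inside <bpt>/<ept>.
SOURCE = (
    '<source xml:lang="en">This text is '
    '<bpt id="1">&lt;b></bpt>bold<ept id="1">&lt;/b></ept>.</source>'
)

# The same line with ">" escaped as well, for comparison.
FULLY_ESCAPED = SOURCE.replace("&lt;b>", "&lt;b&gt;").replace("&lt;/b>", "&lt;/b&gt;")


def inline_codes(source_xml):
    """Return (tag, id, decoded content) for each inline child element."""
    root = ET.fromstring(source_xml)
    return [(child.tag, child.get("id"), child.text) for child in root]


def plain_text(source_xml):
    """Return the translatable text with the inline codes dropped."""
    root = ET.fromstring(source_xml)
    parts = [root.text or ""]
    for child in root:
        parts.append(child.tail or "")
    return "".join(parts)


print(inline_codes(SOURCE))
print(plain_text(SOURCE))
print(inline_codes(SOURCE) == inline_codes(FULLY_ESCAPED))
```

After parsing, the content of `<bpt>` is the literal string `<b>` whether or not the `>` was escaped in the file, so the asymmetric escaping only exists in the serialized form.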

  2. Matija Kovač reporter

    Thanks for getting back to me. This is unfortunate, as I can’t really see any benefit in not using XML here, and I can see that the underlying code relies on workarounds that escape the text to HTML-encoded form and un-escape it again when dealing with MT integrations.
    I noticed that a workaround would be to use the <g id> notation, as it’s also one of the options in the Okapi client app.
    This works but I’m having issues invoking the custom pipeline via command line. I’ll post a separate issue regarding that, but thanks anyway for your reply.
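
As a rough sketch of why the `<g id>` notation can be easier to carry through MT: the inline codes become real child elements, so they can be walked and replaced with placeholders. The sample segment and the `[1]...[/1]` placeholder scheme below are assumptions made for illustration, not actual Okapi output or an Okapi convention.

```python
import xml.etree.ElementTree as ET

# Assumed shape of a segment using <g> inline notation (illustrative only).
SAMPLE = '<source xml:lang="en">This text is <g id="1">bold</g>.</source>'


def flatten(elem):
    """Flatten mixed content, wrapping each child's text in [id]...[/id]."""
    parts = [elem.text or ""]
    for child in elem:
        code_id = child.get("id")
        # Recurse so nested inline elements keep their own placeholders.
        parts.append(f"[{code_id}]{flatten(child)}[/{code_id}]")
        parts.append(child.tail or "")
    return "".join(parts)


print(flatten(ET.fromstring(SAMPLE)))
```

Mapping the placeholders back to `<g>` elements after translation is left out here; it would depend on how the MT engine treats the markers.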
