XML Stream Filter: Characters in code finder rules unescaped in output.

Issue #921 open
Dale Eggett created an issue

Tested in M38 and 1.39.0.

If a “special” character, e.g. the less-than symbol (<), is found inside text matched by a code finder rule, it is left unescaped in the output.

Source text:

Translated text:

Attaching Rainbow kits used to test, as well as a simple filter with code finder rules and the source XML file.

Comments (8)

  1. Jim Hargrave (OLD)

    The problem is that the code needed to make this work properly was commented out to fix another bug. I’m going to uncomment it and test to see what else breaks.

    // TODO: This must be put back to fix issue 431
    // but the encoder is null in several test cases
    List<Code> codes = text.getCodes();
    for ( Code code : codes ) {
        // Escape the data of the new inline code (and only them)
        if ( code.getType().equals(InlineCodeFinder.TAGTYPE) ) {
            //code.setData(encoder.encode(code.getData(), EncoderContext.SKELETON));
            code.setData(encoderManager.encode(code.getData(), EncoderContext.INLINE));
        }
    }
    

  2. Jim Hargrave (OLD)

    @Brad Ross @Dale Eggett As an update, I now understand the issue better. When we use the regex-based code finder rules, we need to escape the result using the appropriate filter/subfilter encoding. That is easy enough to do by adding the code above. However, there is one exception: if our subfilter is processing a CDATA section in an XML file, we do not want any further escaping. At the point in the code where we do the escaping, I can’t distinguish between text units that come from CDATA sections and other cases, so we blindly apply the escaping even where we don’t want it. This is what causes the two failing unit tests. I’m trying to figure out the best way to add the above code without applying it in the cases where it is not appropriate.
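
    The conflict described above can be illustrated with a small standalone sketch. It is a simplified model, not Okapi code: `CodeEscapeSketch`, `escapeXml`, `asElementContent`, `asCdataContent` and the `<seg>` element are all hypothetical names chosen for the example, and `<br>` stands in for whatever a code finder rule happens to match.

    ```java
    // Simplified model of the escaping dilemma; none of these names are Okapi APIs.
    class CodeEscapeSketch {

        // Minimal XML escaping, standing in for the filter's encoder.
        static String escapeXml(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }

        // Writes inline-code data back as ordinary element content.
        static String asElementContent(String data) {
            return "<seg>" + data + "</seg>";
        }

        // Writes inline-code data back inside a CDATA section.
        static String asCdataContent(String data) {
            return "<seg><![CDATA[" + data + "]]></seg>";
        }

        public static void main(String[] args) {
            String code = "<br>"; // text matched by a hypothetical code finder rule

            // No escaping: the matched text leaks into the XML as markup.
            System.out.println(asElementContent(code));

            // Escaping applied: correct for ordinary element content.
            System.out.println(asElementContent(escapeXml(code)));

            // The same escaping inside CDATA changes the literal content.
            System.out.println(asCdataContent(escapeXml(code)));
        }
    }
    ```

    In this model the first line printed shows the symptom reported in this issue, the second shows the intended result, and the third shows why applying the same escaping unconditionally is wrong for content that came from a CDATA section.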

  3. Jim Hargrave (OLD)

    @Brad Ross I discussed this with the Okapi team. They agreed that the only way to solve this is to add a means of telling subfilters whether the original content should be escaped or not (i.e. CDATA shouldn’t be escaped, normal HTML should). How to do this in practice is still not clear, as I would like to do it in a generic way that works across the framework. I’m still researching and will try some prototypes to come up with a solution.
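
    One possible shape for such a mechanism is sketched below, purely as an assumption about what it could look like. `ParentContext`, `InlineCode` and `encodeCodeFinderCodes` are hypothetical and do not exist in Okapi; the loop mirrors the one quoted in comment 1, with a guard driven by a value the parent filter would pass down to the subfilter.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: the parent filter tells the subfilter how its content was embedded.
    class EscapeFlagSketch {

        // Assumed flag values: escaped element text versus literal CDATA content.
        enum ParentContext { ESCAPED_TEXT, CDATA }

        // Stand-in for an inline code; fromCodeFinder marks codes created by code finder rules.
        static final class InlineCode {
            String data;
            final boolean fromCodeFinder;

            InlineCode(String data, boolean fromCodeFinder) {
                this.data = data;
                this.fromCodeFinder = fromCodeFinder;
            }
        }

        // Minimal XML escaping, standing in for the filter's encoder.
        static String escapeXml(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }

        // Escapes only code-finder codes, and only when the parent content was escaped text.
        static void encodeCodeFinderCodes(List<InlineCode> codes, ParentContext parent) {
            if (parent == ParentContext.CDATA) {
                return; // CDATA content is literal, so leave the code data untouched
            }
            for (InlineCode code : codes) {
                if (code.fromCodeFinder) {
                    code.data = escapeXml(code.data);
                }
            }
        }

        public static void main(String[] args) {
            List<InlineCode> codes = new ArrayList<>();
            codes.add(new InlineCode("<br>", true));
            encodeCodeFinderCodes(codes, ParentContext.ESCAPED_TEXT);
            System.out.println(codes.get(0).data);
        }
    }
    ```

    The sketch keeps the decision in one place, but it leaves open the harder question raised above: how the flag would travel from a parent filter to a subfilter in a generic way.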

  4. Jim Hargrave (OLD)

    @Brad Ross @Dale Eggett @YvesS @Chase Tingley @Mihai Nita This is going to be very difficult to fix, as the issue is deep in the subfilter. Even if we do fix it, I worry about backwards compatibility: target files and any saved TM will look different with the “fixed” code. It’s tough to think through all the cases, and the code changes will almost certainly invite new bugs.

    Can you tell me the priority level for this? If it’s high I can rethink the strategy and maybe come up with a new approach.

    Note this is a public repo so don’t disclose any private details.
