PatternSyntaxException in MIF filter

Issue #965 resolved
Alessandro Falappa created an issue

While converting a MIF file I got the following exception:

java.util.regex.PatternSyntaxException: Illegal repetition near index 529
(^[A-Z]{1}:)|()|(\\t)|(<[naArR ]{1}[+]*\>)|(<[naArR]{1}=[0-9]+\>)|(<\$.*?>)|(<Default  Font\>)|(<(zenkaku|kanji|full-width|chinese|Indic|Farsi|Hebrew|Abjad|Alif Ba Ta|Thai) [naA]{1}[+]*\>)|(<(zenkaku|kanji|full-width|chinese|Indic|Farsi|Hebrew|Abjad|Alif Ba Ta|Thai) [naA]{1}=[0-9]+\>)|(<(kanji kazu|daiji|hira iroha|kata iroha|hira gojuon|kata gojuon)[+]*\>)|(<(kanji kazu|daiji|hira iroha|kata iroha|hira gojuon|kata gojuon)=[0-9]+\>)|(<Superscript\>)|(<Fixed_Font\>)|(<Regular\>)|(<link\>)|(<Italic\>)|(<Bold\>)|(<Menue\>)|(<{Wingdings_3}\>)|(<{Symbol}\>)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ^
        at java.util.regex.Pattern.error(Pattern.java:1969)
        at java.util.regex.Pattern.closure(Pattern.java:3171)
        at java.util.regex.Pattern.sequence(Pattern.java:2148)
        at java.util.regex.Pattern.expr(Pattern.java:2010)
        at java.util.regex.Pattern.group0(Pattern.java:2919)
        at java.util.regex.Pattern.sequence(Pattern.java:2065)
        at java.util.regex.Pattern.expr(Pattern.java:2010)
        at java.util.regex.Pattern.compile(Pattern.java:1702)
        at java.util.regex.Pattern.<init>(Pattern.java:1352)
        at java.util.regex.Pattern.compile(Pattern.java:1054)
        at net.sf.okapi.common.filters.InlineCodeFinder.compile(InlineCodeFinder.java:146)
        at net.sf.okapi.filters.mif.MIFFilter.open(MIFFilter.java:256)
        at net.sf.okapi.filters.mif.MIFFilter.open(MIFFilter.java:197)
        at net.sf.okapi.steps.common.RawDocumentToFilterEventsStep.handleEvent(RawDocumentToFilterEventsStep.java:132)
        at net.sf.okapi.common.pipeline.Pipeline.execute(Pipeline.java:117)
        at net.sf.okapi.common.pipeline.Pipeline.process(Pipeline.java:227)
        at net.sf.okapi.common.pipeline.Pipeline.process(Pipeline.java:199)
        at net.sf.okapi.common.pipelinedriver.PipelineDriver.processBatch(PipelineDriver.java:182)

Unfortunately I cannot disclose the file, but the stack trace appears to point to the offending code.

Comments (8)

  1. Alessandro Falappa reporter

    The filter was created as follows:

    MIFFilter filter = new MIFFilter();
    net.sf.okapi.filters.mif.Parameters params = filter.getParameters();
    params.setExtractBodyPages(true);
    params.setExtractHiddenPages(false);
    params.setExtractIndexMarkers(false);
    params.setExtractLinks(false);
    params.setExtractMasterPages(false);
    params.setExtractPgfNumFormatsInline(false);
    params.setExtractReferencePages(false);
    params.setExtractVariables(false);
    

    I haven’t explicitly enabled code finder rules.

  2. Alessandro Falappa reporter

    What puzzles me is that, looking at the code, I cannot tell where the {Wingdings_3} rule comes from.

    The last two rules should have their braces escaped: \\{Wingdings_3\\} and \\{Symbol\\}.

  3. Alessandro Falappa reporter

    Found the culprit.

    The problem is in net.sf.okapi.filters.mif.FontTags#toInlineCodeFinderRules() where font names are used to build code finder rules.

    Apparently font names may contain reserved regular expression characters (here, the braces in {Wingdings_3} and {Symbol}), and they end up in the rules unescaped.

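The failure described in the comments above can be reproduced in isolation, and one possible way to avoid it is to quote each font name before placing it in a rule. The sketch below is only an illustration: the class FontRuleSketch and the helpers toRule and compiles are hypothetical names, not part of Okapi, and the actual code in FontTags#toInlineCodeFinderRules() may be structured differently. It relies only on java.util.regex.Pattern.quote, which wraps its argument in \Q...\E so that every character is matched literally.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class FontRuleSketch {

    // Hypothetical helper (not Okapi API): builds a rule of the same shape as
    // the ones in the failing expression, (<NAME\>), but with the font name
    // quoted so that regex metacharacters in it are treated as literals.
    static String toRule(String fontName) {
        return "(<" + Pattern.quote(fontName) + "\\>)";
    }

    // Returns true if the given expression compiles, false if it raises
    // a PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The rule as it appears at the end of the reported expression:
        // the unescaped '{' is parsed as the start of a repetition.
        String unsafe = "(<{Symbol}\\>)";
        System.out.println("unsafe rule compiles: " + compiles(unsafe));

        // The same rule with the font name quoted.
        String safe = toRule("{Symbol}");
        System.out.println("quoted rule compiles: " + compiles(safe));
        System.out.println("quoted rule matches <{Symbol}>: "
                + Pattern.compile(safe).matcher("<{Symbol}>").find());
    }
}
```

Quoting the whole name, rather than escaping only braces, also covers other reserved characters a font name might contain, such as parentheses, brackets, or a plus sign.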