openxml (docx) crashing with javax.xml.stream.XMLStreamException: Maximum attribute size limit (2097152) exceeded

Issue #974 resolved
Jim Hargrave (OLD) created an issue

I’ll try to get a sample file to reproduce the problem. We did find an attribute with an obscenely long value. Strange that the file works with M38/M39. Did we change the XML processor version used by the OpenXML filter? If so, the newer version may have stricter error checking for these pathological cases.

For now here is the stack trace:

Caused by: javax.xml.stream.XMLStreamException: Maximum attribute size limit (2097152) exceeded
    at com.ctc.wstx.sr.StreamScanner.constructLimitViolation(StreamScanner.java:2483) ~[woodstox-core-6.1.1.jar:6.1.1]
    at com.ctc.wstx.sr.StreamScanner.verifyLimit(StreamScanner.java:2476) ~[woodstox-core-6.1.1.jar:6.1.1]
    at com.ctc.wstx.sr.BasicStreamReader._checkAttributeLimit(BasicStreamReader.java:2053) ~[woodstox-core-6.1.1.jar:6.1.1]
    at com.ctc.wstx.sr.BasicStreamReader.parseAttrValue(BasicStreamReader.java:2038) ~[woodstox-core-6.1.1.jar:6.1.1]
    at com.ctc.wstx.sr.BasicStreamReader.handleNsAttrs(BasicStreamReader.java:3144) ~[woodstox-core-6.1.1.jar:6.1.1]
    at com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:3042) ~[woodstox-core-6.1.1.jar:6.1.1]
    at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2920) ~[woodstox-core-6.1.1.jar:6.1.1]
    at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1122) ~[woodstox-core-6.1.1.jar:6.1.1]
    at com.ctc.wstx.evt.WstxEventReader.nextEvent(WstxEventReader.java:283) ~[woodstox-core-6.1.1.jar:6.1.1]
    at net.sf.okapi.filters.openxml.PrioritisedXMLEventReader.nextEvent(PrioritisedXMLEventReader.java:57) ~[okapi-filter-openxml-1.40.0.jar:na]
    at net.sf.okapi.filters.openxml.SkippableElements$Default.skip(SkippableElements.java:131) ~[okapi-filter-openxml-1.40.0.jar:na]
    at net.sf.okapi.filters.openxml.SkippableElements$Inline.skip(SkippableElements.java:168) ~[okapi-filter-openxml-1.40.0.jar:na]
    at net.sf.okapi.filters.openxml.RunSkippableElements.skip(RunSkippableElements.java:76) ~[okapi-filter-openxml-1.40.0.jar:na]
    at net.sf.okapi.filters.openxml.RunParser.parseSkippableElements(RunParser.java:416) ~[okapi-filter-openxml-1.40.0.jar:na]
    at net.sf.okapi.filters.openxml.RunParser.startRunParsing(RunParser.java:196) ~[okapi-filter-openxml-1.40.0.jar:na]
    at net.sf.okapi.filters.openxml.RunParser.parse(RunParser.java:165) ~[okapi-filter-openxml-1.40.0.jar:na]
    at net.sf.okapi.filters.openxml.BlockParser.processRun(BlockParser.java:297) ~[okapi-filter-openxml-1.40.0.jar:na]
    at net.sf.okapi.filters.openxml.BlockParser.parse(BlockParser.java:230) ~[okapi-filter-openxml-1.40.0.jar:na]
    at net.sf.okapi.filters.openxml.StyledTextPart.process(StyledTextPart.java:241) ~[okapi-filter-openxml-1.40.0.jar:na]
    at net.sf.okapi.filters.openxml.StyledTextPart.open(StyledTextPart.java:207) ~[okapi-filter-openxml-1.40.0.jar:na]
    at net.sf.okapi.filters.openxml.StyledTextPart.open(StyledTextPart.java:129) ~[okapi-filter-openxml-1.40.0.jar:na]
    at net.sf.okapi.filters.openxml.OpenXMLFilter.nextInDocument(OpenXMLFilter.java:446) ~[okapi-filter-openxml-1.40.0.jar:na]
    at net.sf.okapi.filters.openxml.OpenXMLFilter.next(OpenXMLFilter.java:256) ~[okapi-filter-openxml-1.40.0.jar:na]
    ... 49 common frames omitted

Comments (18)

  1. Jim Hargrave (OLD) reporter

    In "word/document.xml", there is a <v:group> element with an “o:gfxdata” attribute that is extremely long: 3576823 characters with the XML escape sequences left as-is, and 3400187 characters after unescaping them.
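For anyone who wants to find such attributes in their own files, here is a minimal sketch using only the JDK's StAX API. The class and method names are made up for illustration. It reports the unescaped length (an `&amp;` counts as one character here, but as five in the raw file). If Woodstox is the StAX implementation on the classpath, this scan may run into the same attribute size limit on a real document.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Okapi.
final class LongestAttribute {

    /** Returns the length of the longest (unescaped) attribute value in the given XML text. */
    static int longestAttributeLength(String xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            int longest = 0;
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // Attribute values are reported with escape sequences already resolved.
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            longest = Math.max(longest, reader.getAttributeValue(i).length());
                        }
                    }
                }
            } finally {
                reader.close();
            }
            return longest;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<a x=\"12345\"><b z=\"&amp;&amp;&amp;\"/></a>";
        System.out.println("Longest attribute value: " + longestAttributeLength(sample));
    }
}
```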

  2. Chase Tingley

    I could have sworn we’d fixed this before, but I think it was in the IDML filter – see commit b15d9327e.

    I don’t remember intentionally changing the XML parser, but we’ve had problems before where the one we’re using changes underneath us because of SPI discovery/classpath issues. That might have happened here.
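A quick way to see which StAX implementation SPI discovery actually selects on a given classpath is sketched below. It uses only the JDK. The class name is made up, and the check for the `com.ctc.wstx.` package prefix is based on the Woodstox packages visible in the stack trace above. This shows what `XMLInputFactory.newInstance()` resolves to, which is not necessarily how the filter obtains its factory.

```java
import javax.xml.stream.XMLInputFactory;

// Hypothetical diagnostic, not part of Okapi.
final class StaxImplementationCheck {

    /** Returns the class name of the XMLInputFactory that SPI discovery selected. */
    static String inputFactoryClassName() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    /** True when the factory class lives in the Woodstox package (com.ctc.wstx). */
    static boolean isWoodstox(String factoryClassName) {
        return factoryClassName.startsWith("com.ctc.wstx.");
    }

    public static void main(String[] args) {
        String name = inputFactoryClassName();
        System.out.println("XMLInputFactory implementation: " + name);
        System.out.println("Looks like Woodstox: " + isWoodstox(name));
    }
}
```

Running this in the same environment as the failing extraction should show whether a different parser has been picked up.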

  3. Jim Hargrave (OLD) reporter

    @Chase Tingley I saw that the Woodstox XML processor version was bumped a while back. Is it possible that Woodstox added this check? I’ll tell the team about the classpath issues and investigate on our side.

  4. Denis Konovalyenko

    @Jim Hargrave (OLD) , there is net.sf.okapi.filters.openxml.OpenXMLFilter#MAX_ATTRIBUTE_SIZE constant (2 * 1024 * 1024), which affects the maximum allowed attribute size, as far as I can see:

            if (inputFactory.isPropertySupported(WstxInputProperties.P_MAX_ATTRIBUTE_SIZE)) {
                inputFactory.setProperty(WstxInputProperties.P_MAX_ATTRIBUTE_SIZE, MAX_ATTRIBUTE_SIZE);
            }
    

    I think the best solution would be to mirror the IDMLFilter behaviour, where this value comes from the filter parameters, as @Chase Tingley mentioned before.
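A minimal sketch of that idea, taking the limit as an argument instead of a constant. The class and method names are hypothetical, and this is not the actual OpenXMLFilter code. The property name string is, to my understanding, the value behind `WstxInputProperties.P_MAX_ATTRIBUTE_SIZE`; it is spelled out here so the sketch compiles with the JDK alone. Rejecting non-positive values is an assumption of this sketch, not documented Woodstox behaviour.

```java
import javax.xml.stream.XMLInputFactory;

// Hypothetical helper, not part of Okapi.
final class AttributeLimitConfig {

    // Assumed to match WstxInputProperties.P_MAX_ATTRIBUTE_SIZE.
    static final String WSTX_MAX_ATTRIBUTE_SIZE = "com.ctc.wstx.maxAttributeSize";

    /**
     * Applies the given limit if the factory supports the Woodstox property.
     * Returns true when the property was set, false when the factory ignores it.
     */
    static boolean applyMaxAttributeSize(XMLInputFactory factory, int maxAttributeSize) {
        if (maxAttributeSize <= 0) {
            throw new IllegalArgumentException("maxAttributeSize must be positive");
        }
        if (!factory.isPropertySupported(WSTX_MAX_ATTRIBUTE_SIZE)) {
            return false; // e.g. the JDK's built-in StAX implementation
        }
        factory.setProperty(WSTX_MAX_ATTRIBUTE_SIZE, maxAttributeSize);
        return true;
    }

    public static void main(String[] args) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // In a filter, this value would come from the filter parameters.
        boolean applied = applyMaxAttributeSize(factory, 4 * 1024 * 1024);
        System.out.println("Limit applied: " + applied);
    }
}
```

Returning a boolean makes it visible when the property is silently skipped because a different StAX implementation is in use.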

  5. Jim Hargrave (OLD) reporter

    @Chase Tingley @Denis Konovalyenko would everyone be OK if we simply increased the limit? Would this work: (3 * 1024 * 1024)?

  6. Denis Konovalyenko

    @Jim Hargrave (OLD), it seems to me that it would not work for your case: 3 * 1024 * 1024 = 3145728, while the original attribute value was 3576823 characters long. I would recommend at least 4 MiB (4 * 1024 * 1024 = 4194304).
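The arithmetic can be checked with a few lines of Java. The class and method names are made up for illustration; the helper rounds an observed attribute length up to the next whole multiple of 1 MiB.

```java
// Hypothetical helper, not part of Okapi.
final class AttributeLimitArithmetic {

    static final long MIB = 1024L * 1024L;

    /** Rounds the given length up to the next whole multiple of 1 MiB. */
    static long smallestMiBLimitFor(long attributeLength) {
        return ((attributeLength + MIB - 1) / MIB) * MIB;
    }

    public static void main(String[] args) {
        long observed = 3576823L; // attribute length reported in this issue
        System.out.println("3 MiB is enough: " + (3 * MIB >= observed));
        System.out.println("Smallest MiB-aligned limit: " + smallestMiBLimitFor(observed));
    }
}
```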

  7. Devesh kumar

    @Jim Hargrave (OLD) @Denis Konovalyenko

    I am still getting the error below with version 1.41.0, which was released after this fix (Issue #974):

    javax.xml.stream.XMLStreamException: Maximum attribute size limit (4194304) exceeded

    Please help. Thank you.

  8. Jim Hargrave

    Seems we have already hit the new limit, which was 2x the old one (4194304 vs. 2097152). I think we should leave this open-ended (not check for size) and let memory dictate what is possible. This does seem to be a bug in OpenXML/Word, as this has to cause a problem with other tools at some point. I’d bet there is a bug logged with Microsoft on this already.

  9. Chase Tingley

    @Jim Hargrave Woodstox sets a default value (which is smaller than what we currently set). I can’t find documentation that indicates if setting it to 0 disables the check completely. (Disabling the check also makes me uneasy, safety-wise.) I think we should just expose the value through the filter config like we do for IDML.

  10. Denis Konovalyenko

    @devesh kumar, I agree with @Jim Hargrave: setting the maxAttributeSize filter parameter to a value that is reasonable for you (more than 4194304 at the moment) should work.

  11. Devesh kumar

    Thanks all for your response on this,

    @Denis Konovalyenko since I am using Okapi, the class where maxAttributeSize was set to 4 MiB (in PR #447) shows up as a decompiled class for me and is read-only.

    Also, the required maxAttributeSize varies from file to file, so raising it above 4194304 may only work for now.

  12. Chase Tingley

    I forgot that Denis had already added the parameter for this in #974; there’s just no UI for it.

    Devesh – you don’t need to decompile anything. You just need to make a custom filter configuration for the OpenXML filter in Rainbow, then open the .fprm file in a text editor and change the value of the maxAttributeSize attribute.

    We are not going to remove the restriction entirely - doing so allows for server code running this filter to be DOSed.
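For setups where the .fprm text is handled in code rather than in a text editor, the same edit can be sketched as below. This is only an illustration: the class and method names are hypothetical, and the `maxAttributeSize.i=<value>` line format and the `#v1` header in the sample are assumptions about the .fprm layout, so compare them with a file saved from Rainbow before relying on this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Okapi.
final class FprmMaxAttributeSize {

    // Assumed key format: "<name>.i=<integer>".
    static final String KEY = "maxAttributeSize.i";

    /** Returns the .fprm text with the maxAttributeSize line replaced, or appended if absent. */
    static String withMaxAttributeSize(String fprmText, int newLimit) {
        String replacement = KEY + "=" + newLimit;
        List<String> lines = new ArrayList<>();
        boolean replaced = false;
        for (String line : fprmText.split("\\R")) {
            if (line.trim().startsWith(KEY + "=")) {
                lines.add(replacement);
                replaced = true;
            } else {
                lines.add(line);
            }
        }
        if (!replaced) {
            lines.add(replacement);
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        String original = "#v1\nmaxAttributeSize.i=4194304\n";
        System.out.println(withMaxAttributeSize(original, 8 * 1024 * 1024));
    }
}
```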

  13. Devesh kumar

    @Chase Tingley is something like this what you want me to write?

    I found it in the OkapiFilterFactory.java class:

        private static RegexFilter getSRTFilter() {
            RegexFilter filter = new RegexFilter();
            try {
                net.sf.okapi.filters.regex.Parameters params = (net.sf.okapi.filters.regex.Parameters) filter.getParameters();
                String config = IOUtils.toString(OkapiFilterFactory.class.getResourceAsStream(OKAPI_CUSTOM_CONFIGS_PATH + "okf_regex@srt.fprm"), "UTF-8");
                params.fromString(config);
            } catch (IOException e) {
                System.err.println("Strings custom configuration could not be loaded");
            }
            return filter;
        }
    

    This is the class (OkapiFilterFactory.java) where I can see and edit variables like:

    public static final String XML_CONFIG_FILENAME = "okf_xmlstream-custom.fprm";

    and these *.fprm files are present in /resources/okapi/configurations

  14. Di Hu

    Hi, I am having some difficulty with the maxAttrubuteSize config. I found that maxAttrubuteSize is read, but the value does not seem to be passed to P_MAX_ATTRIBUTE_SIZE by the code below in OpenXMLFilter.java. So during extraction, the exception

    “Caused by: javax.xml.stream.XMLStreamException: Maximum attribute size limit (2097152) exceeded” still occurs.

    Can you help identify if anything is wrong? Thank you so much!

    setPropertyIfSupported(inputFactory, WstxInputProperties.P_MAX_ATTRIBUTE_SIZE, conditionalParameters.getMaxAttributeSize());
    

    I’ve done the following:

    1. Added an .fprm file, okf_openxml@maxAttrSize.fprm, containing maxAttrubuteSize.i=33333333
    2. Created CustomOpenXMLFilterConfiguration.java:
    public class CustomOpenXMLFilterConfiguration {
        public static final String CUSTOM_OKAPI_FILTER_ID = "okf_openxml@maxAttrSize";
        private static final String OKAPI_OPENXML_FILTER_CLASS = "net.sf.okapi.filters.openxml.OpenXMLFilter";
        private static final String CONFIG_FILE_LOCATION = "/resources/okf_openxml@maxAttrSize.fprm";
        private static final String OPENXML_EXTENTIONS = ".docx;.docm;.dotx;.dotm;.pptx;.pptm;.ppsx;.ppsm;.potx;.potm;" +
            ".xlsx;.xlsm;.xltx;.xltm;.vsdx;.vsdm;";
    
        public static net.sf.okapi.common.filters.FilterConfiguration provideCustomOpenXMLFilterConfiguration() {
            return new FilterConfiguration(
                CUSTOM_OKAPI_FILTER_ID,
                MimeTypeMapper.XML_MIME_TYPE,
                OKAPI_OPENXML_FILTER_CLASS,
                "OPENXML (Customize MaxAttributeSize)",
                "Customize MaxAttributeSize",
                CONFIG_FILE_LOCATION,
                OPENXML_EXTENTIONS);
        }
    }
    

    3. In FilterConfiguration.java, added the customized config to FILTER_CONFIGURATION_MAPPER and added the new filter ID to EXTENSIONS_MAP.

    FILTER_CONFIGURATION_MAPPER.addConfiguration(CustomOpenXMLFilterConfiguration.provideCustomOpenXMLFilterConfiguration());
    
    public static final ImmutableMap<FileContentType, String>
        EXTENSIONS_MAP = new ImmutableMap.Builder<FileContentType, String>()
        .put(FileContentType.HTML, CustomHTMLFilterConfiguration.CUSTOM_OKAPI_FILTER_ID)
        .put(FileContentType.XLIFF, "okf_xliff")
        .put(FileContentType.MOSES_TEXT, "okf_mosestext")
        .put(FileContentType.DOCX, CustomOpenXMLFilterConfiguration.CUSTOM_OKAPI_FILTER_ID)
        .put(FileContentType.XLSX, CustomOpenXMLFilterConfiguration.CUSTOM_OKAPI_FILTER_ID)
        .put(FileContentType.PPTX, CustomOpenXMLFilterConfiguration.CUSTOM_OKAPI_FILTER_ID)
        .put(FileContentType.TMX, "okf_tmx")
        .build();
    

    Thanks in advance for the effort!!
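One thing that may be worth ruling out, assuming the line quoted in step 1 is the literal file content: the key there is spelled maxAttrubuteSize, while the parameter is called maxAttributeSize elsewhere in this thread (and in getMaxAttributeSize()). If the filter looks the value up under the latter name, a misspelled key would presumably be ignored and the default limit would stay in effect. The sketch below is a plain-Java sanity check for an .fprm text. The class and method names are hypothetical, and the `<name>.i=<integer>` line format is taken from the line quoted in step 1.

```java
import java.util.OptionalInt;

// Hypothetical helper, not part of Okapi.
final class FprmKeyCheck {

    /** Looks for a line of the form "<name>.i=<integer>" and returns its value if present. */
    static OptionalInt findIntParameter(String fprmText, String name) {
        String prefix = name + ".i=";
        for (String line : fprmText.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.startsWith(prefix)) {
                return OptionalInt.of(Integer.parseInt(trimmed.substring(prefix.length()).trim()));
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        String asPosted = "maxAttrubuteSize.i=33333333";
        OptionalInt value = findIntParameter(asPosted, "maxAttributeSize");
        System.out.println(value.isPresent()
            ? "maxAttributeSize = " + value.getAsInt()
            : "maxAttributeSize not found; check the key spelling");
    }
}
```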
