o:gfxdata tag results in loop taking too long to process (char by char)

Issue #351 resolved
Former user created an issue

Original issue 351 created by mrh... on 2013-07-11T18:42:42.000Z:

From email:

I have a DOCX file which has an o:gfxdata tag. The OpenXMLContentFilter appears to be caught in an infinite loop in the run method of the thread created in the combineRepeatedFormat method. The read call keeps returning the same cbuf contents.

public void run()
{
    try
    {
        while((n=br.read(cbuf,0,512))!=-1)
        {
            for(i=0;i<n;i++)
            {
                handleOneChar(cbuf[i]);
            }
        }

The document I am processing cannot be released to the public, so I am trying to create an example for this group.

Comments (7)

  1. Former user Account Deleted

    Comment 1. originally posted by mrh... on 2013-07-11T18:42:57.000Z:

    I tried this with 0.22-SNAPSHOT

  2. Former user Account Deleted

    Comment 2. originally posted by mrh... on 2013-07-11T18:45:42.000Z:

    The issue is not an endless loop. In fact, this document takes 17 minutes to process because all the data in the image is processed one character at a time.
    My original document had multiple documents. I left my original running all night and it still had not finished extracting.

    Perhaps some work needs to be put into handling o:gfxdata tags, so that the data portion is skipped rather than processed one character at a time. That would speed up extraction.

  3. Former user Account Deleted

    Comment 3. originally posted by mrh... on 2013-07-11T18:55:41.000Z:

    Correction: "My original document had multiple documents" should read "My original document had multiple images (30+)".

  4. Former user Account Deleted

    Comment 5. originally posted by @ysavourel on 2013-11-14T00:39:36.000Z:

    Running in tikal (tikal.sh -fc okf_openxml -x neverending.docx) only takes me 36s, but 90+% of the time is spent in the method mrhcon identified. At least half of that is OpenXMLContentFilter line 383:

        curtag = curtag + c;

    That's a simple "use a StringBuilder instead" problem. I will work up a patch.
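
The "use a StringBuilder instead" suggestion can be illustrated with a minimal sketch. This is not the actual OpenXMLContentFilter code or the eventual patch: the class name TagBufferSketch, both method names, and the 50,000-character input are assumptions made for illustration. Only the curtag = curtag + c pattern comes from the thread. Appending to a String in a loop copies everything accumulated so far on each iteration, so the total work grows quadratically with the length of the data, while StringBuilder appends into a growable buffer.

```java
import java.util.Arrays;

// Hypothetical demo class; not part of Okapi.
class TagBufferSketch {

    // Pattern quoted in the thread: every iteration allocates a new String
    // and copies all characters accumulated so far.
    static String concatWithString(char[] input) {
        String curtag = "";
        for (char c : input) {
            curtag = curtag + c;
        }
        return curtag;
    }

    // Suggested alternative: append into a single growable buffer and
    // convert to a String once at the end.
    static String concatWithBuilder(char[] input) {
        StringBuilder curtag = new StringBuilder(input.length);
        for (char c : input) {
            curtag.append(c);
        }
        return curtag.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a long run of embedded image data (size is arbitrary).
        char[] data = new char[50_000];
        Arrays.fill(data, 'A');

        long t0 = System.nanoTime();
        String slow = concatWithString(data);
        long t1 = System.nanoTime();
        String fast = concatWithBuilder(data);
        long t2 = System.nanoTime();

        // Both approaches must produce identical text; only the cost differs.
        System.out.println("same result: " + slow.equals(fast));
        System.out.println("String concat ms: " + (t1 - t0) / 1_000_000);
        System.out.println("StringBuilder ms: " + (t2 - t1) / 1_000_000);
    }
}
```

Timings vary by JVM and machine, so none are quoted here. The point of the sketch is only that the two methods return the same string while doing very different amounts of copying.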
