o:gfxdata tag results in loop taking too long to process (char by char)
Original issue 351 created by mrh... on 2013-07-11T18:42:42.000Z:
From email:
I have a DOCX file which has an o:gfxdata tag. The OpenXMLContentFilter is caught in an infinite loop in the run method of the thread created in the combineRepeatedFormat method. The read keeps returning the same cbuf value.
public void run()
{
    try
    {
        while((n=br.read(cbuf,0,512))!=-1)
        {
            for(i=0;i<n;i++)
            {
                handleOneChar(cbuf[i]);
            }
        }
The document I am processing cannot be released to the public, so I am trying to create an example for this group.
Comments (7)
-
Account Deleted -
Account Deleted Comment 2. originally posted by mrh... on 2013-07-11T18:45:42.000Z:
The issue is not an endless loop. In fact this document takes 17 min to process because all the data in the image gets processed one character at a time.
My original document had multiple documents. I left my original running all night and it still didn't finish extracting. Perhaps some work needs to be put into o:gfxdata tags, so that the data portion is skipped and not processed one character at a time. This would speed up extraction.
-
Account Deleted Comment 3. originally posted by mrh... on 2013-07-11T18:55:41.000Z:
Correction, "My original document had multiple documents" should read "My original document had multiple images (30+)".
-
Account Deleted Comment 4. originally posted by @ysavourel on 2013-08-17T13:17:36.000Z:
-
Account Deleted Comment 5. originally posted by @ysavourel on 2013-11-14T00:39:36.000Z:
Running in tikal (tikal.sh -fc okf_openxml -x neverending.docx) only takes me 36s, but 90+% of the time is spent in the method mrhcon identified. At least half of that is OpenXMLContentFilter line 383:
curtag = curtag + c;
That's a simple "use a StringBuilder instead" problem. I will work up a patch.
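For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of why the concatenation is slow and what the StringBuilder replacement looks like. It is not the actual patch: the class and method names (TagBufferSketch, buildWithConcat, buildWithBuilder) are hypothetical and only the curtag variable name comes from the line quoted above. Each `curtag + c` copies everything accumulated so far, so a long run of image data costs quadratic time, while StringBuilder appends in amortized constant time.

```java
import java.util.Arrays;

// Hypothetical illustration only; not code from OpenXMLContentFilter.
public class TagBufferSketch {

    // Quadratic: every iteration allocates a new String and copies
    // all characters accumulated so far.
    static String buildWithConcat(char[] data) {
        String curtag = "";
        for (char c : data) {
            curtag = curtag + c;
        }
        return curtag;
    }

    // Amortized linear: StringBuilder appends into a growable buffer.
    static String buildWithBuilder(char[] data) {
        StringBuilder curtag = new StringBuilder(data.length);
        for (char c : data) {
            curtag.append(c);
        }
        return curtag.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a long run of embedded image data (size is an assumption).
        char[] data = new char[50_000];
        Arrays.fill(data, 'A');

        long t0 = System.nanoTime();
        String a = buildWithConcat(data);
        long t1 = System.nanoTime();
        String b = buildWithBuilder(data);
        long t2 = System.nanoTime();

        System.out.println("same result: " + a.equals(b));
        System.out.printf("concat: %d ms, builder: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

Both methods return the same string; only the cost differs, and the gap widens as the input grows. Timings will vary by machine and JVM.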
-
Account Deleted - changed status to resolved
Comment 6. originally posted by @ysavourel on 2013-11-14T04:37:56.000Z:
Fixed on dev, commit 11cb1ffdaf4bc2eb2fb383feb24af4c467658c16.
A roundtrip of this file (filter + merge) went from about 50 seconds to under 2 seconds on my machine.
-
Account Deleted Comment 7. originally posted by @ysavourel on 2013-11-14T05:01:28.000Z:
Great! Thanks.
Comment 1. originally posted by mrh... on 2013-07-11T18:42:57.000Z:
I tried this with 0.22-SNAPSHOT