Get word and page counts from ep header, when available, for metadata indexing.
Currently the metadata indexer counts pages and words by scanning the <pb> and <w> elements in each document. The indexer should instead read the counts from the ep header when they are available, which should significantly reduce indexing time.
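A minimal sketch of the requested behavior, written in Python only for illustration (the actual indexer is not shown in this issue). The header element names `epHeader`, `wordCount`, and `pageCount` are assumptions, as are all function names. The idea is to use a precalculated count when one exists and fall back to counting <w> and <pb> elements otherwise.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def header_count(root, name):
    """Return the integer stored in the header element `name`, or None.

    Assumes (hypothetically) that counts live in children of an
    <epHeader> element; adjust the names to the real header layout.
    """
    for header in root.iter():
        if not isinstance(header.tag, str) or local_name(header.tag) != "epHeader":
            continue
        for child in header.iter():
            if local_name(child.tag) == name and child.text:
                try:
                    # Go through float() so a value such as "1.2E3" is
                    # accepted, then force it to an integer.
                    return int(float(child.text.strip()))
                except ValueError:
                    return None
    return None


def count_elements(root, name):
    """Slow path: scan the whole document and count elements called `name`."""
    return sum(
        1
        for el in root.iter()
        if isinstance(el.tag, str) and local_name(el.tag) == name
    )


def word_and_page_counts(root):
    """Prefer header counts; fall back to counting <w> and <pb> elements."""
    words = header_count(root, "wordCount")
    if words is None:
        words = count_elements(root, "w")
    pages = header_count(root, "pageCount")
    if pages is None:
        pages = count_elements(root, "pb")
    return words, pages
```

This sketch only shows the fallback logic. Inside a database-backed indexer the header lookup would normally be a direct path query, which is where the time saving would come from.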
Comments (4)
-
-
reporter The word and page counts are our (my) implementation issue. The current approach works fine; it's just slow when the code has to scan the entirety of each document to count the <pb> and <w> elements. If the counts instead reside in the ep header, pulling the values out of there is much faster.
The issue with the number formatting may be a bug or a matter of unclear documentation. Our case is simple: we want integer values to display as integers, and I have a fix for that. If we wanted a fancier display, we'd have to look into the issue further.
-- Philip R. "Pib" Burns, Academic Software Development, Northwestern University, Evanston, IL, USA, pib@northwestern.edu
-
Integers are good enough. We have lots of other things on our plate.
-
- changed status to resolved
Resolved by:
commit bbe1bd29aa709d53861c070e2c5143ca013707d5 (HEAD -> master, origin/master, origin/HEAD)
Author: Philip R. Burns <pib@northwestern.edu>
Date:   Thu Apr 12 16:30:24 2018 -0500

    Improve indexing of page and word counts.

    If we have it precalculated in the xenodata, just use that instead of the much slower count done here. Also, force the counts to be integers so they don't get displayed in scientific notation.
Is this an eXist bug we should report to them? You said in an earlier email that Java would handle this case, but eXist, as a Java app, does not.