Get word and page counts from ep header, when available, for metadata indexing.

Issue #69 resolved
Philip Burns created an issue

Currently the metadata indexer counts pages and words by scanning the <pb> and <w> elements in each document. It should read the counts from the ep header when they are available, which would significantly reduce indexing time.
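
The header-first lookup with a fallback to a full scan can be sketched roughly as follows. This is an illustration in Python, not the indexer's actual code. The element names `wordCount` and `pageCount`, their placement under `xenoData`, and the use of the TEI namespace are all assumptions made for the example; the real header layout may differ.

```python
import xml.etree.ElementTree as ET

# Assumption: documents use the TEI namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def _header_count(root, name):
    """Return a precomputed count from the header, or None if unavailable."""
    # Hypothetical layout: <xenoData><wordCount>1200</wordCount>...</xenoData>
    elem = root.find(f".//{{{TEI_NS}}}xenoData//{{{TEI_NS}}}{name}")
    if elem is None or elem.text is None:
        return None
    try:
        # Go through float so values such as "1.5E1" still become integers.
        return int(float(elem.text.strip()))
    except (ValueError, OverflowError):
        return None


def count_words_and_pages(xml_text):
    """Return (words, pages), preferring header values over a full scan."""
    root = ET.fromstring(xml_text)
    words = _header_count(root, "wordCount")
    pages = _header_count(root, "pageCount")
    if words is None:
        # Slow path: walk the whole document and count <w> elements.
        words = sum(1 for _ in root.iter(f"{{{TEI_NS}}}w"))
    if pages is None:
        # Slow path: walk the whole document and count <pb> elements.
        pages = sum(1 for _ in root.iter(f"{{{TEI_NS}}}pb"))
    return words, pages
```

Each count falls back independently, so a header that carries only one of the two values still avoids one of the full scans.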

Comments (4)

  1. Martin Mueller

    Is this an eXist bug we should report to them? You said in an earlier email that Java would handle this case, but eXist, as a Java app, does not.

  2. Philip Burns reporter

    The word and page counts are our (my) implementation issue. The code works
    fine; it's just slow when it has to scan the entirety of each document to count the <pb> and <w> elements. If the counts reside in the ep header instead, pulling the values out of there is much faster.

    The issue with the number formatting may be a bug or a matter of
    unclear documentation. Our case is simple -- we want integer values to display as integers -- and I have a fix for that. If we wanted a fancier display, we'd have to look into the issue further.

    -- Philip R. "Pib" Burns Academic Software Development Northwestern University, Evanston, IL. USA pib@northwestern.edu

  3. Craig Berry

    Resolved by:

    commit bbe1bd29aa709d53861c070e2c5143ca013707d5 (HEAD -> master, origin/master, origin/HEAD)
    Author: Philip R. Burns <pib@northwestern.edu>
    Date:   Thu Apr 12 16:30:24 2018 -0500
    
        Improve indexing of page and word counts.
    
        If we have it precalculated in the xenodata, just use that instead
        of the much slower count done here.
    
        Also, force the counts to be integers so they don't get displayed
        in scientific notation.
    