Random characters in hidden text added to index entries

Issue #226 invalid
jsbien created an issue

Looks like some semi-random text is sometimes (definitely too often...) added to the text used to prefill the entry field, cf. the screenshot.

I'm unable to make a screenshot with the underlying text, PrtScr does not work for me if a pop-up window is open.

I never was happy with the present way of displaying the hidden text, and making it more convenient seems a prerequisite to diagnose the problem.

Comments (22)

  1. Michał Rudolf repo owner

    I see some random text sometimes, but I assumed this was not a bug in the app, but in the hOCR.

  2. jsbien reporter

    At first I also assumed that the strange strings were just OCR errors, but now it seems there is something else going on (more experiments are needed).

  3. jsbien reporter

    Sometimes, but not always, the characters come from the line below or above, which is theoretically correct but not convenient.

  4. jsbien reporter

    I confirm that the characters come from a neighbouring line, but in the case of a lower line only those which are tall enough. The problem is to be discussed in due time on Skype.

  5. Michał Rudolf repo owner

    I feel the problem is that just a few pixels from lines above and below are enough to grab all the hidden text from those lines. And it is hard to mark precisely, because the lines often overlap (or almost overlap).

    I made a minor adjustment to the rectangle used for grabbing hidden text, decreasing its top and bottom coordinates. Please check if it is easier to avoid the problem now.
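
    To illustrate the kind of adjustment meant here, a minimal Python sketch (not the application's actual code): the selection rectangle is pulled in vertically before it is tested against the text boxes. The `Rect` type, the `shrink_vertically` and `intersects` helpers, the margin value and the top-down pixel coordinates are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    # Hypothetical pixel rectangle; y grows downward (assumption).
    left: int
    top: int
    right: int
    bottom: int


def shrink_vertically(rect: Rect, margin: int) -> Rect:
    """Pull the top and bottom edges inward without inverting the rectangle."""
    limit = max(0, (rect.bottom - rect.top) // 2)
    m = max(0, min(margin, limit))
    return Rect(rect.left, rect.top + m, rect.right, rect.bottom - m)


def intersects(a: Rect, b: Rect) -> bool:
    """True when the two rectangles share a non-empty area."""
    return (a.left < b.right and b.left < a.right
            and a.top < b.bottom and b.top < a.bottom)


# The user's mark overlaps the line above by two pixels.
selection = Rect(0, 10, 100, 32)
line_above = Rect(0, 0, 100, 12)
tightened = shrink_vertically(selection, 3)
```

    With these made-up numbers the raw selection touches the line above, while the tightened one no longer does. When the lines themselves overlap, no fixed margin can separate them.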

  6. jsbien reporter

    This is a bug! It selects only those letters from the line below or above which are partially included in the marked area. The text should be copied as in djview4, preserving the text structure.

  7. jsbien reporter
    • changed status to open

    I hope this is a simple correction. Perhaps the code from djview4 can be used directly? It seems to require a conceptually different approach, not just some adjustments.

  8. Michał Rudolf repo owner

    What do you mean by preserving the text structure? How does it work in djview4?

    I just retrieve the text which is returned for given region, is there any other way to do it?

  9. jsbien reporter

    I can of course demonstrate on Skype how it works in djview4, but is it not visible in the code?

  10. Joachim Aleszkiewicz

    OK, that took me some time, but here it is: test version.

    The problem here is that DjVu files can have different structures of hidden text. In Linde every character is separate, in other files entire words exist as a single entity, and in one (I don't remember which) 1131 characters are squished together in a single entity. The behavior you've seen in DjView4 seems to occur only when characters aren't separate.

    Anyway, in this build I've tried to select all characters in between the selected characters, so a selection like this:

    SelectionSample.png

    results in Flegetonta w gó being copied. However, it is possible that it broke selection in some documents, so that definitely requires testing. For ease of testing, there is also a new option under View > Show text layer.
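
    The idea can be sketched in a few lines of Python. This is an illustration only, not the code from the test build: `select_between`, the index set and the sample line (a hypothetical continuation of the text in the screenshot) are assumptions.

```python
from typing import List, Sequence, Set


def select_between(items: Sequence[str], hit: Set[int]) -> List[str]:
    """Return every item from the first hit to the last hit, inclusive.

    `items` is the hidden text in reading order and `hit` holds the
    indices of the items actually touched by the selection rectangle.
    """
    valid = [i for i in hit if 0 <= i < len(items)]
    if not valid:
        return []
    return list(items[min(valid):max(valid) + 1])


# Hypothetical line; only a few of its characters are touched by the mark.
line = list("Flegetonta w górach")
touched = {0, 4, 14}
copied = "".join(select_between(line, touched))
```

    Filling the gaps this way keeps words intact, but a stray hit on a neighbouring line would pull in everything between it and the intended characters.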

    My repo is here, build is based on 571a8bf

  11. jsbien reporter

    Thank you very much for the analysis - now the behaviour of the program is logically explained.

    Yes, the hidden text intentionally has characters as the lowest entity as it seemed potentially useful. It was not easy and required an extension of hOCR format made by Wilk.

    As for 1131 characters as one entity, the Tesseract output is a great disappointment, but the real culprit is inappropriate binarization, which causes e.g. speckles to be recognized as characters. Something should be done with this in the future. Some kind of structure validation is definitely needed.

    The option to show the text layer is definitely very useful; it was actually an old issue that was never solved fully satisfactorily (#21, #229). It is interesting how the page structure is shown. To tell the truth, I don't understand what is going on. It seems that in the case of character-level structure the word level is ignored and the line level is displayed. Is it taken from the file or computed on the fly? Please note that pressing Shift also displays words/lines, not single characters.

    It would be nice to be able to see the whole structure, Bottou didn't want to implement it but gave some instructions: https://sourceforge.net/p/djvu/discussion/103286/thread/77392e32/?limit=25#ca7d

    I've just done some quick tests which seem to confirm my view that the behaviour of djview4 and djview4poliqarp is different. My suspicion is that djview4 never copies characters, only words (from the word or line structure). I would definitely prefer a solution which is compatible with djview4.

    As for testing, I will need a merge of both branches, as there are some essential changes in the main branch. There is of course no hurry, as it's holidays time :-)

  12. Joachim Aleszkiewicz

    As far as I can tell, I'm using the latest commit from the main branch, so those changes are included.

    The structure in DjVu file is hierarchical; 'page' is divided into 'column's, then 'region's, 'para'[graph]s, 'line's, 'word's, and finally 'char's. Of course, depending on OCR any of these nodes might contain one or more sub-nodes, or be terminal and contain text.

    After loading page, it is flattened using flatten_hiddentext, which... Well, flattens it. All that's left is a list of characters/words (with their positions) separated by keywords like 'word' and 'line'.
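
    As a rough illustration of the flattening step, here is a Python sketch written from the description above, not a port of flatten_hiddentext. The `Node` and `Leaf` types, the box layout, and the choice to emit a node's keyword after its children are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), assumed layout


@dataclass
class Node:
    kind: str                   # 'page', 'column', ..., 'word' or 'char'
    box: Box
    text: Optional[str] = None  # set only on terminal nodes
    children: List["Node"] = field(default_factory=list)


@dataclass
class Leaf:
    kind: str
    box: Box
    text: str


def flatten(node: Node, out: Optional[List[Union[Leaf, str]]] = None):
    """Turn the tree into a list of text leaves separated by keywords."""
    if out is None:
        out = []
    if node.text is not None:
        # Terminal node: keep its text together with its position.
        out.append(Leaf(node.kind, node.box, node.text))
    else:
        for child in node.children:
            flatten(child, out)
        # Assumption: the keyword closing a node follows its children.
        out.append(node.kind)
    return out


# A line with one word stored as characters and one stored whole.
tree = Node("line", (0, 0, 40, 10), children=[
    Node("word", (0, 0, 20, 10), children=[
        Node("char", (0, 0, 10, 10), text="a"),
        Node("char", (10, 0, 20, 10), text="b"),
    ]),
    Node("word", (30, 0, 40, 10), text="c"),
])
flat = flatten(tree)
summary = [item if isinstance(item, str) else item.text for item in flat]
```

    In this sketch a word stored as a terminal node produces no 'word' keyword of its own, which is one way mixed granularity can change what a selection returns.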

    Extracting text is done in two ways:

    • Pressing SHIFT calls getTextForPointer, which
      1. finds (first) character under cursor (as q)
      2. gets text starting from separator before q ending on q (exclusive)
      3. gets text from q
      4. gets text starting from q ending on separator stronger than word (exclusive)
      5. returns 2, 3, 4 - this is later formatted as "2[3]4"
    • Copy text for selection uses getTextForRect. This is the function I've changed. The original here is identical to djview4's version:
      1. set separator to char
      2. while list not empty, get next object from list
        • if object contains text:
          • if separator is word append ' ' to return value, if line/para/region append newline, if page append newpage
          • append text from object to return value
        • else set separator to object
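
    The getTextForRect steps above can be sketched in Python as follows. This is a reading of the description, not the djview4 source: the tuple layout, the rectangle test, resetting the separator to 'char' after each text object, treating 'column' like a line break, and skipping a leading separator are all assumptions.

```python
from typing import List, Tuple, Union

Box = Tuple[int, int, int, int]            # (left, top, right, bottom), assumed
Item = Union[Tuple[str, Box], str]         # (text, box) or a separator keyword


def intersects(a: Box, b: Box) -> bool:
    """True when the two boxes share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def text_for_rect(flat: List[Item], rect: Box) -> str:
    """Collect the text of every object touching `rect`."""
    result = ""
    separator = "char"
    for item in flat:
        if isinstance(item, str):
            # Keyword: remember it as the pending separator.
            separator = item
            continue
        text, box = item
        if not intersects(box, rect):
            continue
        if result:  # assumption: never start the result with a separator
            if separator == "word":
                result += " "
            elif separator in ("line", "para", "region", "column"):
                result += "\n"
            elif separator == "page":
                result += "\f"
        result += text
        separator = "char"  # assumption: reset after each text object
    return result


# Two lines: "ab c" on top and "d" below, stored character by character.
flat: List[Item] = [
    ("a", (0, 0, 10, 10)), ("b", (10, 0, 20, 10)), "word",
    ("c", (30, 0, 40, 10)), "word", "line",
    ("d", (0, 20, 10, 30)), "word", "line",
]
top_line = text_for_rect(flat, (0, 0, 100, 15))
sloppy = text_for_rect(flat, (0, 8, 100, 30))
```

    Here the second rectangle is meant to mark only the lower line but overlaps the upper one by two pixels, so under these assumptions it returns the text of both lines.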

    Seeing as those functions are identical to those in djview4, I don't see how djview4 can behave differently from djview-poliqarp. Could you point out the file and page where it does, preferably with a link to the selected area? It wouldn't actually surprise me if there was some tag saying 'hey, I'm a smart document, so selection here will be done by this formula written in encoded MathML'. I hope there isn't, but it wouldn't surprise me.

    My thoughts on this: It would be possible, but extremely hard, to implement text selection like in Adobe Acrobat. It would be possible, and perhaps worthwhile, to add a highlight to the area where the selected text is. If selecting too much is a problem, my algorithm could be modified to mark for selection only characters which are at least 50% covered by the selection - but commas would probably be caught anyway. Maybe some sort of preview before copy?
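
    A minimal Python sketch of the 50% rule, under assumptions: boxes are (left, top, right, bottom) tuples with y growing downward, and `covered_fraction` and `mostly_selected` are hypothetical helper names, not functions from the code base.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), assumed layout


def covered_fraction(box: Box, selection: Box) -> float:
    """Share of the box area that lies inside the selection (0.0 to 1.0)."""
    width = min(box[2], selection[2]) - max(box[0], selection[0])
    height = min(box[3], selection[3]) - max(box[1], selection[1])
    area = (box[2] - box[0]) * (box[3] - box[1])
    if width <= 0 or height <= 0 or area <= 0:
        return 0.0
    return (width * height) / area


def mostly_selected(box: Box, selection: Box, threshold: float = 0.5) -> bool:
    """True when at least `threshold` of the box is covered."""
    return covered_fraction(box, selection) >= threshold


selection = (0, 8, 100, 30)      # mark meant for the lower line
letter_above = (0, 0, 10, 10)    # letter on the upper line, barely touched
comma_above = (10, 7, 12, 12)    # comma hanging low on the upper line
letter_below = (0, 20, 10, 30)   # letter on the intended line
```

    With these made-up boxes the letter from the upper line is rejected and the letter from the intended line is kept, but the low-hanging comma is still mostly covered and would be selected.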

  13. jsbien reporter

    Thanks for clarification. My conclusion is:

    For index entries getTextForPointer would be much better than getTextForRect, which I understand is used now.

    Your modification of getTextForRect is still useful for 'copy text' in the pop-up menu.

    BTW, I would very much appreciate quick switching between hidden text and graphics, e.g. a shortcut key.

    Some comments:

    There are definitely no hidden features like MathML; the format sticks to the relatively short and simple specification.

    I will check the behaviour of djview4 again in more detail, but it's not relevant to this issue now.

    It's not clear to me what "text selection like in Adobe Acrobat" means, but we can come back to the topic by mail, e.g. after the holidays.

    As for my confusion about the version, I was misled by a server-related message which I don't remember and cannot reproduce.

  14. jsbien reporter

    I'm afraid the change to getTextForRect has to be reverted. Page 425 of volume 1: when I mark the rectangle "szwiec" in djview4, I get "szwiec,". When I mark it in djview4poliqarp, I get "skóry, jako szwiec, garbarz, Graec. Stya, ; Lat.".
