Any way to tweak OCR? I (capital i) is often detected as | (pipe)

Issue #548 new
__ created an issue

I noticed that the SRT files generated by OCR-scanning PGS subtitles often have I detected as |. Example:

1701
01:51:27,622 --> 01:51:31,207
| got things | can do. | used to make
pretty good grades in high school.

This would probably not be a big deal, but it turns out my Samsung TV has a problem with the | character. It doesn’t render it and shifts the rest of the line down, which covers a bunch of the screen.

I also noticed that music notes get detected as S or J, etc. Is there any tweaking we can do to Subler/tesseract to make it more accurate? Even preventing the | character from being detected seems like it would be a big improvement. As a workaround, I’m currently just find/replacing every | with I in the generated .srt.
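For anyone who wants to script the find/replace workaround, here is a minimal Python sketch. It is not part of Subler. The function name `fix_pipes` is made up for illustration, and it assumes that every | in a subtitle text line is a misread capital I, which will be wrong for any subtitle that really contains a pipe.

```python
import re
from typing import Tuple

# Matches an SRT timing line such as "01:51:27,622 --> 01:51:31,207".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def fix_pipes(srt_text: str) -> Tuple[str, int]:
    """Replace '|' with 'I' in subtitle text lines.

    Returns the corrected text and the number of characters replaced.
    Assumption: every pipe in a text line is a misread capital I.
    """
    fixed_lines = []
    replaced = 0
    for line in srt_text.split("\n"):
        # Leave all-digit lines (cue numbers) and timing lines untouched.
        if line.strip().isdigit() or TIMING.match(line):
            fixed_lines.append(line)
            continue
        replaced += line.count("|")
        fixed_lines.append(line.replace("|", "I"))
    return "\n".join(fixed_lines), replaced
```

To use it, read the generated .srt into a string, call `fix_pipes`, and write the returned text back out. The returned count also gives a quick way to compare how many | characters different OCR tools produce for the same source.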

Comments (6)

  1. Damiano Galassi repo owner

    Did you download the English OCR data file in Preferences → OCR? I think it helps a bit.

  2. John West

    I have the same issue. Also, lowercase ell is sometimes detected in place of capital eye, along with other predictable errors. Another app I use OCRs PGS subs at basically 100% accuracy, so I took a peek inside to see what the difference is. The other app uses the LSTM version of Tesseract and the tessdata_best data files. Unfortunately, my dev environment is not set up to build Subler, so I can’t test whether LSTM would improve accuracy.

  3. Damiano Galassi repo owner

    Next week I will make some changes to expose some Tesseract options so you will be able to test what works best.

  4. __ reporter

    Any update on this? It seems there have been a few releases since, but no improvement in OCR that I can see. As a quick comparison, I ran https://github.com/Tentacule/PgsToSrt in a Docker container, and the resulting SRT contained 4 | characters. The same source produced 320 | characters in the SRT from Subler. It’s a shame, because Subler is really the best interface for macOS. I’d be happy to help test, given some pointers.

  5. Damiano Galassi repo owner

    PgsToSrt uses Tesseract 3, and it seems to work better with English VobSub. Unfortunately Subler is already on version 4.
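For testing outside Subler, the ideas raised in this thread (the LSTM engine, the tessdata_best files, and keeping | out of the results) can be tried with the standalone tesseract command-line tool. The sketch below only builds the argument list. The helper name `tesseract_command` and the example paths are hypothetical, it says nothing about how Subler calls Tesseract internally, and whether the LSTM engine honors the character blacklist depends on the Tesseract version installed.

```python
from typing import List, Optional


def tesseract_command(
    image_path: str,
    lang: str = "eng",
    tessdata_dir: Optional[str] = None,
    blacklist: str = "|",
) -> List[str]:
    """Build an argument list for the standalone tesseract CLI.

    Assumptions: tesseract is on PATH, and tessdata_dir (if given) points
    at a folder holding the chosen traineddata files, e.g. tessdata_best.
    """
    # "stdout" as the output base makes tesseract print the recognized text.
    cmd = ["tesseract", image_path, "stdout"]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    # --oem 1 selects the LSTM-only engine.
    cmd += ["-l", lang, "--oem", "1"]
    if blacklist:
        # Ask the engine not to emit these characters.
        cmd += ["-c", "tessedit_char_blacklist=" + blacklist]
    return cmd
```

Passing the returned list to `subprocess.run(cmd, capture_output=True, text=True)` runs it without a shell, so the | in the blacklist is not interpreted as a shell pipe.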
