Any way to tweak OCR? I (capital i) is often detected as | (pipe)
I noticed that SRT files generated from the OCR scan of PGS subtitles often have I detected as |. Example:
1701
01:51:27,622 --> 01:51:31,207
| got things | can do. | used to make
pretty good grades in high school.
This is probably not a big deal, but it turns out my Samsung TV has a problem with the | character. It doesn’t render it and shifts the rest of the line down, which covers a bunch of the screen.
I also noticed music notes like ♫ get detected as S or J etc. Is there any tweaking we can do in Subler/Tesseract to make it more accurate? Even restricting the | character from being detected seems like it would be a big improvement. I’m currently just find/replacing every | with I in the generated .srt as a workaround.
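For anyone who wants to script that workaround, here is a minimal Python sketch. It assumes a real | never appears in the dialogue, so every pipe is treated as a misread I. The function names `fix_pipes` and `fix_file` and the file path in the usage comment are made up for illustration and are not part of Subler or Tesseract.

```python
from pathlib import Path


def fix_pipes(srt_text):
    """Replace every | with I and report how many replacements were made.

    Assumption: a genuine pipe never occurs in the subtitles, so every |
    is an OCR misread of a capital I.
    """
    replaced = srt_text.count("|")
    return srt_text.replace("|", "I"), replaced


def fix_file(path):
    """Rewrite one .srt file in place (hypothetical helper, UTF-8 assumed)."""
    srt_path = Path(path)
    fixed_text, replaced = fix_pipes(srt_path.read_text(encoding="utf-8"))
    srt_path.write_text(fixed_text, encoding="utf-8")
    return replaced


# The cue quoted in the report above.
sample = """1701
01:51:27,622 --> 01:51:31,207
| got things | can do. | used to make
pretty good grades in high school."""

fixed, count = fix_pipes(sample)
print(fixed)
print(count, "pipe(s) replaced")

# Usage on a real file (the path is a placeholder):
# fix_file("movie.en.srt")
```

The replacement count also gives a rough way to compare OCR runs against each other, since fewer pipes usually means fewer misread I characters.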
Comments (6)
- repo owner: Did you download the English OCR data file in Preferences → OCR? I think it helps a bit.
- reporter: Yes, I even tried downloading a more current one from https://github.com/tesseract-ocr/tessdata and manually replacing the file. Didn’t seem to help at all.
- I have the same issue. Also, lowercase ell is sometimes used for capital eye along with other predictable issues. Another app I use will OCR PGS subs at basically 100% accuracy, so I took a peek inside to see what the difference is. The other app uses Tesseract LSTM version and the tessdata_best data files. Unfortunately, my dev environment is not set up where I can build Subler and test with LSTM to see if accuracy would improve.
- repo owner: Next week I will make some changes to expose some Tesseract options so you will be able to test what works the best.
- reporter: Any update on this? Seems there’s been a few releases since, but no improvement in OCR that I can see. As a quick comparison I used https://github.com/Tentacule/PgsToSrt via a docker container and the amount of | characters in the resulting SRT was 4. The same source resulted in 320 | in the resulting SRT from Subler. It’s a shame because Subler is really the best interface for macOS. I’d be happy to help test with some pointers.
- repo owner: PgsToSrt uses Tesseract 3, and it seems to work better with English VobSub. Unfortunately Subler is already on version 4.