Any way to tweak OCR? I (capital i) is often detected as | (pipe)

I noticed that the SRT files generated from the OCR scan of PGS subtitles often have I detected as |. Example:

1701
01:51:27,622 --> 01:51:31,207
| got things | can do. | used to make
pretty good grades in high school.

This is probably not a big deal, but it turns out my Samsung TV has a problem with the | character. The TV doesn’t render it and shifts the rest of the line down, which covers a bunch of the screen.

I also noticed that music notes like ♫ get detected as S, J, etc. Is there any tweaking we can do to Subler/tesseract to make it more accurate? Even just preventing the | character from being detected seems like it would be a big improvement. I’m currently just find/replacing every | with I in the generated .srt as a workaround.
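
For anyone who wants to script that find/replace, here is a rough Python sketch. It is not part of Subler or tesseract; the function name `fix_pipes`, the `.fixed.srt` output suffix, and the UTF-8 encoding are all assumptions made for illustration. It skips cue numbers and timestamp lines and replaces | with I only in the subtitle text.

```python
import re
import sys
from pathlib import Path

# Matches SRT timing lines such as "01:51:27,622 --> 01:51:31,207".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def fix_pipes(srt_text: str) -> str:
    """Replace '|' with 'I' in subtitle text lines (hypothetical helper)."""
    fixed = []
    for line in srt_text.splitlines(keepends=True):
        # Leave cue numbers and timing lines untouched.
        if line.strip().isdigit() or TIMESTAMP.match(line):
            fixed.append(line)
        else:
            fixed.append(line.replace("|", "I"))
    return "".join(fixed)


if __name__ == "__main__" and len(sys.argv) == 2:
    source = Path(sys.argv[1])
    # Assumes the .srt is UTF-8; adjust if your files use another encoding.
    text = source.read_text(encoding="utf-8")
    # Write next to the original instead of overwriting it.
    target = source.with_suffix(".fixed.srt")
    target.write_text(fix_pipes(text), encoding="utf-8")
```

Saved under a name of your choosing and run with the path to an .srt as its only argument, it writes a corrected copy alongside the original. It replaces every | blindly, so a subtitle that legitimately contains a pipe would be changed too.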

Comments (6)