Neither Chinese Simplified nor Chinese Traditional subtitles are converted correctly

Issue #12 closed
Former user created an issue

As the title says, neither Chinese Simplified nor Chinese Traditional subtitles are converted correctly.

The Tesseract .traineddata files are in the correct folder (Korean, for example, works perfectly).

Chinese conversions come out as:

#$[11!==a~~ (for example).

I guess that the chi_sim.traineddata (for simplified) and chi_trad.traineddata (for traditional) files are not correctly recognized. It would be great to have this fixed.

Comments (11)

  1. Damiano Galassi repo owner

    In the language code standard used by mp4 there isn't a distinction between traditional and simplified Chinese, so I don't know which of the two trained data files would work… you can try to rename one to zho.traineddata and see what happens.

  2. m2m

    First of all, thanks for looking into the issue. Renaming either of them to zho.traineddata works: the renamed file is then used to OCR the text, which is a step in the right direction.

    May I suggest updating the logic so that chi_sim.traineddata is used by default for Chinese (zho)?

    Ultimately I guess it's a shortcoming of the ID3v2 spec, which (wrongly) uses the ISO-639-2/T language code to identify the subtitle language, and that code does not differentiate between simplified and traditional. Wrongly, because ISO-639-2/T specifies spoken languages but not their written form.

    Anyway, maybe this http://www.loc.gov/standards/iso639-2/faq.html#24 also helps. If I have the time, I will try to find an iTunes movie with traditional and simplified subtitles to find out how Apple implemented it.

  3. Damiano Galassi repo owner

    Apple introduced an extended language tag, but I have yet to implement it. For now I'll change it to select the chi_sim.traineddata file.
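
    The interim change described here amounts to an alias lookup before the traineddata file name is built. A minimal Python sketch of that idea, not Subler's actual code; the function name and the alias table are hypothetical:

```python
# Hypothetical alias table: ISO 639-2/T codes whose Tesseract data file
# is not simply "<code>.traineddata".
TRAINEDDATA_ALIASES = {
    "zho": "chi_sim",  # default to Simplified Chinese, as proposed in this thread
}


def traineddata_name(iso639_2_code: str) -> str:
    """Return the traineddata file name to look up for a track language."""
    base = TRAINEDDATA_ALIASES.get(iso639_2_code, iso639_2_code)
    return f"{base}.traineddata"
```

    With this, a track tagged zho resolves to chi_sim.traineddata, while a code such as kor still resolves to kor.traineddata unchanged.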

  4. m2m

    Thanks. For reference, the subtitles in the original MKV are as follows:

    Subtitle 1: English

    Subtitle 2: Chinese - Traditional (繁體)

    Subtitle 3: Chinese - Simplified (简体)

    Subtitle 4: Chinese - Traditional (繁體)

    Subtitle 5: Korean

    繁體 / 简体 is how the different subtitles are named on, for example, Chinese DVD / Blu-ray players.

    Just one additional note: while "my" issue was raised about subtitles, I am surprised (after digging into the details) that there is only one language tag for Chinese. Surprised because I guess audio tracks should have a similar problem: Cantonese (spoken mainly in Hong Kong and by a lot of overseas Chinese, and consequently the language of most Hong Kong movies) vs Mandarin (spoken in Beijing and the mainland). These two languages are different from each other, and if you can speak one you cannot automatically speak the other. I guess Italian is actually more similar to French than Cantonese is to Mandarin :)

  5. m2m

    I guess I found some valid documentation. It is for QuickTime, but I guess something similar applies to mp4 containers.

    Quoted from https://developer.apple.com/library/mac/documentation/QuickTime/QTFF/QTFFChap4/qtff4.html :

    ISO 639-2/T codes do not distinguish between certain language variations. Use an extended language tag atom ('elng') to make these distinctions. For example, ISO 639-2T does not distinguish between traditional and simplified Chinese, so also use 'elng' with the value "zh-Hant" or "zh-Hans", respectively. See Extended Language Tag Atom

    QuickTime itself uses the following language code values:

    Traditional Chinese: 19

    Simplified Chinese: 33
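
    For context, as far as the QuickTime File Format documentation describes it, the 16-bit language field in the media header holds either one of these numeric codes (values below 0x400) or a packed ISO 639-2/T code (three 5-bit letters, each stored as the ASCII value minus 0x60). A rough Python sketch of telling the two apart; the function name is made up, and the table only lists the two codes quoted above:

```python
# Only the two QuickTime codes mentioned in this thread; the real table is longer.
QUICKTIME_LANGUAGE_CODES = {
    19: "Traditional Chinese",
    33: "Simplified Chinese",
}


def decode_language_field(value: int) -> str:
    """Decode a 16-bit media-header language value (illustrative only)."""
    if value == 0x7FFF:
        # Assumption: 0x7FFF marks an unspecified language.
        return "unspecified"
    if value < 0x400:
        # Small values are QuickTime numeric language codes.
        return QUICKTIME_LANGUAGE_CODES.get(value, f"QuickTime language code {value}")
    # Otherwise: three 5-bit fields, each a lowercase letter minus 0x60.
    letters = [((value >> shift) & 0x1F) + 0x60 for shift in (10, 5, 0)]
    return "".join(chr(code) for code in letters)
```

    A packed value decodes to a plain three-letter code such as zho, which is exactly where the traditional/simplified distinction is lost.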

    But I guess it's safer to use the Extended Language Tag Atom (https://developer.apple.com/library/mac/documentation/QuickTime/QTFF/QTFFChap2/qtff2.html#//apple_ref/doc/uid/TP40000939-CH204-SW16), which would then map as follows:

    zh-Hant: Chinese - Traditional (繁體) -> chi_trad.traineddata

    zh-Hans: Chinese - Simplified (简体) -> chi_sim.traineddata
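
    The mapping above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function and table names are hypothetical, reading the 'elng' atom itself is not shown, and the traditional file is spelled chi_tra.traineddata because that is the name Tesseract's stock language data uses (adjust it if your file is named differently).

```python
from typing import Optional

# Script subtag of the extended language tag -> Tesseract data file base name.
CHINESE_SCRIPT_TO_TRAINEDDATA = {
    "hant": "chi_tra",  # Traditional (繁體)
    "hans": "chi_sim",  # Simplified (简体)
}


def traineddata_for_track(iso639_2_code: str, extended_tag: Optional[str] = None) -> str:
    """Pick a traineddata file, preferring the extended tag when it is present."""
    if extended_tag:
        subtags = extended_tag.lower().split("-")
        if subtags[0] == "zh":
            for subtag in subtags[1:]:
                if subtag in CHINESE_SCRIPT_TO_TRAINEDDATA:
                    return CHINESE_SCRIPT_TO_TRAINEDDATA[subtag] + ".traineddata"
    if iso639_2_code == "zho":
        # No usable script information: fall back to Simplified Chinese.
        return "chi_sim.traineddata"
    return iso639_2_code + ".traineddata"
```

    Falling back to chi_sim when no extended tag is present keeps the behaviour agreed on earlier in this thread for files that only carry zho.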

    It would be cool if Subler could support this :)
