String encoding issue on ID3v23 frames

Issue #40 resolved
IJabz repo owner created an issue

Reading the ID3 tags of a file (link below) yields Chinese characters instead of the correct "Frank Sinatra". The file itself seems fine: it was ripped directly from CD, and all other MP3 taggers, as well as Windows, read it correctly.

I have debugged your code and tracked it down:

From https://java.net/jira/browse/JAUDIOTAGGER-484

In class org.jaudiotagger.tag.datatype.TextEncodedStringSizeTerminated, line 83, the expected encoding is extracted and comes back as UTF-16, which produces the Chinese characters when the byte stream is decoded.

If I manually change that encoding to "ISO-8859-1" then the string is decoded correctly.

I have uploaded the MP3 file so you can use it for testing: https://s3.amazonaws.com/onair_downloads/tmp/test.mp3 (please notify me when I can take it down)

Comments (2)

  1. IJabz reporter

    The problem with this file is that in ID3v23 a Unicode string (signified by the encoding byte being set to 1) should start with a byte order mark indicating whether it is UTF-16LE or UTF-16BE. This file does not have the byte order mark, so it is assumed to be BE (at least on Windows-Intel). What we could perhaps do is make a more educated guess (if I change the code so that it treats the data as LE, the info is read correctly).

  2. IJabz reporter

    Fixed. If an ID3 frame is marked as UTF-16 and has a BOM, we use that. If it has no BOM, we look at whether the first byte contains data or not: if it doesn't, the text is likely to be BE; if it does, it is likely to be LE (at least for European languages). There is only a 50/50 chance of getting it right for languages that use two bytes for most characters, such as Chinese or Japanese, but remember that we are only trying to cope with an invalid file here.
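
For readers who want to see the idea in isolation, here is a minimal sketch of the heuristic described in the fix comment above. It is not the actual jaudiotagger code: the class `Utf16Guess` and the methods `guessCharset` and `decode` are hypothetical names made up for illustration, and the byte array in `main` is a stand-in for the frame's text bytes rather than the real contents of test.mp3. It relies only on the documented behaviour of Java's "UTF-16" charset, which honours a BOM when decoding and falls back to big-endian when there is none.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jaudiotagger.
final class Utf16Guess {

    // Choose a charset for text bytes that a frame flags as UTF-16 (encoding byte 1).
    static Charset guessCharset(byte[] data) {
        if (data.length < 2) {
            return StandardCharsets.UTF_16; // too short to inspect, keep the default
        }
        int b0 = data[0] & 0xFF;
        int b1 = data[1] & 0xFF;
        // BOM present (FE FF or FF FE): the "UTF-16" charset reads and strips it.
        if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
            return StandardCharsets.UTF_16;
        }
        // No BOM: an empty first byte suggests big-endian, otherwise little-endian.
        // This only holds for mostly-Latin text; it is a guess for CJK text.
        return (b0 == 0x00) ? StandardCharsets.UTF_16BE : StandardCharsets.UTF_16LE;
    }

    static String decode(byte[] data) {
        return new String(data, guessCharset(data));
    }

    public static void main(String[] args) {
        // Stand-in for the broken frame: little-endian text with no BOM.
        byte[] noBom = "Frank Sinatra".getBytes(StandardCharsets.UTF_16LE);
        // Plain "UTF-16" assumes big-endian here and garbles the text.
        System.out.println(new String(noBom, StandardCharsets.UTF_16));
        // The heuristic picks little-endian instead.
        System.out.println(decode(noBom));
    }
}
```

The guess is deliberately limited to the no-BOM case, so well-formed files still go through the normal BOM-driven decoding path.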
