Encoding of arbtt-stats

Issue #32 resolved
amenthes created an issue

Is the encoding of arbtt-stats output guaranteed? I am currently testing with version 0.6.1 on Windows, and I get some weird encoding errors when the window titles contain German umlauts (ä ü ö ß). I'm processing the output with another tool, as you'll probably have guessed by now.

If the encoding is not specified, I'd like to request a mode where arbtt-stats always outputs a predictable encoding, for example UTF-8.

If this has been changed between 0.6 and 0.9, please disregard this ticket.

Comments (15)

  1. nomeata repo owner

    arbtt should handle unicode properly, and output it in whatever locale your system is running. On Linux, I’d say “make sure that LANG is set to a UTF8 locale”...

    Do you get the weirdness only when piping the output to a file or program, or also when you run it as it is?

  2. amenthes reporter

    It would appear that the output is ISO-8859-1 on my Windows machine when piped to another command or file. Currently I have to detect the encoding at runtime and convert to UTF-8.

    I guess I now have two conversions: one by arbtt-stats (internal to ISO-8859-1) and one by my script (ISO-8859-1 to UTF-8). The conversion to ISO-8859-1 will probably be lossy, since there's a bunch of characters it can't represent. I'd love to request a mode where I can force arbtt-stats to output UTF-8 regardless of locale and other environment settings.

  3. amenthes reporter

    chcp does not seem to have an effect. My terminal happily tells me that I'm on that codepage now, but it still outputs ü as 0xFC (ISO-8859-1 or Windows-1252; the two are identical in that range).

    [image: chcp.png]

    produces this byte sequence:

    [image: hex.png]

  4. amenthes reporter

    I am able to convert this in the receiving script now. I'm auto-detecting the encoding and always converting to UTF-8. This way I was able to import ~10,000 window titles, ~400 of which contained German umlauts. Still, I think it would make a nice addition, especially when using arbtt-stats as a stepping stone in a custom chain of tools.

    The current handling works very well in the command line. I have never had a problem with that. I do not want that to change.
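    The receiving script is not shown in this thread. As a rough illustration of the "auto-detect and convert to UTF-8" step described above, here is a minimal Python sketch. The function names are hypothetical, and the fallback to Windows-1252 is an assumption based on the 0xFC byte observed for ü; it is a heuristic, not a reliable detector.

```python
def decode_arbtt_output(raw: bytes) -> str:
    """Decode bytes captured from arbtt-stats into a Python string.

    Hypothetical helper: tries UTF-8 first, then falls back to
    Windows-1252 (an assumption; other code pages are not handled).
    """
    try:
        # Output produced under a UTF-8 code page decodes cleanly here.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: a Western European single-byte encoding.
        # Bytes undefined in cp1252 become U+FFFD instead of raising.
        return raw.decode("cp1252", errors="replace")


def to_utf8(raw: bytes) -> bytes:
    """Normalise captured output to UTF-8 bytes."""
    return decode_arbtt_output(raw).encode("utf-8")
```

    Trying UTF-8 first works because text in a single-byte code page that contains umlauts is almost never valid UTF-8 by accident, but short inputs can still be misclassified.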

  5. nomeata repo owner

    Of course, the question is first: Does arbtt actually save it correctly internally? It could well be that the screen capture is wrong...

    On the other hand, that’s unlikely, as it would then cause mojibake when printing.

    Maybe the problem disappears when I manage to make a new Windows release that is then built with a new version of GHC and the base libraries.

  6. amenthes reporter

    I was using a build from the current head (7e3b5a7e) and used

    dist\build\arbtt-capture\arbtt-capture.exe -f unicode.stuff

    to capture the window title of this website in Firefox: https://www.qnap.com/i/de/news/con_show.php?op=showone&cid=416 which reads "QNAP unterstützt Kodi – ehemals XBMC - zur Multimedia-Wiedergabe"

    Both arbtt-dump and arbtt-stats (same build) have problems with this:

    > dist\build\arbtt-dump\arbtt-dump.exe -f unicode.stuff
    2015-10-14 19:57:44 (0ms inactive):
        ( ) [redacted for privacy reasons]
        ( ) \Device\HarddiskVolume2\Program Files (x86)\Mozilla Firefox\firefox.exe: QNAP unterstützt Kodi arbtt-dump.exe: <stdout>: commitBuffer: invalid argument (invalid character)
    

    The output stops there. No further lines are dumped.

    Please note that the title reads just fine in the terminal. When I write the same output to a file, this happens:

    > dist\build\arbtt-dump\arbtt-dump.exe -f unicode.stuff > unicode.stuff.dump.txt
    arbtt-dump.exe: <stdout>: commitBuffer: invalid argument (invalid character)
    

    (same error and termination of program)

    [image: arbtt-stats-encoding.png]

    The "ü" is converted to 81, which is valid in Codepage 850. This is also what my terminal is set to.

    If I switch my terminal to chcp 65001, the "ü" becomes c3bc, which is actually valid UTF-8. The dump will run through as expected. So in that case, everything is well.

    arbtt-stats is also working after issuing chcp 65001. Interestingly, it does not have the codepage 850 problem. It will work correctly in both cases!

    So there's a small caveat: running arbtt-dump from a plain and simple terminal does not work. One has to issue chcp 65001 first. I am not sure if this can be fixed, but I guess many non-programmer users would find this unnerving.
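    The byte values mentioned above can be reproduced outside arbtt. The following Python sketch uses Python's own codecs as a stand-in for the encoders in the GHC runtime, so it only illustrates the code pages themselves, not arbtt's behaviour. The helper name `try_encode` is hypothetical, and the title is shortened from the one quoted above. That the en dash is the character that makes arbtt-dump abort is an inference from the output stopping right before it.

```python
title = "QNAP unterstützt Kodi – ehemals XBMC"


def try_encode(text: str, codec: str) -> str:
    """Return the encoded bytes as hex, or name the first unencodable character."""
    try:
        return text.encode(codec).hex()
    except UnicodeEncodeError as err:
        bad = text[err.start]
        return f"cannot encode {bad!r} (U+{ord(bad):04X})"


# Compare how each encoding handles a single umlaut and the full title.
for codec in ("cp850", "latin-1", "cp1252", "utf-8"):
    print(f"{codec:8} ü -> {try_encode('ü', codec):5} title -> {try_encode(title, codec)}")
```

    In this sketch "ü" encodes to 81 in code page 850, fc in ISO-8859-1 and Windows-1252, and c3bc in UTF-8, matching the values reported in this thread. The en dash (U+2013) has no representation in code page 850, which is consistent with the "invalid character" error appearing directly after "Kodi".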

  7. amenthes reporter

    There also appears to be an issue with old files created with 0.6: it seems the encoding in the existing legacy logfile might confuse the newer arbtt-stats. I'm investigating. But this also only happens on codepage 850, so a user with that problem can work around it easily. I had no problems with a mixed legacy logfile (mixed in the sense that it was written to by both arbtt-capture 0.6 and 0.9).

  8. nomeata repo owner

    Hmm. I am pretty confident that the log files are fixed to UTF-8, and have been like that since then, so I would hope that reading both old and new files is not a problem.

    Otherwise the behaviour is somewhat expected: The program tries to print according to the current locale (i.e. codepage), and prefers to abort rather than print invalid characters.

    Is it correct that everything works fine as long as your codepage is 65001?

  9. nomeata repo owner

    Ok. I’m inclined to close this, with the argument that if you want to use unicode, you need to use a unicode-aware codepage. Do you agree?

  10. amenthes reporter

    I'm fine with that, but I'd top it off with a note in the Windows section of the readme. Once I understand how packaging an installer works, I might be able to contribute one. But I can't promise when I'll get around to doing that.

  11. nomeata repo owner

    Mention codepage in the windows readme.

    Suggestions to improve this notice and make it easier to follow for “normal” users are welcome. This fixes #32.

    → <<cset 1d8780d50c62>>
