Is the encoding of arbtt-stats output guaranteed? I am currently testing with version 0.6.1 on Windows, and I get some weird encoding errors when the window titles contain German umlauts (ä ü ö ß). I'm processing the output with another tool, as you'll probably have guessed by now.
If the encoding is not specified, I'd like to request a mode where arbtt-stats always outputs a predictable encoding, for example utf-8.
If this has been changed between 0.6 and 0.9, please disregard this ticket.
Comments (15)
-
repo owner -
reporter It would appear that the output is ISO-8859-1 on my Windows machine when piped to another command or file. Currently I have to detect the encoding at runtime and convert to utf-8. I guess I now have two conversions: one by arbtt-stats (internal to ISO-8859-1) and one by my script (ISO-8859-1 to utf-8). The conversion to ISO-8859-1 will probably be lossy, since there are a bunch of characters it can't represent. I'd love to request a mode where I can force arbtt-stats to output utf-8 regardless of locale and other environment settings.
-
repo owner If it is non-trivial to set it via environment variables, I might add a command line flag... but I’m surprised this is so hard.
Have you tried issuing chcp 65001 before running arbtt? According to http://stackoverflow.com/a/388500/946226 this should set the code page to utf8.
-
reporter chcp does not seem to have an effect. My terminal happily tells me that I'm on that codepage now. But it still outputs ü as 0xFC (ISO-8859-1 or Windows-1252, as both would look identical in that area).
produces this byte sequence:
-
reporter I am able to convert this in the receiving script now. I'm auto-detecting the encoding and always converting to utf-8. This way I was able to import ~10,000 window titles, ~400 of which also contained German umlauts. Still, I think it would make a nice addition, especially when using arbtt-stats as a stepping stone in a custom chain of tools.
The current handling works very well in the command line. I have never had a problem with that. I do not want that to change.
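For anyone who needs the same workaround, the "auto-detect and convert to utf-8" step described above could look roughly like the sketch below. This is not the reporter's actual script: the choice of Python, the function names, and the cp1252 fallback are all assumptions made for illustration. It relies on the fact that a lone 0xFC byte is never valid UTF-8, so a failed UTF-8 decode is a fairly reliable hint that the bytes came from a single-byte codepage.

```python
def decode_tool_output(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode captured bytes as UTF-8, else as a single-byte codepage.

    Hypothetical helper: `fallback` is an assumption; use the codepage
    your own system actually produces (for example "cp850").
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values undefined, so replace
        # undecodable bytes instead of raising a second error.
        return raw.decode(fallback, errors="replace")


def to_utf8(raw: bytes) -> bytes:
    """Normalise captured bytes to UTF-8, whatever they arrived as."""
    return decode_tool_output(raw).encode("utf-8")
```

Input that is already UTF-8 passes through unchanged, so the same function works whether or not the codepage was switched beforehand.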
-
repo owner Of course, the question is first: Does arbtt actually save it correctly internally? It could well be that the screen capture is wrong...
On the other hand, that’s unlikely, as it would then cause mojibake when printing.
Maybe the problem disappears when I manage to make a new Windows release that is then built with a new version of GHC and the base libraries.
-
reporter I was using a build from the current head (7e3b5a7e) and used
dist\build\arbtt-capture\arbtt-capture.exe -f unicode.stuff
to capture the window title of this website in firefox: https://www.qnap.com/i/de/news/con_show.php?op=showone&cid=416 which reads "QNAP unterstützt Kodi – ehemals XBMC - zur Multimedia-Wiedergabe"
both arbtt-dump and arbtt-stats (same build) have problems with this:
> dist\build\arbtt-dump\arbtt-dump.exe -f unicode.stuff
2015-10-14 19:57:44 (0ms inactive):
    ( ) [redacted for privacy reasons]
    ( ) \Device\HarddiskVolume2\Program Files (x86)\Mozilla Firefox\firefox.exe: QNAP unterstützt Kodi
arbtt-dump.exe: <stdout>: commitBuffer: invalid argument (invalid character)
The output stops there. No further lines are dumped.
Please note that the title reads just fine in the terminal. When I write the same output to a file, this happens:
> dist\build\arbtt-dump\arbtt-dump.exe -f unicode.stuff > unicode.stuff.dump.txt
arbtt-dump.exe: <stdout>: commitBuffer: invalid argument (invalid character)
(same error and termination of program)
The "ü" is converted to 81, which is valid in Codepage 850. This is also what my terminal is set to. If I switch my terminal with chcp 65001, the "ü" becomes c3bc, which is actually valid utf8. The dump will run through as expected. So in that case, everything is well.
arbtt-stats is also working after issuing chcp 65001. Interestingly, it does not have the codepage 850 problem. It will work correctly in both cases!
So there's a small caveat that running arbtt-dump from a plain and simple terminal does not work. One has to issue the chcp 65001. I am not sure if this can be fixed; I guess many non-programmer users would find this unnerving.
-
reporter (tiny correction to the post above)
-
reporter There also appears to be an issue with old files created with 0.6: it seems the encoding in the existing legacy logfile might confuse the newer arbtt-stats. I'm investigating. But this also only happens on codepage 850, so a user with that problem can work around it easily. I had no problems with a mixed legacy logfile (mixed in the sense that it was written to by both arbtt-capture 0.6 and 0.9).
-
repo owner Hmm. I am pretty confident that the log files are fixed to utf8, and have been like that since then, so I would hope that the reading of old and new files is not a problem.
Otherwise the behaviour is somewhat expected: The program tries to print according to the current locale (i.e. codepage), and prefers to abort rather than print invalid characters.
Is it correct that everything works fine as long as your codepage is 65001?
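The "abort rather than print invalid characters" behaviour can be imitated in Python to show why the dump stopped right after "QNAP unterstützt Kodi": the "ü" exists in codepage 850, but the en dash that follows it does not. The sketch below is a Python analogy only, not arbtt's or GHC's actual implementation, and the names `TITLE` and `encode_strict` are invented for the example.

```python
TITLE = "QNAP unterstützt Kodi – ehemals XBMC"


def encode_strict(text: str, codec: str):
    """Encode `text`, stopping at the first unencodable character.

    Returns (encoded_bytes, None) on success, or
    (bytes_encoded_before_the_failure, offending_character) otherwise.
    """
    try:
        return text.encode(codec), None
    except UnicodeEncodeError as err:
        # Everything before err.start was encodable in this codec.
        return text[:err.start].encode(codec), text[err.start]


written, bad_char = encode_strict(TITLE, "cp850")
print(written)          # b'QNAP unterst\x81tzt Kodi '
print(repr(bad_char))   # '–'

# The lossy alternative: substitute '?' instead of aborting.
lossy = TITLE.encode("cp850", errors="replace")
```

With `"utf-8"` as the codec the whole title encodes and the second element is `None`, which matches the observation that everything works under codepage 65001.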
-
reporter Yes, in CP65001, everything is fine.
-
repo owner Ok. I’m inclined to close this, with the argument that if you want to use unicode, you need to use a unicode-aware codepage. Do you agree?
-
reporter I'm fine with that, but I'd top it off with a note in the Windows section of the readme. Once I understand how packaging an installer works, I might be able to contribute one. But I can't promise when I'll get around to doing that.
-
repo owner - changed status to resolved
Mention codepage in the windows readme.
Suggestions to improve this notice and make it easier to follow for “normal” users are welcome. This fixes #32.
→ <<cset 1d8780d50c62>>
-
repo owner Heh, when trying to run the test suite under wine I am now stuck with the same problem, and here, I don’t even have chcp available. I hope someone can help me at http://stackoverflow.com/questions/33156758/get-haskell-programs-to-assume-a-utf8-locale-under-wine.
arbtt should handle unicode properly, and output it in whatever locale your system is running. On Linux, I’d say “make sure that LANG is set to a UTF8 locale”...
Do you get the weirdness only when piping the output to a file or program, or also when you run it as it is?