Repository character encoding detection isn't accurate in some cases (BB-2979)

Dear BitBucket!

When I create a folder named "РегистрСведений1", bitbucket.org shows it with the name "–егистр—ведений1".

You can see this bug in this repo: https://bitbucket.org/karlobruni/co2/src/d500b04e6da7/examples

Thank you very much for your response!

  1. David Chambers

    Unless I'm mistaken, this is not a Bitbucket error. If a file's name is correctly encoded it will display correctly on Bitbucket. See, for example, davidchambers/i18n-test/src/35c52e8acd66/Семестр4.

    How was the folder created? If your operating system is letting you down you could use the mkdir command (or the Windows equivalent). This is how I created the folder displayed in the aforementioned link.

  2. Brodie Rao

    The problem here is that we're trying to convert the filename to Unicode, but our detection code thinks it's MacCyrillic instead of windows-1251.

    Detecting character encoding is inherently about making educated guesses, so we can't always get it right. When this happens, we do try to degrade gracefully, and I think we have in this case.

    That said, we can probably improve our character detection. For example, we could do character detection at a whole-repo level (using every filename), instead of at a per-filename/path level. I've filed an internal issue to take a look at doing this.
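
To make the diagnosis above concrete, here is a minimal Python sketch. It is not Bitbucket's detection code. It assumes Python's standard `cp1251` and `mac_cyrillic` codecs as stand-ins for windows-1251 and MacCyrillic. The helper names (`cyrillic_letters`, `guess_repo_encoding`), the candidate list, the letter-counting heuristic, and the two extra folder names are all hypothetical and chosen only for illustration. The first part reproduces the garbled name from the report. The second part shows one way a whole-repo guess could work: try each candidate on every filename and keep the one that decodes them all with the most Cyrillic letters.

```python
# Illustrative sketch only; not Bitbucket's actual detection code.
CANDIDATES = ["utf-8", "cp1251", "mac_cyrillic"]  # assumed candidate list


def cyrillic_letters(text):
    """Count characters in the Cyrillic block (U+0400..U+04FF)."""
    return sum(1 for ch in text if "\u0400" <= ch <= "\u04ff")


def guess_repo_encoding(raw_names, candidates=CANDIDATES):
    """Return the candidate that decodes every name and yields the most
    Cyrillic letters across all names, or None if no candidate decodes them."""
    best, best_score = None, -1
    for enc in candidates:
        try:
            decoded = [name.decode(enc) for name in raw_names]
        except UnicodeDecodeError:
            continue  # this encoding cannot decode at least one name
        score = sum(cyrillic_letters(name) for name in decoded)
        if score > best_score:
            best, best_score = enc, score
    return best


# The folder name from the report, as bytes written by a windows-1251 client.
raw = "РегистрСведений1".encode("cp1251")

# Decoding those bytes as MacCyrillic turns the capital letters
# (bytes 0xD0 and 0xD1) into dash characters, as seen in the report.
garbled = raw.decode("mac_cyrillic")

# Hypothetical repository listing, all names encoded as windows-1251.
repo_names = [raw, "Документы".encode("cp1251"), "Справочник".encode("cp1251")]
guess = guess_repo_encoding(repo_names)

print(garbled)
print(guess)
```

The lowercase Cyrillic letters occupy mostly the same byte values in both encodings, which is why only the capital letters in the reported name are damaged and why a single short filename gives a detector little to go on. Pooling every filename in the repository gives a heuristic like this more capital letters, and so more evidence, to tell the two encodings apart.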

