

Issue #5648 resolved

UTF-16 little endian files are downloaded rather than displayed (BB-6918)

Eric Knibbe
created an issue

When working in AppleScript, using Script Editor to save a file as text generally encodes it in Latin-9, or Mac OS Roman if accented characters are used. If any characters not covered by those sets are present, however, the text file is encoded as UTF-16LE. Clicking on such a file in Bitbucket downloads it rather than displaying and colourizing it inline.

As an example, here's a UTF-16LE file: https://bitbucket.org/EricFromCanada/ericfromcanada.bitbucket.org/src/55865a3ee2bf/applescript/close%20Safari%20Web%20Inspector.applescript

And its next revision, after conversion to UTF-8+BOM: https://bitbucket.org/EricFromCanada/ericfromcanada.bitbucket.org/src/51ba083a8253/applescript/close%20Safari%20Web%20Inspector.applescript
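
For anyone wanting to apply the same workaround, the conversion to UTF-8+BOM can be sketched roughly as below. This is only an illustration, not the tool actually used for the revision above: the helper name `utf16_to_utf8_bom` and the sample script text are made up, and it assumes the UTF-16 input starts with a BOM.

```python
import codecs


def utf16_to_utf8_bom(data: bytes) -> bytes:
    """Re-encode BOM-prefixed UTF-16 bytes as UTF-8 with a BOM (hypothetical helper)."""
    # The "utf-16" codec reads the leading BOM to pick the byte order, then drops it.
    text = data.decode("utf-16")
    # The "utf-8-sig" codec writes the three-byte UTF-8 BOM before the encoded text.
    return text.encode("utf-8-sig")


# Made-up sample standing in for a file saved by Script Editor as UTF-16LE.
script_text = 'tell application "Safari" to activate -- ⌘⌥I'
sample = codecs.BOM_UTF16_LE + script_text.encode("utf-16-le")

converted = utf16_to_utf8_bom(sample)
```

The converted bytes contain no NUL bytes, which is likely why the second revision is treated as text.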

Comments (31)

  1. Ben Lachman

    I don't think "trivial" is the correct priority for this bug. Cocoa projects lose the ability to diff strings files because of it, which makes the localization process super annoying.

  2. Ben Lachman

    Good link, Eric. That definitely improves things. I'm kind of surprised Xcode doesn't auto-convert these at this point, since the feature has been around for quite a while.

  3. _dev_

    I would expect that normal UTF-16 files (the ones with a BOM) should definitely be classed as text files. In any case, file content is already examined when deciding what type it is.

  4. Will Brown

    My organization was looking to move our Microsoft BizTalk Server codebase (which is developed in Visual Studio) into Bitbucket.org, but most of the BizTalk code artifacts are UTF-16 LE BOM... As long as we didn't attempt merges, we were ok, but as soon as we did, the code became corrupt and unreadable.

    I'm a Git noob (I pretty much rely on the SourceTree GUI) so I was hoping this was my fault, but it looks like I'm not alone in the UTF-16 woes...

  5. Scoopta

    All of my code is UTF-16BE and it'd be really nice to actually have a source view on the website. There is a Mercurial extension for UTF-16 diffs. I haven't used it, primarily because I haven't needed diffs; otherwise I'd probably give it a try. But that doesn't fix the issues on Bitbucket's end.

  6. Abhin Chhabra staff

    The fix has been deployed to production. The example link (in the bug report description) now works as expected. Bitbucket now respects the BOM in the file and doesn't consider files starting with the BOMs for UTF-8, UTF-16-LE, UTF-16-BE, UTF-32-LE and UTF-32-BE to be binary.

  7. Tom Kedem

    Hey, thanks a lot for finally fixing this :)

    However, it's still not possible to view a commit diff... I see a "modified" marker, but it says "File contents unchanged." while showing +0 and -0 lines added/removed, which is not the case. I expect those files to get the same treatment as regular text files.

  8. Abhin Chhabra staff

    You're right Tom Kedem. But since that issue is unrelated to this one (this one was about the source view), I've created a separate ticket to track it (https://bitbucket.org/site/master/issues/13930/utf-16-and-utf-32-files-dont-show-up-in).

    In this case, the issue is that Git itself doesn't (by default) recognize UTF-16 files as text. In fact, running git show locally on a commit that updates a UTF-16 file also seems to claim that the 2 files are binary. As mentioned in that new ticket, it is possible to fix this, but it would have to be a separate piece of work.
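
To make the two behaviours discussed in this thread concrete, here is a rough sketch of BOM-aware text detection next to a plain NUL-byte check. It is not Bitbucket's or Git's actual code: the names `sniff_bom` and `looks_binary`, and the 8000-byte window, are assumptions for illustration. The NUL-byte check is only a stand-in for the kind of default heuristic that makes UTF-16 content look binary.

```python
import codecs

# Longer BOMs are listed first: the UTF-32-LE BOM (FF FE 00 00) begins with
# the UTF-16-LE BOM (FF FE), so checking UTF-16 first would misclassify it.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

FIRST_FEW_BYTES = 8000  # assumed size of the window inspected for NUL bytes


def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def looks_binary(data: bytes, respect_bom: bool = True) -> bool:
    """Guess whether data is binary (hypothetical heuristic)."""
    if respect_bom and sniff_bom(data) is not None:
        return False  # a recognised BOM marks the file as text
    # Fallback: any NUL byte near the start is taken as a sign of binary data.
    return b"\x00" in data[:FIRST_FEW_BYTES]


utf16_sample = codecs.BOM_UTF16_LE + "hello\n".encode("utf-16-le")
utf32_sample = codecs.BOM_UTF32_LE + "hi".encode("utf-32-le")
```

With `respect_bom=False`, `utf16_sample` is flagged as binary because every ASCII character carries a NUL byte in UTF-16. With the BOM respected, it is treated as text.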
