File encoding in diff view (BB-10792)

Issue #9686 open
Alpha Nieves
created an issue

When I view a commit that converts a file from ISO-8859-1 to UTF-8 to fix accented characters, the diff view shows the result inverted. For example:

-Categoría:</td><td align="right">
+CategorĂ­a:</td><td align="right">

But in the source code in RAW mode, the code is:

Categoría:</td><td align="right">

I refactored my company's project to fix this, converting all ISO-8859-1 files to UTF-8 with iconv. For every such commit, the diff view shows the result inverted (only in the diff view; the code itself is perfect!).
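
One plausible way to reproduce this kind of garbling is to decode UTF-8 bytes with a single-byte encoding. The sketch below is only a guess at the mechanism: `misread` is a hypothetical helper, and the choice of ISO-8859-2 as the wrongly applied encoding is an assumption, not something confirmed for this commit.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one charset, then decode the bytes with another."""
    return text.encode(written_as).decode(read_as)


# The UTF-8 form of "í" is two bytes (0xC3 0xAD). Read as ISO-8859-2
# (an assumed wrong guess), they become "Ă" plus an invisible soft hyphen.
garbled = misread("Categoría", "utf-8", "iso8859_2")
print(ascii(garbled))

# When the bytes are read with the encoding they were written in,
# the text survives unchanged.
intact = misread("Categoría", "iso8859_1", "iso8859_1")
print(intact)
```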

I can show the link to commit to Bitbucket's support team if needed.

Comments (41)

  1. Luciano Silveira

    I'm writing a text in TeX, in Portuguese, and I ran into a problem with character encoding. This text:

    %%%%%%%%%%%%%% Material e métodos
    \cleardoublepage
    \chap Material e métodos
    

    becomes

    %%%%%%%%%%%%%% Material e mйtodos
    \cleardoublepage
    \chap Material e mйtodos
    

    Is there a solution?

  2. Alpha Nieves reporter

    The problem is simple: the view only needs to use the right encoding to show the file content correctly. It is not acceptable that this ticket has been open since 2014-06-10. Our teams have migrated all repos to GitLab just because of this issue.

    Bitbucket, you are losing people every day thanks to this slow issue workflow.

  3. Luiz Filho

    Because of this and many other bugs, but more importantly the lack of support and improvements on BitBucket, we moved over to GitHub.

    I can contribute to the thread with the solution I created for my company: I wrote a shell script that converts all files to UTF-8, ran it on master and all other branches, and then instructed all developers to use only UTF-8 for their files. It was a lot of work, though.

  4. John Brewster

    This is affecting our ability to use Bitbucket's pull request functionality. Our code reviews obviously rely on team members being able to review code via the diff viewer within a Bitbucket pull request. However, because our source code files are encoded as ISO-8859-1, this problem with the diff viewer means code reviews aren't possible for source code files which contain characters such as the Euro symbol € and others outside of the Basic Latin ASCII range, because said characters are replaced with �. The only workaround is for team members to separately browse to the appropriate branch and view the source code in raw mode, which is not ideal.

    Being able to change the encoding used by Bitbucket's diff viewer (either repository-wide, or on a file-by-file basis) should be changed from minor to major priority if Bitbucket are serious about supporting pull request functionality.

  5. Erik van Zijst staff

    @John Brewster I'm pasting the response from the support case here, as it might be useful to the others that have run into this.

    The problem here lies with trying to detect the character encoding of your files.

    As you may know, Git and Mercurial store raw bytes without any meta information about encoding. This means that the client will have to make assumptions about the character set when converting the bytes to text for printing.

    Bitbucket faces these problems when displaying source code on a page. We serve our pages as UTF-8 and so when we display repo content, we need to be able to distinguish between binary files (that we cannot inline) and text. And then for text files, do our best to guess the encoding so we can decode it to unicode.
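
    As a rough illustration of the binary-versus-text split described above, here is a minimal Python sketch. The NUL-byte test and the 8000-byte window are assumptions borrowed from a common heuristic, not Bitbucket's actual check, and `looks_binary` is a hypothetical name.

```python
def looks_binary(data: bytes, sniff: int = 8000) -> bool:
    """Treat content as binary if a NUL byte appears in the first `sniff` bytes.

    Assumed heuristic for illustration only; real services may use
    different or additional checks.
    """
    return b"\x00" in data[:sniff]


print(looks_binary(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))  # image header bytes
print(looks_binary("Categoría".encode("latin-1")))           # accented text
print(looks_binary("hi".encode("utf-16")))                   # UTF-16 text
```

    A check like this would flag UTF-16 text as binary, because UTF-16 encodes ASCII characters with a NUL byte, so a real implementation would need extra handling for that case.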

    It is this detection that is problematic. These days most files use UTF-8 and that is the first thing we will try. However, Windows still actively uses several other encodings (including UTF-16, ISO-8859-2 and Windows-1252).
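
    The "try UTF-8 first" strategy can be sketched as follows. The fallback list and the name `decode_best_effort` are assumptions for illustration. The guessing step described in this comment uses chardet, which this standard-library-only sketch does not reproduce.

```python
from typing import Sequence, Tuple


def decode_best_effort(
    data: bytes, fallbacks: Sequence[str] = ("cp1252", "latin-1")
) -> Tuple[str, str]:
    """Return (text, encoding_used), trying strict UTF-8 before any fallback."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    for name in fallbacks:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # Nothing fit: decode lossily, which yields U+FFFD replacement characters.
    return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"


print(decode_best_effort("Categoría".encode("utf-8")))
print(decode_best_effort("€".encode("cp1252")))
print(decode_best_effort(b"\x81", fallbacks=("cp1252",)))
```

    The last call shows where the � symbol comes from: when no candidate encoding accepts the bytes, the undecodable byte is replaced with U+FFFD.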

    We use a common library for this (http://chardet.readthedocs.org/en/latest/) that when run on your file, incorrectly classifies it as ISO-8859-2:

    In [16]: chardet.detect(open('file.cls').read())  
    Out[16]: {'confidence': 0.7514906781486318, 'encoding': 'ISO-8859-2'}
    

    The amount of content and symbols used affect the detection process and so different files encoded with the same encoding can end up being detected differently. For instance, if I grep out just the lines that contain the currency symbols, we get:

    In [22]: chardet.detect(open('file.cls').read()[4900:5200])
    Out[22]: {'confidence': 0.73, 'encoding': 'windows-1252'}
    

    It's important to realize that many encoding schemes overlap, and detecting text encoding is an inherently unreliable process. In fact, chardet has a paragraph dedicated to windows-1252 specifically: http://chardet.readthedocs.org/en/latest/how-it-works.html#windows-1252
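
    To make the overlap concrete, the sketch below decodes one byte string under several candidate encodings. The candidate list and the helper name `readings` are assumptions chosen for illustration. The point is only that several single-byte encodings accept the same bytes without error, so nothing in the bytes themselves says which reading is right.

```python
CANDIDATES = ("utf-8", "latin-1", "cp1252", "iso8859_2", "cp1251")


def readings(data: bytes) -> dict:
    """Map each candidate encoding that accepts `data` to the text it yields."""
    result = {}
    for name in CANDIDATES:
        try:
            result[name] = data.decode(name)
        except UnicodeDecodeError:
            pass  # this encoding rejects the bytes outright
    return result


sample = "métodos".encode("latin-1")  # the byte 0xE9 stands for "é" here
for name, text in readings(sample).items():
    print(f"{name:10} {text}")
```

    Strict UTF-8 rejects these bytes, three of the single-byte encodings agree on "é", and Windows-1251 reads the same byte as a Cyrillic letter. A detector has to pick among the valid readings using statistics alone.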

    Now on your PR you noticed that the file is displayed correctly when you hit "raw". However, this is not because Bitbucket suddenly knows how to detect it better. What happens here is that when we serve raw files, we do just that: we read the file byte for byte and echo that to your browser. We are not serving text. We are essentially uploading binary content. We don't even include a Content-Type header in the response and make it entirely the client's problem to interpret the content.
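
    The difference between the two code paths can be sketched like this. Both function names and the header value are hypothetical. The only details taken from the description above are that the raw path echoes bytes without a Content-Type header and that pages are served as UTF-8.

```python
from typing import List, Tuple

Headers = List[Tuple[str, str]]


def serve_raw(data: bytes) -> Tuple[Headers, bytes]:
    """Echo the stored bytes untouched and declare nothing about them."""
    return [], data


def render_for_page(data: bytes, guessed_encoding: str) -> Tuple[Headers, bytes]:
    """Decode with a guessed encoding, then re-encode as UTF-8 for the page."""
    text = data.decode(guessed_encoding, errors="replace")
    return [("Content-Type", "text/html; charset=utf-8")], text.encode("utf-8")


stored = "métodos".encode("latin-1")

# Raw path: the bytes reach the browser unchanged, so the browser guesses.
print(serve_raw(stored))

# Page path with a wrong guess: the mistake is baked into the UTF-8 output.
headers, body = render_for_page(stored, "cp1251")
print(headers, body.decode("utf-8"))
```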

    Clearly modern browsers' heuristics here are different from the chardet library's, as the browser comes to a different conclusion (in this particular instance the correct one). It is entirely possible for another browser to get it wrong (particularly old, pre-HTML5 ones, as HTML5 specifies specific encoding detection rules).

    To wrap this all up, I'm afraid there's not too much I can do about this. If you allow me, I could take your file to the chardet maintainers to see if things could be tweaked without breaking things elsewhere, but at the end of the day character encoding detection is a flawed, unreliable business. As long as Git and Hg don't provide this meta data, clients are left guessing.

  6. Mauro Molinari

    Still can't understand why BitBucket can't simply allow the user to specify the encoding at folder or repository level, or support git encoding attribute in .gitattributes file...

  7. Erik van Zijst staff

    @Mauro Molinari Yes, it could. And we probably should try that first, when available. However, John's case was for Mercurial and I'm not aware of a similar mechanism to allow clients to publish encoding meta data to a remote hg repository.

    For Git we have an existing internal issue for which I will raise the priority.

  8. Sérgio Siegrist

    While "trying to detect the character encoding" is not the main problem, since it could or should be specified otherwise, you should take a look at your library. Nowhere else have I seen simple WINDOWS-1252 (or ISO-8859-1) files being decoded with Russian characters as a rule. I'm afraid that library is broken. I suppose there was an issue from a Russian member that fixed this for him and broke it for everybody else. I don't recall its number now.

  9. John Brewster

    Erik, while I appreciate your response regarding the difficulties faced with reliable character encoding detection, our main concern remains only the problems with the Bitbucket diff viewer, within pull request code reviews.

    If unresolved, this is going to become more of an issue for us because our company is currently expanding into a second international market. Efficient timely pull request code reviews are a vital part of our development team's daily work.

    For that reason, I still think that Bitbucket needs to provide a simple option to change the encoding used when displaying the online diff viewer results, again either repository-wide or on a file-by-file basis.

    In reply to your question, yes you are welcome to take our file to the chardet maintainers, for further analysis.

  10. Erik van Zijst staff

    Yes, I agree with you both. For Git we should respect .gitattributes (and we'll definitely do that), while for Mercurial we might have to invent our own solution.
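
    For the Git side, a lookup of a per-path encoding from .gitattributes might look roughly like the sketch below. It assumes the viewer would read the `encoding` or `working-tree-encoding` attribute. Both are documented Git attributes, but nothing in this thread says which one Bitbucket would honor. It also uses simple fnmatch-style globbing rather than Git's full pattern rules, and `declared_encoding` is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from typing import Optional

ENCODING_ATTRS = ("encoding", "working-tree-encoding")


def declared_encoding(attributes_text: str, path: str) -> Optional[str]:
    """Return the encoding declared for `path`, or None if none is declared.

    Later matching lines override earlier ones.
    """
    encoding = None
    for line in attributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attrs = line.split()
        if not fnmatchcase(path, pattern):
            continue
        for attr in attrs:
            name, sep, value = attr.partition("=")
            if name in ENCODING_ATTRS and sep:
                encoding = value
    return encoding


example = "# legacy sources\n*.cls encoding=iso-8859-1\n*.html encoding=utf-8\n"
print(declared_encoding(example, "file.cls"))
print(declared_encoding(example, "README"))
```

    Only `name=value` attributes count here, so a line such as `*.html utf-8` would set an attribute called `utf-8` and declare no encoding at all.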

  11. Antanas V.

    Erik, do you have plans to implement this feature?

    This is the main reason our company is still hesitant to move to Bitbucket, since we use the ISO-8859-13 charset for our biggest project.

  12. Minku Yeo

    @Erik van Zijst: Same here. Sometimes my teammates write comments in Korean and push them to Bitbucket. However, the Korean characters are not displayed.

    Anyway, are there any plans to fix this? My team desperately wants this issue fixed.

  13. Git Bot

    MinKu_Yeo, when I created the issue (on 2014-06-10) my team needed to see the right encoding in the diff view. Now, 2 years later, all my team members have moved to GitHub and it works like a boss, so the only solution for this issue is to fuck off from Bitbucket. Also try GitLab, which works like a charm.

  14. Saeid M

    I'm also having a problem when editing HTML files on Bitbucket that contain German characters with umlauts (ö, ü, etc.). They show up wrong in the browser (e.g. Tübingen shows as T�bingen). GitHub doesn't seem to have a problem with the same characters, however. Is moving to GitHub the only way?

  15. Sean Farley staff

    Now that we've upgraded our pygit2 usage, we should be able to implement this using Git's .gitattributes file. Are any of the repos here already using that?

  16. Saeid M

    I added a .gitattribute with the following line in it, but it did not fix the problem for me!

    *.html utf-8

    What I find strange is that when I view HTML files on the Bitbucket website (using either the View or the Raw button) the umlaut characters show correctly, but if I pull the files to my local machine and view them in an editor, the umlaut characters show wrong!

  17. Alpha Nieves reporter

    This is the real problem: in your code you can see the characters perfectly, but the Bitbucket website (diff view) always shows them with the wrong charset. For example, in your code you have:

    Categoría
    

    In the Bitbucket website (diff view) it shows as follows:

    CategorĂ­a
    

    Ergo, if the Bitbucket website (diff view) shows the code perfectly, it means the charset in your code is wrong.

  18. Alpha Nieves reporter

    The same problem happens if you edit a file in the repository online on the Bitbucket website and enter a special character. It is a Bitbucket web interface problem, not a Git issue.

  19. Simo Tuokko

    +1 Just found this as new issue, with Spanish characters in a UTF-8 file.

    Atlassian, please fix this (it has been known for 3 years): data integrity is a really important part of a code repository service. Especially when this only happens in localization files that are not in the native language of most developers, these changes would be really difficult to track down afterwards. If I had not noticed this myself in my IDE, we would have waited to hear about it from our Spanish end-users.
