Issue #6980 duplicate

Cannot view Connect.cs file

Maxim Novikov
created an issue

I have a file in my repository named "Connect.cs". When I click on it, the file is downloaded instead of being opened in the source view page. This happens only with the "Connect.cs" file.

Comments (15)

  1. Zach Davis staff

    This should only happen if the file in question is a binary file, in which case "viewing" the file would only result in garbled text.

    I tried committing a file with a .cs extension, and was able to view it properly on Bitbucket. Is the file in question a normal text file?

  2. Erik van Zijst staff

    To check if a file contains binary data, Bitbucket looks for the presence of a NULL byte (\0).

    Your Connect.cs file is encoded in UTF-16LE, so it contains a lot of null bytes (one for each ASCII character).

    This is a known issue with Bitbucket. We should have a switch on the repo admin page that allows the user to explicitly state what the encoding of the repo is. We can then flag every file that fails to decode in that encoding as binary.

    Not sure how useful this is for you, but switching your encoding to UTF-8 will allow you to work around this issue.
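To illustrate the check described in this comment, here is a minimal Python sketch. It is not Bitbucket's actual code: the helper names `has_null_byte` and `reencode_to_utf8` and the sample source string are assumptions made for the example. It shows why UTF-16LE text trips a NULL-byte check and why re-encoding to UTF-8 avoids it.

```python
def has_null_byte(data: bytes) -> bool:
    """Naive binary check: any NULL byte means 'binary'."""
    return b"\x00" in data


def reencode_to_utf8(data: bytes, source_encoding: str = "utf-16-le") -> bytes:
    """Decode bytes from the given encoding and re-encode them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


# Hypothetical file contents, ASCII characters only.
source = "class Connect {}\n"

utf16_le = source.encode("utf-16-le")
utf8 = source.encode("utf-8")

# Every ASCII character becomes two bytes in UTF-16LE, the second one 0x00.
print(has_null_byte(utf16_le))   # True
print(has_null_byte(utf8))       # False

# The workaround from the comment: convert the file to UTF-8.
print(has_null_byte(reencode_to_utf8(utf16_le)))  # False
```

A real file would usually start with a BOM, in which case decoding with "utf-16" (which consumes the BOM) would be the more natural choice than "utf-16-le".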

  3. Erik van Zijst staff

    Sure. However, the mere presence of a BOM provides no guarantee that the file actually is text, which complicates things in Bitbucket.

    It's currently taking a pessimistic approach, erring on the side of binary, but we have an internal issue to overhaul our character decoding logic (#5648).

  4. Maxim Novikov reporter

    If the first bytes look like a Unicode BOM, you could at least check for zeros not in each byte (8 bits), but in each word (16 bits) or double word (32 bits), depending on which BOM it was. Basically, what the BOM tells you is how much space is given to a single character (8, 16 or 32 bits).

  5. Erik van Zijst staff

    Yes, but that kind of encoding-specific logic becomes very unwieldy very quickly, and you'd also have to cater for endianness. UTF-[8|16|32] are just the tip of the iceberg. We need a solution that can handle every kind of encoding.

  6. Maxim Novikov reporter

    What else would you run into apart from that? You check whether a file is in a "text" format or not in order to generate the correct response to the user. The file is either Unicode or it is not. Depending on that, you choose what to treat as a character (8/16/32 bits) and check whether all the "characters" are in some predefined range (or just not zeros, as it works now). What else is actually under the tip of that iceberg?

  7. Erik van Zijst staff

    There are many non-UTF encoding schemes, several of which are popular in Asian countries (e.g. ISO-2022-CN).

    We currently use chardet to help resolve a file's encoding scheme, which can often correctly distinguish between a large set of encoding schemes, including all the UTF variants.

    However, using chardet is expensive, which is why we sometimes resort to the much faster NULL-check (which is obviously broken as you and others have discovered).
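One possible way to combine the two checks mentioned here is sketched below in Python. How Bitbucket actually chooses between them is not stated in this thread, so the ordering (cheap NULL check first, chardet only when it fires), the function name `looks_binary` and the `sample_size` parameter are all assumptions. `chardet.detect` is the library's real entry point: it takes bytes and returns a dict with an "encoding" key.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None  # the sketch then falls back to the NULL check alone


def looks_binary(data: bytes, sample_size: int = 4096) -> bool:
    """Guess whether data is binary, using the cheap check first."""
    sample = data[:sample_size]
    if b"\x00" not in sample:
        return False  # fast path: no NULL byte, treat as text
    if chardet is None:
        return True   # pessimistic fallback, as in the current behaviour
    # Slow path: only reached for data that contains a NULL byte.
    guess = chardet.detect(sample)
    return guess["encoding"] is None


print(looks_binary(b"plain ASCII text\n"))  # False
```

With this ordering, files without NULL bytes never pay the chardet cost. The result for data that does contain NULL bytes depends on what chardet reports for that sample.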

  8. Chunmin Tai

    I have the same problem. My company uses Unity across Mac OS X and Windows 7/8. Because .cs files on Windows must have a UTF-8 BOM, I wrote a script to batch add it. I found that SourceTree on Windows can view a file with that BOM, but on Mac OS X it is shown as a binary file. How can I fix this problem?

  9. Erik van Zijst staff

    Because cs file on Windows must have UTF-8 BOM, I wrote a script to batch add it.

    This issue is about UTF-16 files, not UTF-8. When you say you add a UTF-8 BOM I'm assuming that as part of that you also convert the entire file from UTF-16 to UTF-8?

  10. Chunmin Tai

    Hello Erik Twee, I found that the problem is that my .gitattributes sets *.cs as binary.

    I did that because I wanted to configure Git to preserve the line endings of .cs files, but it means SourceTree also treats .cs files as binary.

    It was just a misunderstanding.