File name is different if uploaded from RC or pushed from HG

Issue #398 resolved
ismdiego created an issue

You can find both the RC and Repos folders to reproduce this in the attached ZIP file. Setup details:

Admin: admin (admin123)
Users: user1 (123456), user2 (123456)
Repository groups: Group1 (default->none, user1->read)
Repositories: Repo1 under Group1 (default->none, user1->read)
RC installed from scratch on a clean Windows XP SP3 (English), with TortoiseHG v2.3.1 and HG v2.1.1 installed


  1. Upload "File_with_tíldés.txt" using RhodeCode to Repo1

  2. Upload "Second File_with_tíldés.txt" using HG to Repo1 (clone repo, add, commit, push)

  3. Create a new clone of Repo1 using HG and do an update

Problems found with the new clone created at step 3:

"File_with_tíldés.txt" is shown correctly in the Repo1 files section, and its content can also be viewed correctly, but when you do an hg clone, the filename is wrong (although the file content is fine)

"Second File_with_tíldés.txt" is shown incorrectly in the Repo1 files section, and its content cannot be viewed. An error "There is no file nor directory at the given path: 'Second File_with_t\xef\xbf\xbdld\xef\xbf\xbds.txt' at revision '0209fa3b1600'" is shown at the top. However, both the file name and the content are OK in the HG clone

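The bytes in that error message are consistent with a single-byte Windows filename being decoded as UTF-8 with replacement characters. The following is only a minimal sketch of that effect, not RhodeCode or Mercurial code; it assumes the client stored the name in cp1252 (the usual Western European Windows code page) and that the server substitutes U+FFFD for bytes it cannot decode.

```python
# Illustration only: how a cp1252 filename can turn into the bytes seen in the error.
name = "Second File_with_t\u00edld\u00e9s.txt"  # "Second File_with_tíldés.txt"

# Bytes as a Western European Windows client would store them (assumption: cp1252).
raw = name.encode("cp1252")

# A server that assumes UTF-8 and substitutes U+FFFD for each undecodable byte.
shown = raw.decode("utf-8", errors="replace")

# Looking the file up again by the displayed name produces different bytes,
# so the path no longer matches what is stored in the repository.
lookup = shown.encode("utf-8")

print(raw)
print(lookup)
print(lookup == raw)
```

Once the replacement character has been substituted, the original bytes cannot be recovered from the displayed name, which would explain why the file can be listed but not opened.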

Maybe this problem is hard or even impossible to fix. Both the HG and Git developer communities are intransigent about how files are handled in Windows environments; they use an "agnostic" (transparent) approach and leave it to the end user/OS to apply the appropriate file encoding scheme. The problem is that you cannot specify it under Windows (I think it is possible under Linux variants), so it would be BETTER if HG simply used a specific encoding internally for the repository bytes (say UTF8, for instance) and the OS ports then converted to the final OS container... but this is not accepted, as I said before.

This renders both HG and Git unusable for multilanguage development under Windows, which is really a pity. SVN, Bazaar and others don't have this problem. They are better suited for such scenarios.

There is an HG extension (fixutf8) that forces all file operations to be encoded in UTF8 to solve this problem. But it is not official and not frequently maintained (it does not work with the latest HG versions). Maybe if we could convince the HG developers to make it official and maintain it with every HG release... it really works! (I have been using it with a repository shared between English, Spanish and Japanese developers, all using "special" chars.)

So I guess RhodeCode is internally using some sort of encoding (maybe UTF8) to handle this problem, while native HG is using the native OS encoding (in this case, under Windows, a special UTF16 implementation). Maybe you can implement some sort of workaround, at least to display file contents. Here, in non-English-speaking countries, we are "sadly used" to reading file names with garbled content, which is really frustrating in 2012... ASCII was designed back in 1963, and you see, 50 years later this problem is still not fully solved in all computer systems, as it should have been since Unicode was designed.
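One way to picture such a workaround is to keep the raw filename bytes recoverable while still producing a readable label. The sketch below is hypothetical and is not how RhodeCode works: the function names `to_display`, `to_path` and `printable` are invented for illustration, it relies on Python 3's `surrogateescape` error handler, and the cp1252 fallback is an assumption about the client.

```python
def to_display(raw: bytes) -> str:
    """Decode as UTF-8, keeping undecodable bytes as lone surrogates (lossless)."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_path(display: str) -> bytes:
    """Recover the exact original bytes from a string made by to_display()."""
    return display.encode("utf-8", errors="surrogateescape")


def printable(display: str, fallback: str = "cp1252") -> str:
    """Build a human-readable label; the fallback encoding is only a guess."""
    raw = to_path(display)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace")


# A name pushed from a cp1252 Windows client (assumption) and one stored as UTF-8.
windows_raw = "t\u00edld\u00e9s.txt".encode("cp1252")
utf8_raw = "t\u00edld\u00e9s.txt".encode("utf-8")

for raw in (windows_raw, utf8_raw):
    key = to_display(raw)
    # The lookup key round-trips to the stored bytes in both cases.
    print(to_path(key) == raw, printable(key))
```

The point of the design is that the string used for lookups is never lossy; only the label shown to the user depends on a guessed encoding.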

Comments (4)

  1. Marcin Kuzminski repo owner

    I use UTF8 all the time on my system, and don't have any trouble with encoding. RhodeCode also uses utf8 internally by default. The default encoding can be changed in the .ini file. One thing that might help with mixed encodings is to install the chardet library, which will try to detect the encoding if utf8 (or the other one given in the .ini) fails.

    I guess you should apply a unified encoding for all the clients; I don't see a good solution for this as there are just too many possibilities.
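The decode order described in this comment (configured default first, then detection) can be sketched roughly as follows. This is not RhodeCode's actual code: `decode_name` and its `default_encoding` parameter are hypothetical, and chardet is treated as optional so the sketch still runs without it.

```python
def decode_name(raw, default_encoding="utf8"):
    """Decode bytes with the configured default, then fall back to detection."""
    try:
        return raw.decode(default_encoding)
    except UnicodeDecodeError:
        pass

    try:
        import chardet  # optional third-party library
    except ImportError:
        chardet = None

    if chardet is not None:
        # chardet.detect returns a dict; "encoding" may be None for short inputs.
        guess = chardet.detect(raw).get("encoding")
        if guess:
            try:
                return raw.decode(guess)
            except (UnicodeDecodeError, LookupError):
                pass

    # Last resort: keep what decodes and mark the rest with U+FFFD.
    return raw.decode(default_encoding, errors="replace")


print(decode_name("t\u00edld\u00e9s.txt".encode("utf8")))
print(decode_name(b"t\xedld\xe9s.txt"))
```

Detection on something as short as a filename is a statistical guess, so the second result may differ between machines and chardet versions.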

  2. ismdiego reporter

    Hi Marcin,

    The problem is not with the content of the file, which is fine. The problem exists only with the filename encoding. Unfortunately, in Windows, the user cannot specify the encoding used for that (as in Linux). For this reason, extensions like "fixutf8" exist (it forces the filename encoding to UTF8).

    The Hg (and Git) developers decided (by design) not to encode filenames when they enter the repository; in fact, they use the "bytes" the filesystem returns for them. This way the source control system is fully transparent. Unfortunately, transparent does not mean better in this case (there are other systems, like SVN and Bazaar, that work OK in these cases with Windows/Linux).

    So, I am asking you only for a workaround. I mean, I suppose that RC uses UTF8 for filename encoding when it really should not... Maybe it should also use the "transparent" approach that Hg has.

    Also, another related problem is that when a filename is encoded by Hg (and so uses the native filesystem encoding, which is not UTF8 in Windows) and later pushed into a repository managed by RC, you cannot see the file content in RhodeCode when clicking on it, because it does not find the filename (it is presumably trying to access it with the displayed filename, which was incorrectly displayed). I can send you some screenshots or a video to better illustrate this problem if you want.

    In fact, I can help you with this problem as much as I can (remember I am not a Python dev).

    Thanks for your work

  3. Marcin Kuzminski repo owner

    Yes, exactly, that's why RhodeCode has issues with that: it assumes everything is utf8. As I said, this can be changed in the .ini file, and further extended by installing chardet. It's impossible to use the "transparent" approach because most things need to be converted to unicode, and RhodeCode needs to know which encoding the files should be converted from.
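The need to know the source encoding can be illustrated with a few lines of Python: the same filename bytes are valid in several single-byte code pages and yield a different name in each. This is only a sketch; the three code pages are chosen for illustration and are not taken from the report.

```python
# The same stored bytes, interpreted under different single-byte code pages.
raw = b"t\xedld\xe9s.txt"

decoded = {enc: raw.decode(enc) for enc in ("cp1252", "cp1251", "cp437")}

for enc, text in decoded.items():
    print(enc, text)
```

None of the three decodes raises an error, so nothing in the bytes themselves tells the server which reading the author intended.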

    I guess you need to use a unified encoding; as the Mercurial principles say, "non-ASCII filenames are not reliably portable between systems in general"

    I think it's impossible for RhodeCode to handle that
