Weird filenames with Unicode chars in filename

Issue #216 wontfix
Dominik Guder
created an issue

Hi,

I get a weird clone if there are Unicode chars in file names. A German "weiß.txt" (color white) will become "weiÃŸ.txt" in the file system and 'wei~c3~9f.txt.i' in the .hg\store\data folder.

If I add a Unicode-named file to hg, it ends up as wei~df.added.in.hg.txt.i in .hg/store/data and cannot be pushed to svn. The error is '"Path '/trunk/wei\xdf.added.in.hg.txt' is not in UTF-8", 160005'
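
The two store names above can be reproduced from the bytes of the filename. The following is a minimal sketch, not Mercurial's actual code: `escape_store_name` is a hypothetical helper that mimics only the `~XX` escaping of high bytes (the real store encoding also handles uppercase letters and reserved characters), and cp1252 is assumed to be the Windows code page in use.

```python
# Hypothetical helper, for illustration only. It mimics just one part of
# Mercurial's store filename escaping: bytes from 0x7e upwards become ~XX.
def escape_store_name(raw: bytes) -> str:
    parts = []
    for b in raw:  # iterating over bytes yields integers in Python 3
        if b >= 0x7E:
            parts.append("~%02x" % b)
        else:
            parts.append(chr(b))
    return "".join(parts)


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


name = "weiß.txt"
utf8_bytes = name.encode("utf-8")    # path bytes as Subversion stores them
ansi_bytes = name.encode("cp1252")   # assumed Windows code page bytes

print(escape_store_name(utf8_bytes) + ".i")  # wei~c3~9f.txt.i
print(escape_store_name(ansi_bytes) + ".i")  # wei~df.txt.i

# The code page bytes are not valid UTF-8, which matches the push error.
print(is_valid_utf8(utf8_bytes), is_valid_utf8(ansi_bytes))  # True False
```

The file cloned from svn carries UTF-8 bytes (c3 9f), while the file added locally carries the single code page byte df, which Subversion rejects as "not in UTF-8".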

I'm using TortoiseHg 1.1.4 with hg 1.6.4 and the current hgsubversion on Windows XP Pro 32bit. Maybe you can give me a hint where to look, since I'm not able to debug hgsubversion on Windows with WinPdb 1.4.8.

Thanks so far Dominik

Comments (7)

  1. Dan Villiom Podlaski Christiansen

    Hi Dominik,

    The cause of this issue is a discrepancy between Mercurial and Subversion: where Subversion considers paths to be UTF-8, Mercurial considers them to be raw byte strings with no attached encoding. I don't agree with that decision, but it's unlikely to change any time soon. In my opinion, fixing this is outside the scope of hgsubversion; we correctly converted the repository into Mercurial.

    Could you please try the FixUTF8 extension? If it doesn't work for you, we may have to allow re-encoding paths during conversion. I'd really like to avoid opening that can of worms…

    http://mercurial.selenic.com/wiki/FixUtf8Extension

  2. Dominik Guder reporter

    Hi Dan,

    thanks for your help. I checked this with FixUtf8Extension and now it is working. I can clone and add/push a file containing German chars to my repository.

    From FixUtf8Extension: "Python 2.x and Mercurial call the non-Unicode functions". I don't really understand why this is done since at least Win2k was unicode capable.

    Nevertheless, you might close this issue (or should I?).

    So far Dominik

  3. Dan Villiom Podlaski Christiansen

    I believe the use of the non-Unicode APIs by Python 2.x is in part caused by legacy (they were originally written against pre-NT versions of Windows, which didn't have the Unicode APIs) and in part caused by the fact that Python 2.x itself doesn't treat filenames as Unicode, but as raw byte strings. As you may know, Python 3.x fixes this. As you may also know, Mercurial doesn't run on Python 3.x. There has been some effort to fix this, but it's uncertain if a finished Python 3.x port will ever surface.

    I'm glad the fixutf8 extension solved this for you; as a result, I'm marking this as WONTFIX. (Which I tried to do before, but I stumbled on bugs in BitBucket.)

  4. Dominik P

    This issue just hit me hard. Using an older SVN server (1.4) I was able to actually commit a filename encoded with ISO8859 (cp850) to subversion and now the clients are no longer able to fetch the svn log from the server because it contains an invalid encoding that the client interprets as broken XML. See this:

    http://mail-archives.apache.org/mod_mbox/subversion-users/201507.mbox/%3C000301d0c93f%2455d0e320%240172a960%24%40apache.org%3E

    svn log
    svn: E130003: The REPORT response contains invalid XML (200 OK)

    The repository at the server is now effectively broken and we will have to fix it with svndumpfilter... ouch. That being said, Mercurial does not care about filename encoding but Subversion does, and it expects filenames as UTF-8. Therefore hgsubversion must convert filenames to UTF-8, since that is the only encoding Subversion can work with!

    BTW, FixUtf8Extension is no longer an option because it is incompatible with the latest mercurial / tortoisehg.

  5. MJ

    Same here :( Working in Germany with the system language set to English, it's a common problem to use äöüß etc. Interestingly, working on a German system (PC x64) it's OK. Working on an English system (PC x64), not so good :( Now I can't push to the remote server... even more worrying. So if anyone has an x64 solution... I'm all ears...

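
The re-encoding of paths discussed in the comments above could look roughly like the sketch below. This is an illustration under stated assumptions, not hgsubversion's implementation: `to_utf8_path` and its `fallback_encoding` parameter are hypothetical names, and cp1252 is only a guess at the legacy encoding. The guess is the weak point, because raw bytes carry no record of the encoding they were written in.

```python
# Hypothetical helper, for illustration only; not part of hgsubversion.
def to_utf8_path(raw: bytes, fallback_encoding: str = "cp1252") -> bytes:
    """Return the path as UTF-8 bytes.

    Bytes that already decode as UTF-8 are returned unchanged. Anything else
    is assumed (an assumption, not a fact) to be in fallback_encoding.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # May still raise UnicodeDecodeError if the bytes are not valid in
        # the fallback encoding either; the caller has to handle that case.
        return raw.decode(fallback_encoding).encode("utf-8")


legacy = b"/trunk/wei\xdf.added.in.hg.txt"   # path bytes from the error message
print(to_utf8_path(legacy))                  # b'/trunk/wei\xc3\x9f.added.in.hg.txt'

already_utf8 = "/trunk/weiß.txt".encode("utf-8")
print(to_utf8_path(already_utf8) == already_utf8)  # True
```

A byte string from another code page (for example cp850) would be converted to the wrong characters without any error, so a real fix would need the encoding to be configured explicitly rather than guessed.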