Issue #672 open

shell extension unicode support

Vsevolod Parfenov
created an issue

Is it possible to add Unicode support to the TortoiseHg shell extension, so that the fixutf8 extension can handle the MBCS encoding of filenames correctly?

Please see issue http://bitbucket.org/stefanrusek/hg-fixutf8/issue/15/wrong-changes-detection-in-tortoisehg for details.

This is the last thing I need to be comfortable using TortoiseHg on repos with non-ASCII file and folder names.

Comments (34)

  1. Stefan Rusek

    For a while there was a dispatch bug in how fixutf8 interacted with hgtk. I fixed that earlier this week, and hgtk now works beautifully with Unicode, but the shell extension doesn't handle Unicode at all. If the shell extension were updated to handle Unicode properly, it wouldn't interfere with hg/hgtk operations without fixutf8.

  2. Stefan Rusek

    I would be interested in helping out with this. I wrote the cutehg shell extension which works *very* similarly to the thg one, and it is unicode everywhere possible. Ben Pollack has also expressed some interest/willingness to help on this front.

  3. Adrian Buehlmann

    One good reason for not "supporting unicode" is probably that Mercurial doesn't support it either.

    Or someone has to explain to me what the encoding of the filenames in .hg/dirstate should be (in a portable way).

    It's hardly a problem of shell extension programming. Adding wide char support would be rather trivial. But I'm not interested.

  4. Stefan Rusek

    hg uses local encoding or utf8 (when the fixutf8 extension is used) for the dirstate file. One way would be to ask hg what the value of util._encoding is, but the fixutf8 extension could easily add something to the dirstate file to tell the dirstate parser to use utf8.
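
A minimal sketch of why the reader of .hg/dirstate has to know which encoding was used. It only demonstrates the byte-level ambiguity and does not parse a real dirstate file. The helper names and the choice of cp1251 as the "local" encoding are assumptions for illustration.

```python
def decode_name(raw, encoding):
    """Decode a filename that was stored as raw bytes in the given encoding."""
    return raw.decode(encoding)


def try_decode(raw, encoding):
    """Return the decoded name, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


name = "файл.txt"

# The same filename as it would be stored with and without a UTF-8 policy.
# cp1251 stands in for a "local" Windows codepage here (an assumption).
as_utf8 = name.encode("utf-8")
as_local = name.encode("cp1251")

# Decoding with the matching encoding round-trips.
ok_utf8 = decode_name(as_utf8, "utf-8")
ok_local = decode_name(as_local, "cp1251")

# Decoding UTF-8 bytes with the local codepage silently yields a wrong name.
mojibake = decode_name(as_utf8, "cp1251")

# Decoding local-codepage bytes as UTF-8 fails outright.
failed = try_decode(as_local, "utf-8")
```

The silent-wrong-name case is the dangerous one: the decoded name no longer matches any file on disk.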

  5. Adrian Buehlmann

    Stefan:

    In case you want to jump into this:

    You could add the encoding info at the beginning of the file .hg/thgstatus. This file has the advantage that it is under full control of TortoiseHg alone and read anyway by the shell extension.

    thgstatus is written by thgtaskbar.exe, which has the full python+mercurial plethora loaded into its process space (and thus could easily access mercurial's util._encoding).

    The shell extension could then read what the encoding is from .hg/thgstatus and interpret .hg/dirstate accordingly.

    See http://bitbucket.org/tortoisehg/stable/src/17fc2562d687/win32/shellext/DirectoryStatus.cpp#cl-58 for the somewhat similar @@noicons configuration mechanism, which is used to turn overlay icons off in a specific repo.
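
The idea of storing the encoding at the beginning of .hg/thgstatus could look roughly like the sketch below. The `@@encoding=` marker and the one-line header layout are hypothetical, loosely modeled on the @@noicons mechanism mentioned above. The real thgstatus format is not shown in this thread, so the body is treated as opaque bytes.

```python
import os
import tempfile

# Hypothetical header marker; not taken from the TortoiseHg sources.
ENCODING_MARKER = b"@@encoding="


def write_status(path, encoding, body):
    """Write a status file whose first line names the filename encoding."""
    with open(path, "wb") as f:
        f.write(ENCODING_MARKER + encoding.encode("ascii") + b"\n")
        f.write(body)


def read_status(path, default=None):
    """Return (encoding, body). Files without the header yield the default."""
    with open(path, "rb") as f:
        first = f.readline()
        if first.startswith(ENCODING_MARKER):
            encoding = first[len(ENCODING_MARKER):].strip().decode("ascii")
            return encoding, f.read()
        # No header: treat the first line as part of the body.
        return default, first + f.read()


with tempfile.TemporaryDirectory() as tmp:
    with_header = os.path.join(tmp, "thgstatus")
    write_status(with_header, "utf-8", b"opaque status body\n")
    header_result = read_status(with_header)

    without_header = os.path.join(tmp, "thgstatus.old")
    with open(without_header, "wb") as f:
        f.write(b"opaque status body\n")
    legacy_result = read_status(without_header, default="local")
```

Falling back to a default when the header is missing keeps older status files readable, which matters if the writer and the reader are upgraded at different times.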

  6. Adrian Buehlmann

    Is 4207de373119 the complete solution? Only two patches?

    I see changes to DirectoryStatus.cpp for the shellext.

    No other changes needed?

    No changes needed for Directory.cpp or dirstate.cpp? (these treat mercurial's .hg/dirstate file)

  7. Stefan Rusek

    As is, this patch doesn't do anything other than pass the encoding along from hg and use it where it is unpacked. The most sensible way to extend it to Directory.cpp and dirstate.cpp would be to switch to std::wstring just about everywhere. I will start on this.

  8. Adrian Buehlmann

    In reply to Stefan:

    Ok. Sounds good.

    I somehow misinterpreted from your post that you were finished :)

    On another note, I noticed that you pushed to the stable branch. I'm not sure if this feature will make it into any 0.9 bugfix release. Maybe it should go into default instead? (targeting next major: 0.10)

    Or what do you think, Steve? (minor branch woes again... :)

  9. Anonymous

    Adrian: At first I was going to use std::wstring for all filenames, but given how pervasive they are, it seemed a better call to switch the project to Unicode and use TCHAR and std::tstring (which I declared) almost everywhere, with char and wchar_t only where it made sense to use one or the other explicitly. This is all pretty straightforward to do.

    I wanted to ping you on how you thought it best to handle propagating the codepage around.

    When calling hgtk, we can just use Unicode, except when myFiles is not empty: the files are then passed in a tempfile, so the codepage has to be passed in to CShellExt::DoHgtk(). This is also pretty straightforward.

    When it uses a named pipe to communicate with the rpc server, things get tricky. fixutf8 can be enabled globally or on a per-repo basis. While most people use fixutf8 globally, it does make sense in some situations to enable or disable it for a single repo, but the rpc server uses the same hg state for all repos. This causes all kinds of problems. The RPC server could be modified to use Unicode and to communicate with hg through the subprocess module, but I wanted to hear your thoughts before going down that path.

  12. Adrian Buehlmann

    No problem.

    I suggest you post your problems (and suggested solutions) to Tortoisehg-develop@lists.sourceforge.net, so that Sune and Henrik might read them as well (I suspect they don't read tortoisehg-issues@lists.sourceforge.net or this issue here, and they might have some ideas/suggestions).

    As for TCHAR, I'm generally not much of a fan of using MS types like TCHAR and getting locked into more of a dependency on Microsoft types than needed, but I wouldn't put much weight on that opinion of mine. My goal is to step back from the shell extension hacking anyway, so if you have good reasons to use TCHAR then I trust you to do the right thing. If the only reason is to avoid changing std::string to std::wstring in lots of places in the code, then I would personally prefer making that change instead of introducing TCHAR. But again, I don't want to be a naysayer on this.

    The rpc server sounds indeed tricky. I haven't yet understood the part "the rpc server uses the same hg state for all repos". thgtaskbar.py basically calls shlib.update_thgstatus, giving the path of the repo as a parameter. update_thgstatus (in shlib.py) then creates a repo object for each path ("hg.repository(ui, root)", where root is the path). So I fail to see why the rpc server "uses the same hg state for all repos"?

    To be blunt: I have no idea how i18n paths (in various encodings?) should be fed into that mercurial API ('hg.repository').

    It seems to me that the shell extension and the rpc server need to agree on the same protocol about how to encode paths (independent of any global or per repo settings). So the paths sent over the pipe to the rpc server probably need to be in a standard encoding (using wide chars as well?).

    As I understand it, a python unicode string object only exists as an object of an opaque type in python's runtime memory. To send it over the wire (a pipe), one must choose an encoding that transforms the object into a sequence of bytes. The communication channel over the pipe probably needs to use a single encoding.
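
A minimal sketch of what "a single encoding on the pipe" could mean in practice. The actual pipe protocol between the shell extension and the rpc server is not shown in this thread, so the length-prefixed UTF-8 framing and the function names are assumptions. An in-memory buffer stands in for the named pipe.

```python
import io
import struct

# Both ends agree on this, independent of any global or per-repo hg setting.
WIRE_ENCODING = "utf-8"


def send_path(stream, path):
    """Encode a path with the agreed wire encoding and write it length-prefixed."""
    payload = path.encode(WIRE_ENCODING)
    stream.write(struct.pack("<I", len(payload)))  # 4-byte little-endian length
    stream.write(payload)


def recv_path(stream):
    """Read one length-prefixed path and decode it with the agreed encoding."""
    (length,) = struct.unpack("<I", stream.read(4))
    return stream.read(length).decode(WIRE_ENCODING)


# io.BytesIO stands in for the pipe in this sketch.
pipe = io.BytesIO()
sent = ["C:\\repos\\проект", "C:\\repos\\日本語"]
for p in sent:
    send_path(pipe, p)

pipe.seek(0)
received = [recv_path(pipe) for _ in sent]
```

With a fixed wire encoding, the receiving side is the only place that has to translate a path into whatever encoding a given repo expects.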

    But I admit I'm not that much interested in i18n things in combination with mercurial. So there might be some stupid ideas and opinions in this post.

    (probably time to move on to the devel mailing list with this :)

  13. Toshi MARUYAMA

    I applied the logic that Steve posted to the Google group (linked below) to my win32 shellext.

    http://groups.google.com/group/thg-dev/browse_thread/thread/006c258628e3fce8/09933000c510fa60?show_docid=09933000c510fa60

    1. thg.exe with --listfileutf8 option.
    2. hgtk.exe with --listfile option.
    3. thg.cmd with --listfileutf8 option.
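
As a rough illustration of what a UTF-8 list file buys, the sketch below writes the selected paths to a temporary file in UTF-8 and reads them back. Only the option names come from this thread. The one-path-per-line layout and the helper names are assumptions, not the actual thg implementation.

```python
import os
import tempfile


def write_list_file(paths):
    """Write one path per line, always as UTF-8, and return the file name."""
    fd, name = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as f:
        for p in paths:
            f.write(p + "\n")
    return name


def read_list_file(name):
    """Read the paths back, decoding as UTF-8 regardless of the local codepage."""
    with open(name, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


selected = ["docs/日本語.txt", "src/файл.py"]
list_path = write_list_file(selected)
try:
    round_tripped = read_list_file(list_path)
finally:
    os.remove(list_path)
```

Because both sides fix the encoding to UTF-8, names from different scripts survive in the same list, which a single local codepage could not guarantee.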

    I pushed normal changesets and MQ.

    Normal changeset: http://bitbucket.org/marutosi/tortoisehg/changeset/131f3d5caac3

    MQ: http://bitbucket.org/marutosi/tortoisehg-shellext-mq/changeset/6cd5adcccf94

  14. Toshi MARUYAMA

    I uploaded Windows shellext dlls (ThgShellx86.dll and ThgShellx64.dll).

    I don't have 64-bit Windows at the moment, so I can't confirm that the 64-bit DLL runs. You can replace the existing DLL with the new one as described at the following link.
