Conversions to support untranslated wide character paths on windows.

#29 Declined
Repository
cx_Freeze_old
Branch
default
Repository
cx_Freeze
Branch
default
Author
  1. Steven Velez
Reviewers
Description

Please review changes to help improve support for running frozen applications in paths with characters requiring the unicode character set on windows.

Comments (30)

  1. Steven Velez author

    Ping/Bump to see if Anthony or someone has a chance to consider this request now that the latest CX_FREEZE has been released. If I need to create a new request against a more current commit in the source repo, please let me know.

    1. Thomas Kluyver

      There are no conflicts, at least, though that doesn't necessarily mean the merged code would still work correctly. I don't know when Anthony might get a chance to review this, though. I have a bit more time, but still not much, and it would take me much longer to review, because I don't do much in C.

  2. Gergely Erdélyi

    Just to add some anecdotal evidence, we have been using this patch on our cx_Freeze for a desktop app with 32-bit Python 3.3 on Windows. So far we have not been able to break it with any unicode path we have tried. Tested extensively on Windows XP SP3, Windows 7 and Windows 8.1.

  3. Anthony Tuininga repo owner

    Has anyone tested any of this on anything other than Windows? Without looking at in depth it is possible that this might have messed something up on that front. If someone has tested it on Linux and Mac that would help in getting this merged.

    1. Steven Velez author

      I regularly develop and support OSX, so I am pretty sure that I tested on 10.8 and have been using the code actively on 10.8 and 10.9, but I will make sure as soon as I can get back to the source.

      I have not tried on linux.

    1. Steven Velez author

      Just validated that I have been using these changes on OSX in my product for while now. I also re-checked the intial commit message and noticed that i did test a set of samples on Ubuntu also, though additional testing is welcome. Don't forget the follow up commit which solves a memory management issue.

      Thanks!

    1. Thomas Kluyver

      4.3.3 was released fairly quickly to get things working on Python 3.4. Anthony is planning a bigger more disruptive release, probably to be called cx_Freeze 5.

    2. Anthony Tuininga repo owner

      Yes, there is a reason and it is essentially as Thomas states. Your patch is quite invasive (by necessity!) and I would like to look it over in greater depth before including it. The changes I am now planning are also quite invasive so it seemed to me to be a perfect time to include your patch as well -- in some fashion. The changes I am planning also change the base executable C source code so I'll be testing them anyway. Hopefully I will have something to test out in the next few weeks.

      1. Hannu Valtonen

        Any progress on this? I'm personally in need of getting a combo of cxfreeze python3.4 and this patch (or it's equivalent) soonish. I was wondering that since the patch has conflicts right now whether to just fix the conflicts or wait for your unicode changes.

        1. Anthony Tuininga repo owner

          The changes I have included should cover the changes made by this patch. If someone could verify that fact for me, that would be appreciated. If my changes actually broke this, however, please advise. At the least any subsequent patch should be dramatically simpler and therefore more likely to be quickly merged!

          1. Steven Velez author

            Anthony,

            A simple test against changeset 90ada5fef68a indicates that the issue this patch was intended to address still exits. Should I be testing against a branch?

            To reproduce the problem simply freeze a program on windows (GUI executable, though I am pretty sure the problem should exist with the console executable as well) and move it to a folder with non-latin characters. I like to name my folders "日本語". Without my patch, I get a dialog with the title "cx_Freeze Fatal Error", and the content "cannot get zipimporter instance".

            Is there something I can do to help get the patch in the main development stream? I can assure you it works fine on at least two platforms (Windows and OS X) and I have done basic testing on linux.

            I can take a stab at resolving the existing conflicts, but would rather not spend the effort if the likelyhood of acceptance in to the main stream is low, or if you would like extensive rework anyways.

            Thanks

            1. Anthony Tuininga repo owner

              Hi Steven,

              I just tried using the folder name provided above on my Linux box and the code worked just fine. I'll have to try on Windows tomorrow.

              The original patch you provided has been mostly superseded, I believe. Many of the changes you made have already been integrated, just in a different way. I had hoped that I got everything but apparently I missed something. :-) At the least any patch you come up with now should be substantially smaller and therefore much more likely to be merged quickly. If you find that is not the case, please advise!

            2. Anthony Tuininga repo owner

              Hi Steven,

              I just tried this on Windows and it doesn't work there. The primary reason is because GetModuleFileName() doesn't return the data in whatever encoding you happen to be using but replaces the characters with "?" instead. So it would seem that you have to use GetModuleFileNameW() to explicitly return unicode -- but then you have to figure out what encoding you need to use when conversing with Python (for Python 2.x) and that just seems to me to defer the issue instead of solving it. And I tried putting Python 2.7 in a directory like the one named above....and it doesn't work. Python 3.x works fine. So it looks to be a Windows specific issue with no obvious solution. Thoughts?

              1. Steven Velez author

                TL;DR version: Fixing frozen programs running against python 3 shouldn’t be hampered by the lack of a solution against python 2.7

                Elaboration: Yes, I agree, the problem lies in trying to use the windows ascii apis (of which GetModuleFileName is just one). Incidentally, GetModuleFileName is just a macro which is defined to either GetModuleFileNameA or GetModuleFileNameW based on whether _UNICODE is defined. Therefore, to avoid confusion, further references will refer to the real function names.

                So anyway, NTFS stores file names as wide characters, UTF-16 if I am not mistaken, and when using the ascii apis they have to be converted to multi-byte characters. I'll be honest and say that in all of the years that I have been doing international windows programs, that has meant using the UNICODE apis, and a lot of what I am about to say is partially gained knowledge that I have never had to exercise much, and forgive me if I am already stating the known, but... When the character conversion happens from UTF-16 to mbcs, some codepage defining how to encode characters in that more limited character space needs to be selected. The code page is either set explicitly by the program or defaults to one configured on the system (either explicitly by the user or say when installing the Korean distribution of windows). When testing on a system like mine, the default code page does not support characters like those in the test folder name, so there is no choice but to turn them in to garbage.... so it is up to the user to configure a code page that will support the string, or the programmer to query or otherwise detect what the proper code page to run in would be. No one, neither user, nor developer wants to do this, right? And even if the effort was expended, what if one must deal with strings/paths that span code-pages... where one part of the string requires one code-page to be correctly interpreted and the other requires another code-page?

                Dealing with this scenario is why Unicode was developed and invented. GetModuleFileNameW is more robust than it's ascii counterpart and should be used whenever possible. Python 3 did the right thing in changing the internal string representation from ascii (or something) to unicode, and it is my opinion that we should leave the conversion question out of the equation whenever possible... which is to say on modern Windows operating systems that are running against Python 3, use GetModuleFileNameW and all other unicode APIS.

                But what about Python 2.7 you ask? Well, it is possible to use the ascii apis when compiling in this configuration, and in this case, the conversion is again averted. The cx_freeze base does not need to do a conversion in this case, because the apis are not returning wide characters. The characters received are the characters passed along, and if the locale is correctly configured on the system, it should work as well as ever it has. I think that it is understood that without the Unicode support that is largely missing from python 2.7, there isn't really a universal, fool-proof solution anyway. At least on Windows. I believe that the default and perhaps only possible code page on OS X is utf-8, so that is covered in spite of the use of chars and the mbctows conversion is well-defined when necessary... I would expect that modern linux distributions work similarly (but this is more of a hope than anything else).

                Is there a reason to prevent fixing frozen programs running against Python 3+ just because frozen programs running against Python 2.7 can't be fixed just as completely on Windows?

                1. Anthony Tuininga repo owner

                  Short answer: no reason to prevent Python 3+ programs from working properly. Somewhat longer answer and a question: perhaps we can use the unicode APIs (Python 3 style code) across the board, even in Python 2.x. That might be a bit of an unexpected result but this is going to be cx_Freeze 5.x and for most people they will never even notice since few people play with the initscripts and the base executables. Thoughts?

                  1. Steven Velez author

                    I think that can be done, though I am not quite sure what the benefit is. The conditionalization needs to happen in either case in order to skip the conversion on one platform and not the other… this is what cxtext.h in the pull request attempts to do… and puttiing python 2.7 on windows in the unicode api condition would require additional contortion when deciding whether the conversions should be functional or no-ops in that case. Still it’s manageable.

                    I would propose that whatever fix ends up going in, it apply not only to GetModuleFilenameX, but also to the argv parameter on main (windows has a unicode version of main: wmain) since those values are totally unpredictable.

  4. Steven Velez author

    Hi @Anthony Tuininga.... I am not sure how you would like me to proceed on this. I have been shipping from a fork of CX freeze for about a year now and would like to get on an official release again (don't want to miss the 5.x boat). I am willing to restructure the pull request to meet your requirements, but I think I need a little more clarity.

    Thanks.

    1. Anthony Tuininga repo owner

      Hi Steven, after a VERY long hiatus I am finally getting back to this with a lot more time available to actually address things properly! I believe a good portion of what you wanted done is already done. I am fine with a fix that works for 3.x and doesn't make 2.7 any worse -- but would obviously prefer one that fixes both cases, of course! I also need something that is not Windows specific. Let me know if you have any questions.

  5. Steven Velez author

    Hi Anthony.. thanks so much for keeping this alive.

    With regards to fixing python 2.7... I don't think that is possible because the fix hinges on leveraging the wide-character/Unicode string support added in Python 3. Before that, you have to convert Wide character strings provided by windows to the narrow strings supported by python (even if implicitly), the conversion should already take in to account converting to the user's active code page. The problem only rears it's head when a code point is encountered that is not included in the code page, and while someone could try to find a code page that supports the characters in question, that would involve altering the active code page for some period of time (I am not sure it would have to be for the life of the application or just while the application needs to load items from the file system), and while this may be manageable, there is no guarantee that there would be any code page available which supports all the code points present in the string to be converted.

    The fix itself is windows-specific because the problem is windows specific... though it was checked to work with the other platforms. On those other platforms, one would handle converting the operating system-supplied narrow strings to the Python wide strings in Python 3, and leave them alone in python 2... just as before the fix. I think this works well on Mac because the incoming strings are UTF-8, so the conversion should work in any direction. I am not as familiar with Linux... would it be dependent on the selected filesystem more than the OS? Neither of those support anything like int main(int argc, wchar_t** argv) do they?

    Anyway, I am not officially on the project which uses CX_Freeze any more, so I will have to find some personal time to bring this PR back in line with the tip of CX_Freeze development or to convince management to allocate resources for this as a tech-debt project... though I have never had much success with that sort of thing.

    I did happen to notice that pull request #87 addresses part of the issue that this request does, though in a much more targeted manner. I took some time to evaluate whether the module loading code could likewise be made less invasive than what I propose here, but it seems like the necessary change has long tendrils, and I would hate to dirty up the code with copious ifdefs... so I wonder what has changed such that "a good portion of what you wanted done" has already been done. The ascii char version of GetModuleFileName() still looks like it is being used.

    I'll see if I can test the latest builds some time, but if the issue is still outstanding, I am not sure when I will be able to even reconcile the conflicts that now exist here.

  6. Anthony Tuininga repo owner

    I have just merged pull request #87 which does appear at least part of what you were trying to accomplish. I do see the call to GetModuleFileName() and am not sure if that will continue to cause problems. Do you have a case that demonstrates the problem? Is it simply by putting the frozen executable in a directory that contains something other than ASCII characters? If you have a particular test case you are using it would be helpful to include it here. I know this issue has been dragging on for a long time -- I am no longer working on a project that uses cx_Freeze either so my time is also more limited -- so I completely understand about not being able to resolve the conflicts. If you can at least provide a test case that would be appreciated.

  7. Steven Velez author

    Yes, the test case should be pretty much unchanged from the one posted here.

    FYI, my team updated to a more current CX_FREEZE in order to support python 3.5, and did apply some of these changes to that branch. But upon reading them, I am pretty sure that they way they were ported, it will not work with python 2.7 on windows.

  8. Anthony Tuininga repo owner

    Thanks. I'll be taking a closer look at this next week. Could you send me the patches you applied to support Python 3.5? If they don't inadvertently break Python 2.7 I'll definitely consider including them. Feel free to send them directly to my e-mail address instead of including them here.

  9. Steven Velez author

    Yeah, @Anthony Tuininga those changes look like they will do the trick. Thanks. I will see if someone can test them out for us.

    One comment: since it is assumed that Windows is always going to use Unicode, you have deliberately dropped support for the old Win9x series of OSes? Was that even supported before? :)