Issue #1248 open

Can a URL use Unicode?

Ois Lone
created an issue

Python 3 allows Unicode function names, so I tried the following:


import cherrypy
from mako.template import Template
class HelloWorld:
    @cherrypy.expose
    def index( self ):
        s = Template( filename="login.html" ).render()
        return s

    @cherrypy.expose
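    def 登入( self, UserName=None, PassW=None ):  # 登入 means "log in"
        if UserName == "" or PassW == "" :
            # Render the template so a string, not a Template object, is returned
            s = Template( "登入錯誤!" ).render()  # "Login error!"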
        else:
            s = Template( "你好 ! ${name}  密碼是 ${passw}" ).render( name=UserName, passw=PassW )

        return s

cherrypy.quickstart( HelloWorld() )

Mako template (login.html):

## -*- coding: utf-8 -*-

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta
http-equiv="Content-Type"
content="text/html; charset=UTF-8"
>
<title>Login</title>
</head>
<body>
<form action="登入" >
<fieldset>
<legend>登錄 </legend>
使用者名稱:<input name="UserName"  size="12" value="">
密碼:<input name="PassW"  size="12" type="password" value="">
<input type="submit" value="登入">
<input type="reset" value="清除">
</fieldset>
</form>
</body>
</html>

The form's action in the Mako template is "登入", which is Unicode, so I expected submitting the form to call the CherryPy method 登入(). But when I press submit, this message is shown:


404 Not Found

The path '/登入' was not found.

Traceback (most recent call last):
  File "/usr/lib/python3.3/site-packages/cherrypy/_cprequest.py", line 656, in respond
    response.body = self.handler()
  File "/usr/lib/python3.3/site-packages/cherrypy/lib/encoding.py", line 188, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/usr/lib/python3.3/site-packages/cherrypy/_cperror.py", line 386, in __call__
    raise self
cherrypy._cperror.NotFound: (404, "The path '/ç\x99»å\x85¥' was not found.")

The shell shows these log messages:

127.0.0.1 - - [28/Apr/2013:02:26:23] "GET / HTTP/1.1" 200 634 "" "Mozilla/5.0 (X11; Linux x86_64; rv:20.0) Gecko/20100101 Firefox/20.0"
127.0.0.1 - - [28/Apr/2013:02:26:23] "GET /favicon.ico HTTP/1.1" 200 1406 "" "Mozilla/5.0 (X11; Linux x86_64; rv:20.0) Gecko/20100101 Firefox/20.0"
127.0.0.1 - - [28/Apr/2013:02:26:30] "GET /\xc3\xa7\xc2\x99\xc2\xbb\xc3\xa5\xc2\x85\xc2\xa5?UserName=aaa&PassW=bbb HTTP/1.1" 404 1204 "http://localhost:8080/" "Mozilla/5.0 (X11; Linux x86_64; rv:20.0) Gecko/20100101 Firefox/20.0"

So can a URL use Unicode, and can it call a function with a Unicode name?

Comments (6)

  1. tsufeki
    >>> '登入'.encode('utf-8').decode('latin-1')
    'ç\x99»å\x85¥'
    

    WSGI under Python 3 requires unicode strings masquerading as latin-1. I think the problem here lies in doing this masquerading twice. This line in _cptree.py:

    environ['PATH_INFO'] = path[len(sn.rstrip("/")):].encode('utf-8').decode('ISO-8859-1')
    

    does the masquerading even though path is taken from the WSGI-compliant environ dict, whose strings were already mangled by wsgiserver.
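The double masquerading described here can be reproduced in plain Python. This is a minimal sketch that uses only the string from the report; the variable names are illustrative and nothing here calls CherryPy itself.

```python
# The method name from the report, as it should appear in the URL path.
name = "登入"

# One "masquerade": UTF-8 bytes presented as a latin-1 str.
once = name.encode("utf-8").decode("latin-1")

# Applying the same step a second time mangles the string further.
twice = once.encode("utf-8").decode("latin-1")

print(repr(once))
print(repr(twice))

# The escaped path in the access log above, written as a bytes literal.
logged = b"\xc3\xa7\xc2\x99\xc2\xbb\xc3\xa5\xc2\x85\xc2\xa5"
print(once.encode("utf-8") == logged)  # True

# Undoing one masquerade only recovers the name from the singly mangled form.
print(once.encode("latin-1").decode("utf-8") == name)   # True
print(twice.encode("latin-1").decode("utf-8") == name)  # False
```

The singly mangled string, once UTF-8 encoded, is the same byte sequence that appears in the logged 404 request, which is consistent with the conversion being applied one time too many.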

    Also this seems to be a duplicate of issue #1194.

  2. beholdmyglory

    This thing is a mess of code at multiple places incorrectly assuming certain character encodings. I've attempted to track exactly what's going on here, and this is what I've found.

    For testing purposes, I used the following code:

    import cherrypy
    
    class Site:
        @cherrypy.expose
        def index(self, param):
            print(param)  
            return "Hello"
    
    if __name__ == "__main__":
        cherrypy.quickstart(Site())
    

    I tested it by issuing a GET /index/私 request using Firefox. The exact request sent by the browser was the following:

    GET /index/%E7%A7%81 HTTP/1.1
    Host: localhost:8080
    User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Language: en-US,en;q=0.5
    Accept-Encoding: gzip, deflate
    DNT: 1
    Connection: keep-alive
    Cache-Control: max-age=0
    

    To start, CherryPy extracts the path, /index/%E7%A7%81, and converts the percent-encoded characters to byte sequences, yielding the bytestring b'/index/\xe7\xa7\x81'. It is important to note that the browser percent-encoded the path from its UTF-8 bytes, meaning that this bytestring is UTF-8-encoded text.
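The percent-decoding step can be checked with the standard library alone. This sketch uses urllib.parse rather than CherryPy's own parsing code, so it only illustrates the expected bytes, not how CherryPy obtains them.

```python
from urllib.parse import unquote_to_bytes

# Request path exactly as sent by the browser in the request above.
raw_path = "/index/%E7%A7%81"

# Percent-decoding yields raw bytes; no character encoding is assumed yet.
path_bytes = unquote_to_bytes(raw_path)
print(path_bytes)  # b'/index/\xe7\xa7\x81'

# Interpreting those bytes as UTF-8 gives back the intended path.
print(path_bytes.decode("utf-8"))  # /index/私
```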

    Following this, the path will be decoded and encoded at three different points before it reaches the index method:

    • wsgiserver3.py:2053 -- incorrectly decode path as latin1; environ['PATH_INFO'] is '/index/ç§\x81'
    • _cptree.py:297 -- attempt to convert from UTF-8 to latin1; environ['PATH_INFO'] is '/index/Ã§Â§Â\x81'
    • _cpwsgi.py:324 -- convert back from latin1 to UTF-8 once; u_path is '/index/ç§\x81'

    Ultimately the string reaches the index method through the param parameter, with the value 'ç§\x81'. Running 'ç§\x81'.encode('latin1').decode('utf-8') yields '私' as expected.
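The three conversions listed above can be replayed outside CherryPy. This is only a sketch: each step mimics the encode/decode described for that file and line, and the variable names (step1, step2, step3) are made up for illustration.

```python
# Bytes obtained after percent-decoding the request path.
path_bytes = b"/index/\xe7\xa7\x81"

# wsgiserver3.py: the UTF-8 bytes are decoded as latin-1.
step1 = path_bytes.decode("latin-1")

# _cptree.py: the str is encoded as UTF-8 and decoded as latin-1 again.
step2 = step1.encode("utf-8").decode("latin-1")

# _cpwsgi.py: one layer is undone (latin-1 bytes re-read as UTF-8).
step3 = step2.encode("latin-1").decode("utf-8")

for label, value in (("step1", step1), ("step2", step2), ("step3", step3)):
    print(label, repr(value))

# The handler receives the last path segment, still mangled once.
param = step3.rsplit("/", 1)[-1]
print(repr(param))

# One more round trip recovers the intended character.
print(param.encode("latin-1").decode("utf-8"))  # 私
```

Two mangling steps against a single undo step leave the value one layer off, which matches the parameter value observed in the handler.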

    From what I understand, the conversion in _cptree.py is done to be compliant with the WSGI standard, which dictates that the strings may only contain codepoints representable in latin1. Assuming this is correct, I see at least two solutions to this problem:

    1. Decode the string as UTF-8 instead of latin1 at wsgiserver3.py:2053. This seems like the easiest solution. Changing this seems to cause two unit tests to fail, however.
    2. Instead of decoding and encoding strings multiple times, use byte strings everywhere internally and only decode them when absolutely necessary, for example when looking up user-defined methods. I don't know very much about the WSGI specification though, so I don't know if using bytes instead of str is compliant.
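The second idea (keep bytes, decode once at lookup time) might look roughly like the sketch below. It is framework-free and hypothetical: find_handler and the Site class are invented for illustration, are not CherryPy APIs, and skip the exposure checks and WSGI handling that real dispatch would need.

```python
from urllib.parse import unquote_to_bytes


class Site:
    """Stand-in for an application root object (not a CherryPy class)."""

    def 登入(self):
        return "ok"


def find_handler(root, raw_path):
    """Resolve a percent-encoded path to an attribute of root.

    Hypothetical helper: the path stays as bytes until the single point
    where a text name is needed for the attribute lookup.
    """
    path_bytes = unquote_to_bytes(raw_path)
    obj = root
    for segment in path_bytes.split(b"/"):
        if not segment:
            continue  # skip empty parts produced by leading or doubled slashes
        try:
            name = segment.decode("utf-8")  # the only decode, done exactly once
        except UnicodeDecodeError:
            return None  # path is not valid UTF-8, so treat it as not found
        obj = getattr(obj, name, None)
        if obj is None:
            return None
    return obj


handler = find_handler(Site(), "/%E7%99%BB%E5%85%A5")
print(handler() if handler else "404")
```

Because the bytes are decoded only once, there is no intermediate latin-1 form to get out of sync, at the cost of carrying bytes through code that may expect str.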

    There are probably better solutions to the problem, but I'm not familiar enough with CherryPy's codebase to be able to spot them.

  3. Fake Name

    I just wanted to add that I've run into this issue as well.

    Currently, I'm doing some horrible per-situation patching where I'm re-interpreting mis-decoded strings where I know I may have UTF-8 characters, and it's really kind of a mess, necessitating spot-fixes all over the place.
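A spot-fix of the kind described above could be written as a small helper. This is a sketch under the assumption that the value was UTF-8 mis-decoded as latin-1 exactly once; reinterpret_utf8 is a hypothetical name, not something CherryPy provides.

```python
def reinterpret_utf8(value):
    """Undo a single latin-1 mis-decoding of UTF-8 text.

    Hypothetical helper. Returns the value unchanged when it cannot be
    re-read as UTF-8, for example when it was already decoded correctly.
    """
    try:
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value


print(reinterpret_utf8("ç§\x81"))  # 私
print(reinterpret_utf8("私"))       # 私 (already correct, left alone)
print(reinterpret_utf8("hello"))   # hello
```

The helper cannot tell a mangled string from genuine latin-1 text that happens to form valid UTF-8 bytes, which is one reason this kind of per-call patching is fragile.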

    FWIW, this is a problem on Python 3 as well as 2, so the whole "unicode or GTFO" approach Python 3 has taken hasn't affected the issue at all.
