Issue #946 resolved

Problem with encoded text in multipart/form-data

Anonymous created an issue

The code below is a very slightly modified version of the cherrypy Page 0 tutorial which presents a multipart/form-data form. It only has one field: a textarea. In reality, I also have a file upload and other fields, but they're not necessary to reproduce the problem. I enter non-ASCII characters into the textarea and submit. Although I have set every encoding-related header and option I know of, I get a cherrypy traceback. This only occurs with multipart/form-data forms: a conventional form with no enctype behaves OK. Reproduced below are the cherrypy code, the traceback, and the headers (courtesy of the LiveHeaders addon) which Firefox sends with the request. (IE behaves the same way, FWIW.)

I'm only inserting one character into the form: U+201C, the infamous left quotation mark. The POST body seems to encode it correctly as UTF-8 (that's "\xe2\x80\x9c" you can see there), but cherrypy doesn't seem to have any way of picking that up.
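The decoding failure can be reproduced without CherryPy. The following is a minimal sketch in Python 3 (the report itself is from the Python 2 era, but the codec behaviour is the same); the variable names are illustrative only.

```python
# U+201C LEFT DOUBLE QUOTATION MARK, the character typed into the textarea
text = "\u201c"

# The browser submits the field as UTF-8 bytes
raw = text.encode("utf-8")

# Decoding those bytes as ASCII fails on the very first byte (0xe2)
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc

# Decoding them as UTF-8 recovers the original character
decoded = raw.decode("utf-8")
```

Here `error.start` is the byte offset at which decoding stopped, which corresponds to "position 0" in the traceback below.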

I'm running against the current Svn HEAD (r2470). Christian Wyglendowski confirmed on the mailing list that the code works ok in 3.1.2, producing the expected:

{'text': '\xe2\x80\x9c', 'submit': 'Create'}

Code:
{{{
import cherrypy

FORM = """
<form method="POST" enctype="multipart/form-data" accept-charset="utf-8">
<textarea name="text" value=""></textarea>
<p>
<input type="submit" id="create" name="submit" value="Create" />
<input type="submit" id="cancel" name="submit" value="Cancel" />
</p>
</form>
"""

class HelloWorld:
    def index(self, **kwargs):
        # Echo the submitted fields if there are any, else show the form
        if kwargs:
            return repr(kwargs)
        else:
            return FORM
    index.exposed = True

cherrypy.config.update({
    "global": {
        "tools.encode.on": True,
        "tools.encode.encoding": "utf-8",
    },
})
cherrypy.quickstart(HelloWorld())
}}}

TRACEBACK:
{{{
Traceback (most recent call last):
  File "c:\work_in_progress\cherrypy\", line 646, in respond
    self.body.process()
  File "c:\work_in_progress\cherrypy\", line 595, in process
    super(RequestBody, self).process()
  File "c:\work_in_progress\cherrypy\", line 281, in process
    proc(self)
  File "c:\work_in_progress\cherrypy\", line 82, in process_multipart_form_data
    process_multipart(entity)
  File "c:\work_in_progress\cherrypy\", line 76, in process_multipart
    part.process()
  File "c:\work_in_progress\cherrypy\", line 279, in process
    self.default_proc()
  File "c:\work_in_progress\cherrypy\", line 398, in default_proc
    result = self.read_lines_to_boundary()
  File "c:\work_in_progress\cherrypy\", line 387, in read_lines_to_boundary
    result = result.decode(self.encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
}}}

HEADERS:
{{{
POST / HTTP/1.1
Host: localhost:8080
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv: Gecko/20090715 Firefox/3.5.1 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: http://localhost:8080/
Content-Type: multipart/form-data; boundary=---------------------------16405529922056
Content-Length: 246

-----------------------------16405529922056
Content-Disposition: form-data; name="text"

“
-----------------------------16405529922056
Content-Disposition: form-data; name="submit"

Create
-----------------------------16405529922056--

HTTP/1.x 500 Internal Server Error
Date: Tue, 28 Jul 2009 09:33:26 GMT
Content-Length: 1886
Content-Type: text/html;charset=utf-8
Server: CherryPy/3.2.0
}}}



Comments (6)

  1. Robert Brewer

    This worked in 3.1 because `tools.decode` used a universal "default_encoding" argument, set to 'utf-8'.

    This does *not* work in 3.2 because we're actually following the MIME spec now, which says the default is US-ASCII.

    The fix would involve allowing `cherrypy.request.body.default_encoding` to trickle down to multipart parts.
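A rough sketch of what "trickling down" could mean, in Python 3. This is not CherryPy's code: `effective_charset`, `part_charset`, `body_default` and `MIME_DEFAULT` are hypothetical names used only to illustrate the precedence between a part's declared charset, an application-level default for the whole body, and the MIME default.

```python
MIME_DEFAULT = "ascii"  # US-ASCII, the MIME default mentioned above


def effective_charset(part_charset, body_default=None):
    """Pick the charset used to decode one multipart part.

    part_charset: charset declared on the part's own Content-Type, or None.
    body_default: default configured for the whole request body, or None.
    """
    return part_charset or body_default or MIME_DEFAULT


raw = b"\xe2\x80\x9c"  # the bytes the browser sent for U+201C

# No charset on the part and no body-level default: falls back to ASCII,
# which is the situation that raises UnicodeDecodeError in this report
fallback = effective_charset(None)

# With a body-level default of utf-8, the part inherits it and decodes
text = raw.decode(effective_charset(None, body_default="utf-8"))
```

A charset declared on the part itself still takes precedence over the inherited default in this sketch.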

  2. Robert Brewer

    Fixed in trunk in [2495]. Needs port to python3.

    I replaced the 3 Entity attributes (force_encoding, encoding, default_encoding) with a single list: "attempt_charsets". Benefits:

    1. Multiple charsets can be attempted.
    2. The encoding declared in the Content-Type request header (if any) can be both preceded by app-specified charsets, and also followed by them or by framework defaults.
    3. The 'decode' Tool from 3.1 has been reinstated with a backward-compatible API. It and any user tools which wish to modify request entity parsing or decoding can run at the 'before_request_body' hook.
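The idea of attempting several charsets in order can be sketched as follows in Python 3. This is an illustration of the technique only, not the code committed in [2495]; `decode_with_charsets` is a hypothetical helper, and the list passed to it merely stands in for "attempt_charsets".

```python
def decode_with_charsets(raw, attempt_charsets):
    """Return (text, charset) for the first charset that decodes raw.

    Hypothetical helper for illustration; not part of CherryPy.
    """
    for charset in attempt_charsets:
        try:
            return raw.decode(charset), charset
        except UnicodeDecodeError:
            # This charset cannot represent the bytes; try the next one
            continue
    raise ValueError("none of %r could decode the body" % (attempt_charsets,))


raw = b"\xe2\x80\x9c"  # U+201C as UTF-8, as in the original report

# ASCII is tried first and fails, so UTF-8 is used
text, used = decode_with_charsets(raw, ["ascii", "utf-8"])
```

Order matters in a scheme like this: a single-byte charset such as ISO-8859-1 accepts any byte sequence, so placing it before UTF-8 would silently produce mojibake instead of falling through.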
