Character encoding

Unicode good practice:

  • Handle all strings internally as unicode objects.
  • Protect the boundaries of the API, which means:
    • decode the incoming text (hopefully with the right encoding scheme), and
    • always indicate the encoding of outgoing text.
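The boundary rule can be sketched as follows. This is a minimal illustration, not code from the API: `handle_comment` is a hypothetical handler, and the incoming charset is assumed to be known.

```python
# Sketch of the "protect the boundaries" rule: decode incoming bytes once,
# work with text internally, and declare the charset of anything going out.
# handle_comment is hypothetical, not part of the real API.

def handle_comment(raw_body, charset="utf-8"):
    text = raw_body.decode(charset)        # boundary in: bytes -> text
    reply = u"stored: " + text             # all internal work is on text
    body = reply.encode("utf-8")           # boundary out: text -> bytes
    headers = {u"Content-Type": u"text/plain; charset=utf-8"}
    return headers, body

headers, body = handle_comment(u"café".encode("utf-8"))
```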

The chosen libraries provide some support:

  • Flask takes some care, decoding most incoming text to unicode for us (see "Handling incoming text" below).
  • JSON text is always Unicode, with a default encoding of UTF-8 (see RFC 4627).
  • The Requests library makes good guesses, but other HTTP clients may make different ones.

Ideally we would begin each Python file with:

from __future__ import unicode_literals

since, as a start, this ensures that all internal literal strings are unicode. We might even fiddle with reload(sys) and sys.setdefaultencoding() to disable the implicit default encoding so that all encoding has to be done explicitly.

However, many standard Python libraries are routinely used with str rather than unicode, which leads to surprises (for example, concatenating a non-ASCII str with a unicode string triggers an implicit ASCII decode and raises UnicodeDecodeError). Similarly, fiddling with sys and the default encoding scheme is unusual (and naughty!).

Therefore we do not do the above, but still remain vigilant.

Handling incoming text

Text can come in through:

  • the URL
  • HTTP headers (including cookies)
  • the HTTP message body

The arguments passed to the routing functions and request.path are all unicode. However, be aware that request.url is of type str, and the purl library also gives back str values. Fortunately request.args has str keys but unicode values; since a unicode string hashes to the same value as its (ASCII) str equivalent, dictionary lookup works correctly.

The keys and values of request.headers are both of type str.

By definition the raw message body (request.data) is of type str (i.e. text of unknown encoding). However, as noted above, the JSON libraries work entirely in unicode, so request.json contains only unicode text.
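To see why request.json holds only text, a quick standard-library check (independent of Flask) shows that json.loads always produces unicode values, whether the input uses \u escapes or literal UTF-8:

```python
import json

# json.loads always produces text (unicode) values -- so a parsed JSON
# body never carries raw bytes of unknown encoding.
doc = json.loads(u'{"comment": "caf\\u00e9"}')
assert doc[u"comment"] == u"caf\u00e9"
```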

Handling outgoing text

Text can leave the application server through:

  • HTTP headers (including cookies)
  • the HTTP message body

As with incoming text, the headers are not explicitly encoded but just handled as raw data. The message body goes through the standard library json module, which by default escapes everything down to ASCII (itself valid UTF-8). So the Content-Type: application/json effectively implies the character encoding.
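The two serialization behaviours can be seen directly with the json module (a small standalone demonstration, not code from the API):

```python
import json

payload = {u"comment": u"café"}

# Default: non-ASCII characters are \u-escaped, so the serialized form is
# pure ASCII (which is also valid UTF-8 on the wire).
escaped = json.dumps(payload)
assert escaped == u'{"comment": "caf\\u00e9"}'

# With ensure_ascii=False the result contains real non-ASCII characters,
# so it must be encoded explicitly before it leaves the process.
raw = json.dumps(payload, ensure_ascii=False).encode("utf-8")
assert json.loads(raw.decode("utf-8")) == payload
```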

Testing

This is a tricky area to get right.

The BDD tests provide some reassurance by storing and retrieving some non-ASCII data on the "comment" annotation for a person.

Note that the HTTP client code in the tests (starting with the Requests library) ultimately has to create an HTTP body. It does this by concatenating strings, which is prone to encoding errors: concatenating str and unicode invokes Python's default encoding of ASCII, and this of course fails on non-ASCII data. The solution is to ensure that all text passed into Requests is already correctly encoded, which means the HTTP headers are ASCII encoded and the JSON body is UTF-8 encoded.
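A hedged sketch of that preparation step (illustrative only, not the actual BDD test code): the header text is checked to be ASCII-safe and the JSON body is encoded to UTF-8 before anything reaches the HTTP client, so no implicit ASCII encode can be triggered by concatenation.

```python
import json

# Illustrative sketch, not the real test code: pre-encode everything that
# will be handed to the HTTP client.
comment = u"naïve café"
body = json.dumps({u"comment": comment}, ensure_ascii=False).encode("utf-8")

headers = {u"Content-Type": u"application/json; charset=utf-8"}
# Header text must be ASCII-encodable; this raises if it is not.
for key, value in headers.items():
    key.encode("ascii")
    value.encode("ascii")

# Round-trip check: the server side decodes the body as UTF-8.
assert json.loads(body.decode("utf-8"))[u"comment"] == comment
```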