:mod:trac.util.text -- Text manipulation

The Unicode toolbox

Trac internals are almost exclusively dealing with Unicode text, represented by unicode instances. The main advantage of using unicode over UTF-8 encoded str (as this used to be the case before version 0.10), is that text transformation functions in the present module will operate in a safe way on individual characters, and won't risk to eventually cut a multi-byte sequence in the middle. Similar issues with Python string handling routines are avoided as well, like surprising results when splitting text in lines. For example, did you know that "Priorità" is encoded as 'Priorit\xc3\x0a' in UTF-8? Calling strip() on this value in some locales can cut away the trailing \x0a and it's no longer valid UTF-8.

The drawback is that most of the outside world, while eventually "Unicode", is definitely not unicode. This is why we need to convert back and forth between str and unicode at the boundaries of the system. And more often than not we even have to guess which encoding is used in the incoming str strings.

Encoding unicode to str is usually directly performed by calling encode() on the unicode instance, while decoding is preferably left to the to_unicode helper function, which converts str to unicode in a robust, guaranteed successful way.