Source

python-peps / pep-0393.txt

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
PEP: 393
Title: Flexible String Representation
Version: $Revision$
Last-Modified: $Date$
Author: Martin v. Löwis <martin@v.loewis.de>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 24-Jan-2010
Python-Version: 3.3
Post-History:

Abstract
========

The Unicode string type is changed to support multiple internal
representations, depending on the character with the largest Unicode
ordinal (1, 2, or 4 bytes). This will allow a space-efficient
representation in common cases, but give access to full UCS-4 on all
systems. For compatibility with existing APIs, several representations
may exist in parallel; over time, this compatibility should be phased
out. The distinction between narrow and wide Unicode builds is
dropped.  An implementation of this PEP is available at [1]_.

Rationale
=========

There are two classes of complaints about the current implementation
of the unicode type: on systems only supporting UTF-16, users complain
that non-BMP characters are not properly supported. On systems using
UCS-4 internally (and also sometimes on systems using UCS-2), there is
a complaint that Unicode strings take up too much memory - especially
compared to Python 2.x, where the same code would often use ASCII
strings (i.e. ASCII-encoded byte strings). With the proposed approach,
ASCII-only Unicode strings will again use only one byte per character;
while still allowing efficient indexing of strings containing non-BMP
characters (as strings containing them will use 4 bytes per
character).

One problem with the approach is support for existing applications
(e.g. extension modules). For compatibility, redundant representations
may be computed. Applications are encouraged to phase out reliance on
a specific internal representation if possible. As interaction with
other libraries will often require some sort of internal
representation, the specification chooses UTF-8 as the recommended way
of exposing strings to C code.

For many strings (e.g. ASCII), multiple representations may actually
share memory (e.g. the shortest form may be shared with the UTF-8 form
if all characters are ASCII). With such sharing, the overhead of
compatibility representations is reduced. If representations do share
data, it is also possible to omit structure fields, reducing the base
size of string objects.

Specification
=============

Unicode structures are now defined as a hierarchy of structures,
namely::

  typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    struct {
        unsigned int interned:2;
        unsigned int kind:2;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
    } state;
    wchar_t *wstr;
  } PyASCIIObject;

  typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;
    char *utf8;
    Py_ssize_t wstr_length;
  } PyCompactUnicodeObject;

  typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;
  } PyUnicodeObject;

Objects for which both size and maximum character are known at
creation time are called "compact" unicode objects; character data
immediately follow the base structure. If the maximum character is
less than 128, they use the PyASCIIObject structure, and the UTF-8
data, the UTF-8 length and the wstr length are the same as the length
of the ASCII data. For non-ASCII strings, the PyCompactObject
structure is used. Resizing compact objects is not supported.

Objects for which the maximum character is not given at creation time
are called "legacy" objects, created through
PyUnicode_FromStringAndSize(NULL, length). They use the
PyUnicodeObject structure. Initially, their data is only in the wstr
pointer; when PyUnicode_READY is called, the data pointer (union) is
allocated. Resizing is possible as long PyUnicode_READY has not been
called.

The fields have the following interpretations:

- length: number of code points in the string (result of sq_length)
- interned: interned-state (SSTATE_*) as in 3.2
- kind: form of string
    + 00 => str is not initialized (data are in wstr)
    + 01 => 1 byte (Latin-1)
    + 10 => 2 byte (UCS-2)
    + 11 => 4 byte (UCS-4);
- compact: the object uses one of the compact representations
  (implies ready)
- ascii: the object uses the PyASCIIObject representation
  (implies compact and ready)
- ready: the canonical representation is ready to be accessed through
  PyUnicode_DATA and PyUnicode_GET_LENGTH. This is set either if the
  object is compact, or the data pointer and length have been
  initialized.
- wstr_length, wstr: representation in platform's wchar_t
  (null-terminated). If wchar_t is 16-bit, this form may use surrogate
  pairs (in which cast wstr_length differs form length).
  wstr_length differs from length only if there are surrogate pairs
  in the representation.
- utf8_length, utf8: UTF-8 representation (null-terminated).
- data: shortest-form representation of the unicode string.
  The string is null-terminated (in its respective representation).

All three representations are optional, although the data form is
considered the canonical representation which can be absent only
while the string is being created. If the representation is absent,
the pointer is NULL, and the corresponding length field may contain
arbitrary data.

The Py_UNICODE type is still supported but deprecated. It is always
defined as a typedef for wchar_t, so the wstr representation can double
as Py_UNICODE representation.

The data and utf8 pointers point to the same memory if the string uses
only ASCII characters (using only Latin-1 is not sufficient). The data
and wstr pointers point to the same memory if the string happens to
fit exactly to the wchar_t type of the platform (i.e. uses some
BMP-not-Latin-1 characters if sizeof(wchar_t) is 2, and uses some
non-BMP characters if sizeof(wchar_t) is 4).

String Creation
---------------

The recommended way to create a Unicode object is to use the function
PyUnicode_New::

   PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar);

Both parameters must denote the eventual size/range of the strings.
In particular, codecs using this API must compute both the number of
characters and the maximum character in advance. An string is
allocated according to the specified size and character range and is
null-terminated; the actual characters in it may be uninitialized.

PyUnicode_FromString and PyUnicode_FromStringAndSize remain supported
for processing UTF-8 input; the input is decoded, and the UTF-8
representation is not yet set for the string.

PyUnicode_FromUnicode remains supported but is deprecated. If the
Py_UNICODE pointer is non-null, the data representation is set. If the
pointer is NULL, a properly-sized wstr representation is allocated,
which can be modified until PyUnicode_READY() is called (explicitly
or implicitly). Resizing a Unicode string remains possible until it
is finalized.

PyUnicode_READY() converts a string containing only a wstr
representation into the canonical representation. Unless wstr and data
can share the memory, the wstr representation is discarded after the
conversion. The macro returns 0 on success and -1 on failure, which
happens in particular if the memory allocation fails.

String Access
-------------

The canonical representation can be accessed using two macros
PyUnicode_Kind and PyUnicode_Data. PyUnicode_Kind gives one of the
values PyUnicode_WCHAR_KIND (0), PyUnicode_1BYTE_KIND (1),
PyUnicode_2BYTE_KIND (2), or PyUnicode_4BYTE_KIND (3). PyUnicode_DATA
gives the void pointer to the data. Access to individual characters
should use PyUnicode_{READ|WRITE}[_CHAR]:

 - PyUnicode_READ(kind, data, index)
 - PyUnicode_WRITE(kind, data, index, value)
 - PyUnicode_READ_CHAR(unicode, index)

All these macros assume that the string is in canonical form;
callers need to ensure this by calling PyUnicode_READY.

A new function PyUnicode_AsUTF8 is provided to access the UTF-8
representation. It is thus identical to the existing
_PyUnicode_AsString, which is removed. The function will compute the
utf8 representation when first called. Since this representation will
consume memory until the string object is released, applications
should use the existing PyUnicode_AsUTF8String where possible
(which generates a new string object every time). APIs that implicitly
converts a string to a char* (such as the ParseTuple functions) will
use PyUnicode_AsUTF8 to compute a conversion.

New API
-------

This section summarizes the API additions.

Macros to access the internal representation of a Unicode object
(read-only):

- PyUnicode_IS_COMPACT_ASCII(o), PyUnicode_IS_COMPACT(o),
  PyUnicode_IS_READY(o)
- PyUnicode_GET_LENGTH(o)
- PyUnicode_KIND(o), PyUnicode_CHARACTER_SIZE(o),
  PyUnicode_MAX_CHAR_VALUE(o)
- PyUnicode_DATA(o), PyUnicode_1BYTE_DATA(o), PyUnicode_2BYTE_DATA(o),
  PyUnicode_4BYTE_DATA(o)

Character access macros:

- PyUnicode_READ(kind, data, index), PyUnicode_READ_CHAR(o, index)
- PyUnicode_WRITE(kind, data, index, value)

Other macros:

- PyUnicode_READY(o)
- PyUnicode_CONVERT_BYTES(from_type, to_type, begin, end, to)

String creation functions:

- PyUnicode_New(size, maxchar)
- PyUnicode_FromKindAndData(kind, data, size)
- PyUnicode_Substring(o, start, end)

Character access utility functions:

- PyUnicode_GetLength(o), PyUnicode_ReadChar(o, index),
  PyUnicode_WriteChar(o, index, character)
- PyUnicode_CopyCharacters(to, to_start, from, from_start, how_many)
- PyUnicode_FindChar(str, ch, start, end, direction)

Representation conversion:

- PyUnicode_AsUCS4(o, buffer, buflen)
- PyUnicode_AsUCS4Copy(o)
- PyUnicode_AsUnicodeAndSize(o, size_out)
- PyUnicode_AsUTF8(o)
- PyUnicode_AsUTF8AndSize(o, size_out)

UCS4 utility functions:

- Py_UCS4_{strlen, strcpy, strcat, strncpy, strcmp, strncpy, strcmp,
  strncmp, strchr, strrchr}

Stable ABI
----------

The following functions are added to the stable ABI (PEP 384), as they
are independent of the actual representation of Unicode objects:
PyUnicode_New, PyUnicode_Substring, PyUnicode_GetLength,
PyUnicode_ReadChar, PyUnicode_WriteChar, PyUnicode_Find,
PyUnicode_FindChar.

GDB Debugging Hooks
-------------------
Tools/gdb/libpython.py contains debugging hooks that embed knowledge
about the internals of CPython's data types, include PyUnicodeObject
instances.  It has been updated to track the change.

Deprecations, Removals, and Incompatibilities
---------------------------------------------

While the Py_UNICODE representation and APIs are deprecated with this
PEP, no removal of the respective APIs is scheduled. The APIs should
remain available at least five years after the PEP is accepted; before
they are removed, existing extension modules should be studied to find
out whether a sufficient majority of the open-source code on PyPI has
been ported to the new API. A reasonable motivation for using the
deprecated API even in new code is for code that shall work both on
Python 2 and Python 3.

The following macros and functions are deprecated:

- PyUnicode_FromUnicode
- PyUnicode_GET_SIZE, PyUnicode_GetSize, PyUnicode_GET_DATA_SIZE,
- PyUnicode_AS_UNICODE, PyUnicode_AsUnicode, PyUnicode_AsUnicodeAndSize
- PyUnicode_COPY, PyUnicode_FILL, PyUnicode_MATCH
- PyUnicode_Encode, PyUnicode_EncodeUTF7, PyUnicode_EncodeUTF8,
  PyUnicode_EncodeUTF16, PyUnicode_EncodeUTF32,
  PyUnicode_EncodeUnicodeEscape, PyUnicode_EncodeRawUnicodeEscape,
  PyUnicode_EncodeLatin1, PyUnicode_EncodeASCII,
  PyUnicode_EncodeCharmap, PyUnicode_TranslateCharmap,
  PyUnicode_EncodeMBCS, PyUnicode_EncodeDecimal,
  PyUnicode_TransformDecimalToASCII
- Py_UNICODE_{strlen, strcat, strcpy, strcmp, strchr, strrchr}
- PyUnicode_AsUnicodeCopy
- PyUnicode_GetMax

_PyUnicode_AsDefaultEncodedString is removed. It previously returned a
borrowed reference to an UTF-8-encoded bytes object. Since the unicode
object cannot anymore cache such a reference, implementing it without
leaking memory is not possible. No deprecation phase is provided,
since it was an API for internal use only.

Extension modules using the legacy API may inadvertently call
PyUnicode_READY, by calling some API that requires that the object is
ready, and then continue accessing the (now invalid) Py_UNICODE
pointer. Such code will break with this PEP. The code was already
flawed in 3.2, as there is was no explicit guarantee that the
PyUnicode_AS_UNICODE result would stay valid after an API call (due to
the possibility of string resizing). Modules that face this issue
need to re-fetch the Py_UNICODE pointer after API calls; doing
so will continue to work correctly in earlier Python versions.

Discussion
==========

Several concerns have been raised about the approach presented here:

It makes the implementation more complex. That's true, but considered
worth it given the benefits.

The Py_UNICODE representation is not instantaneously available,
slowing down applications that request it. While this is also true,
applications that care about this problem can be rewritten to use the
data representation.

Performance
-----------

Performance of this patch must be considered for both memory
consumption and runtime efficiency. For memory consumption, the
expectation is that applications that have many large strings will see
a reduction in memory usage. For small strings, the effects depend on
the pointer size of the system, and the size of the Py_UNICODE/wchar_t
type. The following table demonstrates this for various small ASCII
and Latin-1 string sizes and platforms.

+-------+---------------------------------+---------------------------------+
|string | Python 3.2                      | This PEP                        |
|size   +----------------+----------------+----------------+----------------+
|       | 16-bit wchar_t | 32-bit wchar_t |   ASCII        |   Latin-1      |
|       +---------+------+--------+-------+--------+-------+--------+-------+
|       | 32-bit  |64-bit| 32-bit |64-bit | 32-bit |64-bit | 32-bit |64-bit |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|1      | 32      | 64   | 40     |  64   | 32     | 56    | 40     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|2      | 40      | 64   | 40     |  72   | 32     | 56    | 40     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|3      | 40      | 64   | 48     |  72   | 32     | 56    | 40     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|4      | 40      | 72   | 48     |  80   | 32     | 56    | 48     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|5      | 40      | 72   | 56     |  80   | 32     | 56    | 48     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|6      | 48      | 72   | 56     |  88   | 32     | 56    | 48     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|7      | 48      | 72   | 64     |  88   | 32     | 56    | 48     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|8      | 48      | 80   | 64     |  96   | 40     | 64    | 48     | 88    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+

The runtime effect is significantly affected by the API being
used. After porting the relevant pieces of code to the new API,
the iobench, stringbench, and json benchmarks see typically
slowdowns of 1% to 30%; for specific benchmarks, speedups may
happen as may happen significantly larger slowdowns.

In actual measurements of a Django application ([2]_), significant
reductions of memory usage could be found. For example, the storage
for Unicode objects reduced to 2216807 bytes, down from 6378540 bytes
for a wide Unicode build, and down from 3694694 bytes for a narrow
Unicode build (all on a 32-bit system). This reduction came from the
prevalence of ASCII strings in this application; out of 36,000 strings
(with 1,310,000 chars), 35713 where ASCII strings (with 1,300,000
chars). The sources for these strings where not further analysed;
many of them likely originate from identifiers in the library, and
string constants in Django's source code.

In comparison to Python 2, both Unicode and byte strings need to be
accounted. In the test application, Unicode and byte strings combined
had a length of 2,046,000 units (bytes/chars) in 2.x, and 2,200,000
units in 3.x. On a 32-bit system, where the 2.x build used 32-bit
wchar_t/Py_UNICODE, the 2.x test used 3,620,000 bytes, and the 3.x
build 3,340,000 bytes. This reduction in 3.x using the PEP compared
to 2.x only occurs when comparing with a wide unicode build.

Porting Guidelines
==================

Only a small fraction of C code is affected by this PEP, namely code
that needs to look "inside" unicode strings.  That code doesn't
necessarily need to be ported to this API, as the existing API will
continue to work correctly. In particular, modules that need to
support both Python 2 and Python 3 might get too complicated when
simultaneously supporting this new API and the old Unicode API.

In order to port modules to the new API, try to eliminate
the use of these API elements:

- the Py_UNICODE type,
- PyUnicode_AS_UNICODE and PyUnicode_AsUnicode,
- PyUnicode_GET_SIZE and PyUnicode_GetSize, and
- PyUnicode_FromUnicode.

When iterating over an existing string, or looking at specific
characters, use indexing operations rather than pointer arithmetic;
indexing works well for PyUnicode_READ(_CHAR) and PyUnicode_WRITE. Use
void* as the buffer type for characters to let the compiler detect
invalid dereferencing operations. If you do want to use pointer
arithmetics (e.g. when converting existing code), use (unsigned)
char* as the buffer type, and keep the element size (1, 2, or 4) in a
variable. Notice that (1<<(kind-1)) will produce the element size
given a buffer kind.

When creating new strings, it was common in Python to start of with a
heuristical buffer size, and then grow or shrink if the heuristics
failed. With this PEP, this is now less practical, as you need not
only a heuristics for the length of the string, but also for the
maximum character.

In order to avoid heuristics, you need to make two passes over the
input: once to determine the output length, and the maximum character;
then allocate the target string with PyUnicode_New and iterate over
the input a second time to produce the final output. While this may
sound expensive, it could actually be cheaper than having to copy the
result again as in the following approach.

If you take the heuristical route, avoid allocating a string meant to
be resized, as resizing strings won't work for their canonical
representation.  Instead, allocate a separate buffer to collect the
characters, and then construct a unicode object from that using
PyUnicode_FromKindAndData. One option is to use Py_UCS4 as the buffer
element, assuming for the worst case in character ordinals. This will
allow for pointer arithmetics, but may require a lot of memory.
Alternatively, start with a 1-byte buffer, and increase the element
size as you encounter larger characters. In any case,
PyUnicode_FromKindAndData will scan over the buffer to verify the
maximum character.

For common tasks, direct access to the string representation may not
be necessary: PyUnicode_Find, PyUnicode_FindChar, PyUnicode_Ord, and
PyUnicode_CopyCharacters help in analyzing and creating string
objects, operating on indexes instead of data pointers.

References
==========

.. [1] PEP 393 branch
       https://bitbucket.org/t0rsten/pep-393
.. [2] Django measurement results
       http://www.dcl.hpi.uni-potsdam.de/home/loewis/djmemprof/

Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:
Tip: Filter by directory path e.g. /media app.js to search for public/media/app.js.
Tip: Use camelCasing e.g. ProjME to search for ProjectModifiedEvent.java.
Tip: Filter by extension type e.g. /repo .js to search for all .js files in the /repo directory.
Tip: Separate your search with spaces e.g. /ssh pom.xml to search for src/ssh/pom.xml.
Tip: Use ↑ and ↓ arrow keys to navigate and return to view the file.
Tip: You can also navigate files with Ctrl+j (next) and Ctrl+k (previous) and view the file with Ctrl+o.
Tip: You can also navigate files with Alt+j (next) and Alt+k (previous) and view the file with Alt+o.