PEP: 3116
Title: New I/O
Version: $Revision$
Last-Modified: $Date$
Author: Daniel Stutzbach <daniel@stutzbachenterprises.com>,
        Guido van Rossum <guido@python.org>,
        Mike Verdone <mike.verdone@gmail.com>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 26-Feb-2007
Python-Version: 3.0
Post-History: 26-Feb-2007

Rationale and Goals
===================

Python allows for a variety of stream-like (a.k.a. file-like) objects
that can be used via ``read()`` and ``write()`` calls.  Anything that
provides ``read()`` and ``write()`` is stream-like.  However, more
exotic and extremely useful functions like ``readline()`` or
``seek()`` may or may not be available on every stream-like object.
Python needs a specification for basic byte-based I/O streams to which
we can add buffering and text-handling features.

Once we have a defined raw byte-based I/O interface, we can add
buffering and text handling layers on top of any byte-based I/O class.
The same buffering and text handling logic can be used for files,
sockets, byte arrays, or custom I/O classes developed by Python
programmers.  Developing a standard definition of a stream lets us
separate stream-based operations like ``read()`` and ``write()`` from
implementation specific operations like ``fileno()`` and ``isatty()``.
It encourages programmers to write code that uses streams as streams
and does not require that all streams support file-specific or
socket-specific operations.

The new I/O spec is intended to be similar to the Java I/O libraries,
but generally less confusing.  Programmers who don't want to muck
about in the new I/O world can expect that the ``open()`` factory
method will produce an object backwards-compatible with old-style file
objects.


Specification
=============

The Python I/O Library will consist of three layers: a raw I/O layer,
a buffered I/O layer, and a text I/O layer.  Each layer is defined by
an abstract base class, which may have multiple implementations.  The
raw I/O and buffered I/O layers deal with units of bytes, while the
text I/O layer deals with units of characters.
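
As a non-normative illustration, the three layers can be stacked by
hand.  The sketch below assumes the classes are importable from a
module named ``io`` (as in current Python releases) and uses a
scratch file created with ``tempfile``; the variable names are
illustrative only.

```python
import io
import os
import tempfile

# Create a small UTF-8 file to read back through all three layers.
fd, path = tempfile.mkstemp()
os.write(fd, "caf\u00e9\n".encode("utf-8"))
os.close(fd)

raw = io.FileIO(path, "r")                          # raw layer: bytes, unbuffered
buffered = io.BufferedReader(raw)                   # buffered layer: bytes
text = io.TextIOWrapper(buffered, encoding="utf-8") # text layer: characters

line = text.readline()                              # decoded characters
layers_linked = (text.buffer is buffered) and (buffered.raw is raw)

text.close()                                        # closes the layers below too
os.remove(path)
```

Each layer keeps a reference to the one beneath it (``.buffer`` and
``.raw``), so closing the outermost object is enough to release the
file.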


Raw I/O
=======

The abstract base class for raw I/O is RawIOBase.  It has several
methods which are wrappers around the appropriate operating system
calls.  If one of these functions would not make sense on the object,
the implementation must raise an IOError exception.  For example, if a
file is opened read-only, the ``.write()`` method will raise an
``IOError``.  As another example, if the object represents a socket,
then ``.seek()``, ``.tell()``, and ``.truncate()`` will raise an
``IOError``.  Generally, a call to one of these functions maps to
exactly one operating system call.

    ``.read(n: int) -> bytes``

       Read up to ``n`` bytes from the object and return them.  Fewer
       than ``n`` bytes may be returned if the operating system call
       returns fewer than ``n`` bytes.  If 0 bytes are returned, this
       indicates end of file.  If the object is in non-blocking mode
       and no bytes are available, the call returns ``None``.

    ``.readinto(b: bytes) -> int``

       Read up to ``len(b)`` bytes from the object and store them in
       ``b``, returning the number of bytes read.  Like ``.read()``, fewer
       than ``len(b)`` bytes may be read, and 0 indicates end of file.
       ``None`` is returned if a non-blocking object has no bytes
       available.  The length of ``b`` is never changed.

    ``.write(b: bytes) -> int``

        Returns number of bytes written, which may be ``< len(b)``.

    ``.seek(pos: int, whence: int = 0) -> int``

    ``.tell() -> int``

    ``.truncate(n: int = None) -> int``

    ``.close() -> None``

Additionally, it defines a few other methods:

    ``.readable() -> bool``

       Returns ``True`` if the object was opened for reading,
       ``False`` otherwise.  If ``False``, ``.read()`` will raise an
       ``IOError`` if called.

    ``.writable() -> bool``

       Returns ``True`` if the object was opened for writing,
       ``False`` otherwise.  If ``False``, ``.write()`` and
       ``.truncate()`` will raise an ``IOError`` if called.

    ``.seekable() -> bool``

       Returns ``True`` if the object supports random access (such as
       disk files), or ``False`` if the object only supports
       sequential access (such as sockets, pipes, and ttys).  If
       ``False``, ``.seek()``, ``.tell()``, and ``.truncate()`` will
       raise an IOError if called.

    ``.__enter__() -> ContextManager``

       Context management protocol.  Returns ``self``.

    ``.__exit__(...) -> None``

       Context management protocol.  Same as ``.close()``.
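
The following sketch exercises the raw methods above against a
scratch file.  It assumes the ``io.FileIO`` class of current Python
releases; a mutable ``bytearray`` stands in for the buffer passed to
``.readinto()``, and ``OSError`` is caught because ``IOError`` is an
alias of it in current Python.

```python
import io
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"abcdef")
os.close(fd)

with io.FileIO(path, "r") as raw:       # __enter__ returns the object itself
    first = raw.read(4)                 # up to 4 bytes
    rest = raw.read(4)                  # only 2 bytes remain: a short result
    at_eof = raw.read(4)                # zero bytes signals end of file

    target = bytearray(3)
    raw.seek(0)
    count = raw.readinto(target)        # fills the caller's buffer in place

    flags = (raw.readable(), raw.writable(), raw.seekable())
    try:
        raw.write(b"x")                 # the object was not opened for writing
        write_failed = False
    except OSError:
        write_failed = True

closed_after_with = raw.closed          # __exit__ behaves like close()
os.remove(path)
```

Note that the length of ``target`` is unchanged by ``.readinto()``;
only its contents are overwritten.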

If and only if a ``RawIOBase`` implementation operates on an
underlying file descriptor, it must additionally provide a
``.fileno()`` member function.  This could be defined specifically by
the implementation, or a mix-in class could be used (need to decide
about this).

    ``.fileno() -> int``

       Returns the underlying file descriptor (an integer).

Initially, three implementations will be provided that implement the
``RawIOBase`` interface: ``FileIO``, ``SocketIO`` (in the socket
module), and ``BytesIO``.  Each implementation must determine whether
the object supports random access as the information provided by the
user may not be sufficient (consider ``open("/dev/tty", "rw")`` or
``open("/tmp/named-pipe", "rw")``).  As an example, ``FileIO`` can
determine this by calling the ``seek()`` system call; if it returns an
error, the object does not support random access.  Each implementation
may provide additional methods appropriate to its type.  The
``BytesIO`` object is analogous to Python 2's ``cStringIO`` library,
but operating on the new bytes type instead of strings.
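
To illustrate why the implementation, rather than the caller, has to
determine seekability, the sketch below wraps both a regular file and
one end of a pipe in ``io.FileIO``.  It assumes a POSIX system and
the ``io`` module of current Python releases.

```python
import io
import os
import tempfile

# A regular file supports random access.
fd, path = tempfile.mkstemp()
os.close(fd)
with io.FileIO(path, "r") as disk_file:
    file_seekable = disk_file.seekable()
os.remove(path)

# A pipe is sequential-only, although it is also just a file descriptor.
read_fd, write_fd = os.pipe()
with io.FileIO(read_fd, "r") as pipe_end:
    pipe_seekable = pipe_end.seekable()
    pipe_fileno_matches = pipe_end.fileno() == read_fd
    try:
        pipe_end.seek(0)                # the underlying seek call fails
        seek_failed = False
    except OSError:
        seek_failed = True
os.close(write_fd)
```

Both objects were opened with the same mode string, so the mode alone
could not have distinguished them.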


Buffered I/O
============

The next layer is the Buffered I/O layer which provides more efficient
access to file-like objects.  The abstract base class for all Buffered
I/O implementations is ``BufferedIOBase``, which provides similar methods
to RawIOBase:

    ``.read(n: int = -1) -> bytes``

       Returns the next ``n`` bytes from the object.  It may return
       fewer than ``n`` bytes if end-of-file is reached or the object is
       non-blocking.  0 bytes indicates end-of-file.  This method may
       make multiple calls to ``RawIOBase.read()`` to gather the bytes,
       or may make no calls to ``RawIOBase.read()`` if all of the needed
       bytes are already buffered.

    ``.readinto(b: bytes) -> int``

    ``.write(b: bytes) -> int``

       Write the bytes in ``b`` to the buffer.  The bytes are not guaranteed to
       be written to the Raw I/O object immediately; they may be
       buffered.  Returns ``len(b)``.

    ``.seek(pos: int, whence: int = 0) -> int``

    ``.tell() -> int``

    ``.truncate(pos: int = None) -> int``

    ``.flush() -> None``

    ``.close() -> None``

    ``.readable() -> bool``

    ``.writable() -> bool``

    ``.seekable() -> bool``

    ``.__enter__() -> ContextManager``

    ``.__exit__(...) -> None``

Additionally, the abstract base class provides one member variable:

    ``.raw``

       A reference to the underlying ``RawIOBase`` object.

The ``BufferedIOBase`` method signatures are mostly identical to those
of ``RawIOBase`` (exceptions: ``write()`` always returns ``len(b)``,
``read()``'s argument is optional), but may have different semantics.
In particular, ``BufferedIOBase`` implementations may read more data
than requested or delay writing data using buffers.  For the most
part, this will be transparent to the user (unless, for example, they
open the same file through a different descriptor).  Also, raw reads
may return a short read without any particular reason; buffered reads
will only return a short read if EOF is reached; and raw writes may
return a short count (even when non-blocking I/O is not enabled!),
while buffered writes will raise ``IOError`` when not all bytes could
be written or buffered.
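
The difference between raw and buffered short reads can be sketched
with a deliberately awkward raw object.  ``TrickleRaw`` below is a
hypothetical class written for this illustration (it is not part of
the proposed library); it never returns more than two bytes per
call.  The ``io`` module of current Python releases is assumed.

```python
import io

class TrickleRaw(io.RawIOBase):
    """Hypothetical raw stream that hands back at most 2 bytes per call."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def readable(self):
        return True

    def readinto(self, b):
        chunk = self._data[self._pos:self._pos + min(2, len(b))]
        b[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)               # 0 means end of file

# One underlying call, so the raw layer returns a short read.
raw_result = TrickleRaw(b"abcdefg").read(5)

buffered = io.BufferedReader(TrickleRaw(b"abcdefg"))
full = buffered.read(5)                 # several raw reads are combined
tail = buffered.read(5)                 # short only because EOF was reached
eof = buffered.read(5)                  # zero bytes: end of file
```

The caller of the buffered object never sees the two-byte granularity
of the raw object underneath.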

There are four implementations of the ``BufferedIOBase`` abstract base
class, described below.


``BufferedReader``
------------------

The ``BufferedReader`` implementation is for sequential-access
read-only objects.  Its ``.flush()`` method is a no-op.


``BufferedWriter``
------------------

The ``BufferedWriter`` implementation is for sequential-access
write-only objects.  Its ``.flush()`` method forces all cached data to
be written to the underlying RawIOBase object.
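
A minimal sketch of the effect of ``.flush()``, assuming the ``io``
module of current Python releases.  An in-memory ``io.BytesIO``
stands in for the underlying raw object so that its contents can be
inspected; the buffer size of 64 is an arbitrary choice.

```python
import io

raw = io.BytesIO()                      # stand-in for the underlying raw object
writer = io.BufferedWriter(raw, buffer_size=64)

written = writer.write(b"hello")        # accepted in full, but only into the buffer
before_flush = raw.getvalue()           # nothing has reached the raw object yet

writer.flush()                          # push the cached data down a layer
after_flush = raw.getvalue()
```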


``BufferedRWPair``
------------------

The ``BufferedRWPair`` implementation is for sequential-access
read-write objects such as sockets and ttys.  As the read and write
streams of these objects are completely independent, it could be
implemented by simply incorporating a ``BufferedReader`` and
``BufferedWriter`` instance.  It provides a ``.flush()`` method that
has the same semantics as a ``BufferedWriter``'s ``.flush()`` method.
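
As an illustration, the sketch below pairs two independent in-memory
streams, which stand in for the read and write sides of a socket.  It
assumes ``io.BufferedRWPair`` as found in current Python releases,
where the reader and the writer are passed as separate objects.

```python
import io

incoming = io.BytesIO(b"ping\n")        # stand-in for the read side
outgoing = io.BytesIO()                 # stand-in for the write side

pair = io.BufferedRWPair(incoming, outgoing)

request = pair.read(5)                  # served by the internal reader
pair.write(b"pong\n")                   # held by the internal writer
pending = outgoing.getvalue()           # still empty: not flushed yet
pair.flush()                            # same effect as BufferedWriter.flush()
reply = outgoing.getvalue()
```

Reading never disturbs the pending write data, because the two
directions do not share a position.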


``BufferedRandom``
------------------

The ``BufferedRandom`` implementation is for all random-access
objects, whether they are read-only, write-only, or read-write.
Compared to the previous classes that operate on sequential-access
objects, the ``BufferedRandom`` class must contend with the user
calling ``.seek()`` to reposition the stream.  Therefore, an instance
of ``BufferedRandom`` must keep track of both the logical and true
position within the object.  It provides a ``.flush()`` method that
forces all cached write data to be written to the underlying
``RawIOBase`` object and all cached read data to be forgotten (so that
future reads are forced to go back to the disk).

*Q: Do we want to mandate in the specification that switching between
reading and writing on a read-write object implies a .flush()?  Or is
that an implementation convenience that users should not rely on?*

For a read-only ``BufferedRandom`` object, ``.writable()`` returns
``False`` and the ``.write()`` and ``.truncate()`` methods raise
``IOError``.

For a write-only ``BufferedRandom`` object, ``.readable()`` returns
``False`` and the ``.read()`` method raises ``IOError``.
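
The interplay of reading, seeking, and writing on one random-access
object can be sketched as follows.  The example assumes
``io.BufferedRandom`` from current Python releases and uses an
in-memory ``io.BytesIO`` as a stand-in for a disk file.

```python
import io

raw = io.BytesIO(b"hello world")        # in-memory stand-in for a disk file
stream = io.BufferedRandom(raw)

head = stream.read(5)
logical_pos = stream.tell()             # logical position, whatever was read ahead

stream.seek(6)
stream.write(b"W")                      # overwrite one byte in place
stream.flush()                          # force the write down to the raw object

stream.seek(0)
contents = stream.read()
raw_contents = raw.getvalue()
```

Although the first ``.read(5)`` may have pulled more than five bytes
into the buffer, ``.tell()`` reports the logical position, and the
later write lands at the offset the user asked for.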


Text I/O
========

The text I/O layer provides functions to read and write strings from
streams.  Some new features include universal newlines and character
set encoding and decoding.  The Text I/O layer is defined by a
``TextIOBase`` abstract base class.  It provides several methods that
are similar to the ``BufferedIOBase`` methods, but operate on a
per-character basis instead of a per-byte basis.  These methods are:

    ``.read(n: int = -1) -> str``

    ``.write(s: str) -> int``

    ``.tell() -> object``

        Return a cookie describing the current file position.
        The only supported use for the cookie is with ``.seek()``
        with ``whence`` set to 0 (i.e. absolute seek).

    ``.seek(pos: object, whence: int = 0) -> int``

        Seek to position ``pos``.  If ``pos`` is non-zero, it must
        be a cookie returned from ``.tell()`` and ``whence`` must be zero.

    ``.truncate(pos: object = None) -> int``

        Like ``BufferedIOBase.truncate()``, except that ``pos`` (if
        not ``None``) must be a cookie previously returned by ``.tell()``.

Unlike with raw I/O, the units for ``.seek()`` are not specified - some
implementations (e.g. ``StringIO``) use characters and others
(e.g. ``TextIOWrapper``) use bytes.  The special case for zero is to
allow going to the start or end of a stream without a prior
``.tell()``.  An implementation could include stream encoder state in
the cookie returned from ``.tell()``.
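
The cookie protocol can be sketched as below, assuming
``io.TextIOWrapper`` from current Python releases over an in-memory
byte stream.  The sample text contains multi-byte UTF-8 characters,
so character counts and byte offsets differ; the code therefore
treats the value returned by ``.tell()`` as opaque.

```python
import io

data = "d\u00e9j\u00e0 vu\nsecond line\n".encode("utf-8")
text = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8", newline="")

first = text.readline()
cookie = text.tell()                    # opaque marker; no arithmetic on it
second = text.readline()

text.seek(cookie)                       # absolute seek back to the saved position
second_again = text.readline()

text.seek(0)                            # zero is allowed without a prior tell()
first_again = text.readline()
```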

    
``TextIOBase`` implementations also provide several methods that are
pass-throughs to the underlying ``BufferedIOBase`` objects:

    ``.flush() -> None``

    ``.close() -> None``

    ``.readable() -> bool``

    ``.writable() -> bool``

    ``.seekable() -> bool``

``TextIOBase`` class implementations additionally provide the
following methods:

    ``.readline() -> str``

        Read until newline or EOF and return the line, or ``""`` if
        EOF hit immediately.

    ``.__iter__() -> Iterator``

        Returns an iterator that returns lines from the file (which
        happens to be ``self``).

    ``.__next__() -> str``

        Same as ``readline()`` except raises ``StopIteration`` if EOF
        hit immediately.
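
The different end-of-file conventions of ``.readline()`` and
iteration are sketched below.  The example assumes ``io.StringIO``
from current Python releases and drives the iterator through the
built-in ``next()`` function.

```python
import io

stream = io.StringIO("alpha\nbeta\n")

same_object = iter(stream) is stream    # the stream is its own iterator
line1 = stream.readline()
line2 = next(stream)                    # iteration yields the same lines

at_eof = stream.readline()              # "" at end of file, no exception
try:
    next(stream)
    stopped = False
except StopIteration:                   # the iterator protocol signals EOF this way
    stopped = True
```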

Two implementations will be provided by the Python library.  The
primary implementation, ``TextIOWrapper``, wraps a Buffered I/O
object.  Each ``TextIOWrapper`` object has a property named
"``.buffer``" that provides a reference to the underlying
``BufferedIOBase`` object.  Its initializer has the following
signature:

    ``.__init__(self, buffer, encoding=None, errors=None, newline=None, line_buffering=False)``

        ``buffer`` is a reference to the ``BufferedIOBase`` object to
        be wrapped with the ``TextIOWrapper``.

        ``encoding`` refers to an encoding to be used for translating
        between the byte-representation and character-representation.
        If it is ``None``, then the system's locale setting will be
        used as the default.

        ``errors`` is an optional string indicating error handling.
        It may be set whenever ``encoding`` may be set.  It defaults
        to ``'strict'``.

        ``newline`` can be ``None``, ``''``, ``'\n'``, ``'\r'``, or
        ``'\r\n'``; all other values are illegal.  It controls the
        handling of line endings.  It works as follows:

        * On input, if ``newline`` is ``None``, universal newlines
          mode is enabled.  Lines in the input can end in ``'\n'``,
          ``'\r'``, or ``'\r\n'``, and these are translated into
          ``'\n'`` before being returned to the caller.  If it is
          ``''``, universal newline mode is enabled, but line endings
          are returned to the caller untranslated.  If it has any of
          the other legal values, input lines are only terminated by
          the given string, and the line ending is returned to the
          caller untranslated.  (In other words, translation to
          ``'\n'`` only occurs if ``newline`` is ``None``.)

        * On output, if ``newline`` is ``None``, any ``'\n'``
          characters written are translated to the system default
          line separator, ``os.linesep``.  If ``newline`` is ``''``,
          no translation takes place.  If ``newline`` is any of the
          other legal values, any ``'\n'`` characters written are
          translated to the given string.  (Note that the rules
          guiding translation are different for output than for
          input.)

        ``line_buffering``, if True, causes ``write()`` calls to imply
        a ``flush()`` if the string written contains at least one
        ``'\n'`` or ``'\r'`` character.  This is set by ``open()``
        when it detects that the underlying stream is a TTY device,
        or when a ``buffering`` argument of ``1`` is passed.

        Further notes on the ``newline`` parameter:

        * ``'\r'`` support is still needed for some OSX applications
          that produce files using ``'\r'`` line endings; Excel (when
          exporting to text) and Adobe Illustrator EPS files are the
          most common examples.

        * If translation is enabled, it happens regardless of which
          method is called for reading or writing.  For example,
          ``f.read()`` will always produce the same result as
          ``''.join(f.readlines())``.

        * If universal newlines without translation are requested on
          input (i.e. ``newline=''``), if a system read operation
          returns a buffer ending in ``'\r'``, another system read
          operation is done to determine whether it is followed by
          ``'\n'`` or not.  In universal newlines mode with
          translation, the second system read operation may be
          postponed until the next read request, and if the following
          system read operation returns a buffer starting with
          ``'\n'``, that character is simply discarded.
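
The ``newline`` rules above can be sketched with in-memory streams.
The example assumes ``io.TextIOWrapper`` from current Python
releases; ``read_lines`` and ``written_bytes`` are helper names
invented for this illustration.

```python
import io

SAMPLE = b"one\r\ntwo\rthree\n"

def read_lines(newline):
    """Decode SAMPLE with the given newline setting and return its lines."""
    reader = io.TextIOWrapper(io.BytesIO(SAMPLE), encoding="ascii",
                              newline=newline)
    return reader.readlines()

def written_bytes(newline):
    """Write two newline-terminated lines; return the bytes that reach the buffer."""
    sink = io.BytesIO()
    writer = io.TextIOWrapper(sink, encoding="ascii", newline=newline)
    writer.write("a\nb\n")
    writer.flush()
    return sink.getvalue()

translated = read_lines(None)           # universal newlines, translated to "\n"
untranslated = read_lines("")           # universal newlines, endings kept as-is
crlf_only = read_lines("\r\n")          # only "\r\n" terminates a line

unchanged = written_bytes("")           # no translation on output
as_crlf = written_bytes("\r\n")         # each "\n" becomes "\r\n"
```

Output with ``newline=None`` is omitted on purpose, since its result
depends on ``os.linesep`` and therefore on the platform.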

Another implementation, ``StringIO``, creates a file-like ``TextIOBase``
implementation without an underlying Buffered I/O object.  While
similar functionality could be provided by wrapping a ``BytesIO``
object in a ``TextIOWrapper``, the ``StringIO`` object allows for much
greater efficiency as it does not need to actually perform encoding
and decoding.  A String I/O object can just store the encoded string
as-is.  The ``StringIO`` object's ``__init__`` signature takes an
optional string specifying the initial value; the initial position is
always 0.  It does not support encodings or newline translations; you
always read back exactly the characters you wrote.
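
A short sketch of this behaviour, assuming ``io.StringIO`` from
current Python releases with its default arguments.  The
``.getvalue()`` method used at the end is a convenience of that
implementation and is not specified in this PEP.

```python
import io

stream = io.StringIO("caf\u00e9\r\n")   # optional initial value
start = stream.tell()                   # the initial position is 0
initial = stream.read()                 # characters come back unchanged

stream.write("na\u00efve\n")            # written at the current (end) position
everything = stream.getvalue()
```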


Unicode encoding/decoding Issues
--------------------------------

We should allow changing the encoding and error-handling
setting later.  The behavior of Text I/O operations in the face of
Unicode problems and ambiguities (e.g. diacritics, surrogates, invalid
bytes in an encoding) should be the same as that of the unicode
``encode()``/``decode()`` methods.  ``UnicodeError`` may be raised.
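
The intended equivalence with ``decode()`` can be sketched as
follows, assuming ``io.TextIOWrapper`` from current Python releases.
``read_with`` is a helper name invented for this illustration.

```python
import io

BAD = b"ok \xff"                        # 0xFF never appears in valid UTF-8

def read_with(errors):
    """Decode BAD through the text layer using the given error handler."""
    wrapper = io.TextIOWrapper(io.BytesIO(BAD), encoding="utf-8",
                               errors=errors)
    return wrapper.read()

try:
    read_with("strict")
    strict_raised = False
except UnicodeError:
    strict_raised = True

replaced = read_with("replace")
same_as_decode = replaced == BAD.decode("utf-8", errors="replace")
```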

Implementation note: we should be able to reuse much of the
infrastructure provided by the ``codecs`` module.  If it doesn't
provide the exact APIs we need, we should refactor it to avoid
reinventing the wheel.


Non-blocking I/O
================

Non-blocking I/O is fully supported on the Raw I/O level only.  If a
raw object is in non-blocking mode and an operation would block, then
``.read()`` and ``.readinto()`` return ``None``, while ``.write()``
returns 0.  In order to put an object in non-blocking mode,
the user must extract the fileno and do it by hand.

At the Buffered I/O and Text I/O layers, if a read or write fails due
to a non-blocking condition, they raise an ``IOError`` with ``errno`` set
to ``EAGAIN``.
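
The raw-level behaviour can be sketched with a pipe.  The example
assumes a POSIX system and current Python releases, where
``os.set_blocking()`` is one way to switch a descriptor to
non-blocking mode by hand.  Only the raw layer is shown.

```python
import io
import os

read_fd, write_fd = os.pipe()
os.set_blocking(read_fd, False)         # done "by hand" on the descriptor

raw = io.FileIO(read_fd, "r")
nothing_yet = raw.read(16)              # would block, so None rather than b""

os.write(write_fd, b"data")
available = raw.read(16)                # bytes are now waiting in the pipe

os.close(write_fd)
at_eof = raw.read(16)                   # writer closed, pipe drained: end of file
raw.close()
```

The distinction matters to callers: ``None`` means "try again
later", whereas an empty bytes object means no more data will ever
arrive.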

Originally, we considered propagating up the Raw I/O behavior, but
many corner cases and problems were raised.  To address these issues,
significant changes would need to have been made to the Buffered I/O
and Text I/O layers.  For example, what should ``.flush()`` do on a
Buffered non-blocking object?  How would the user instruct the object
to "Write as much as you can from your buffer, but don't block"?  A
non-blocking ``.flush()`` that doesn't necessarily flush all available
data is counter-intuitive.  Since non-blocking and blocking objects
would have such different semantics at these layers, it was agreed to
abandon efforts to combine them into a single type.


The ``open()`` Built-in Function
================================

The ``open()`` built-in function is specified by the following
pseudo-code::

    def open(filename, mode="r", buffering=None, *, 
             encoding=None, errors=None, newline=None):
        assert isinstance(filename, (str, int))
        assert isinstance(mode, str)
        assert buffering is None or isinstance(buffering, int)
        assert encoding is None or isinstance(encoding, str)
        assert newline in (None, "", "\n", "\r", "\r\n")
        modes = set(mode)
        if modes - set("arwb+t") or len(mode) > len(modes):
            raise ValueError("invalid mode: %r" % mode)
        reading = "r" in modes
        writing = "w" in modes
        binary = "b" in modes
        appending = "a" in modes
        updating = "+" in modes
        text = "t" in modes or not binary
        if text and binary:
            raise ValueError("can't have text and binary mode at once")
        if reading + writing + appending > 1:
            raise ValueError("can't have read/write/append mode at once")
        if not (reading or writing or appending):
            raise ValueError("must have exactly one of read/write/append mode")
        if binary and encoding is not None:
            raise ValueError("binary mode doesn't take an encoding arg")
        if binary and errors is not None:
            raise ValueError("binary mode doesn't take an errors arg")
        if binary and newline is not None:
            raise ValueError("binary mode doesn't take a newline arg")
        # XXX Need to spec the signature for FileIO()
        raw = FileIO(filename, mode)
        line_buffering = (buffering == 1 or buffering is None and raw.isatty())
        if line_buffering or buffering is None:
            buffering = 8*1024  # International standard buffer size
            # XXX Try setting it to fstat().st_blksize
        if buffering < 0:
            raise ValueError("invalid buffering size")
        if buffering == 0:
            if binary:
                return raw
            raise ValueError("can't have unbuffered text I/O")
        if updating:
            buffer = BufferedRandom(raw, buffering)
        elif writing or appending:
            buffer = BufferedWriter(raw, buffering)
        else:
            assert reading
            buffer = BufferedReader(raw, buffering)
        if binary:
            return buffer
        assert text
        return TextIOWrapper(buffer, encoding, errors, newline, line_buffering)
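
The layer selection made by the pseudo-code can be observed through
the built-in ``open()`` of current Python releases, which is assumed
here to follow the same rules.  ``layer_for`` is a helper name
invented for this illustration.

```python
import io
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

def layer_for(mode, **kwargs):
    """Open the scratch file and report the class of the object returned."""
    with open(path, mode, **kwargs) as f:
        return type(f)

results = {
    "rb": layer_for("rb"),
    "wb": layer_for("wb"),
    "r+b": layer_for("r+b"),
    "rb unbuffered": layer_for("rb", buffering=0),
    "r": layer_for("r"),
}

try:
    open(path, "r", buffering=0)        # text I/O always needs a buffer
    unbuffered_text_rejected = False
except ValueError:
    unbuffered_text_rejected = True

os.remove(path)
```

Binary modes stop at the buffered layer (or at the raw layer when
``buffering=0``), while text modes always add a ``TextIOWrapper`` on
top.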


Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: