Concerns on encodings

puzzlet created an issue

I see irc/client.py assumes all the packets are encoded in UTF-8.

But in reality, non-UTF-8 text is common: servers truncate PRIVMSGs by byte count, which can split a multi-byte character and leave a broken sequence, and some servers and channels still use their own local encodings instead of UTF-8.
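
To make the two failure modes concrete, here is a minimal sketch using only the Python standard library. The helper `truncate_bytes` and the sample strings are hypothetical illustrations, not taken from any server or from irc/client.py.

```python
# Hypothetical illustration of the two failure modes described above.

def truncate_bytes(data: bytes, limit: int) -> bytes:
    """Cut a payload at a byte limit, the way a server might, ignoring characters."""
    return data[:limit]

text = "héllo"
raw = text.encode("utf-8")       # 'h' is 1 byte, 'é' is 2 bytes (0xC3 0xA9)
cut = truncate_bytes(raw, 2)     # the cut lands in the middle of 'é'

try:
    cut.decode("utf-8")
    truncated_decodes = True
except UnicodeDecodeError:
    truncated_decodes = False    # strict decoding rejects the dangling lead byte

# A lenient decode substitutes U+FFFD for the broken sequence instead of failing.
lenient = cut.decode("utf-8", errors="replace")

# Text sent in a local 8-bit encoding is not valid UTF-8 either.
legacy = "café".encode("latin-1")   # b'caf\xe9'
try:
    legacy.decode("utf-8")
    legacy_decodes = True
except UnicodeDecodeError:
    legacy_decodes = False
```

In both cases a strict UTF-8 decode raises `UnicodeDecodeError`, which is why a client that always decodes as UTF-8 cannot handle such input.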

So I think the library should have an option for non-UTF-8 modes.

Comments (3)

  1. Jason R. Coombs

    By default, the IRC library attempts to decode all incoming streams as UTF-8, but I acknowledge that there are cases where decoding is undesirable or a custom decoding option is desirable. To support these cases, since irc 3.4.2, the ServerConnection class may be customized. The 'buffer_class' attribute on the ServerConnection determines what class is used for buffering lines from the input stream. By default it is DecodingLineBuffer, but it may be reassigned to another class, such as irc.client.LineBuffer, which does not decode the lines and passes them through as byte strings. The 'buffer_class' attribute may be assigned for all instances of ServerConnection by overriding the class attribute::

    irc.client.ServerConnection.buffer_class = irc.client.LineBuffer
    

    or it may be overridden on a per-instance basis (as long as it's overridden before the connection is established)::

    server = irc.client.IRC().server()
    server.buffer_class = irc.client.LineBuffer
    server.connect()
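
Once lines are passed through as byte strings, the decoding policy is up to the application. The following is a minimal sketch of one possible policy, assuming a hypothetical helper named `decode_line` that is not part of the irc library: try UTF-8 first, then a local encoding, then fall back to latin-1, which maps every byte and therefore never fails. The choice of cp1252 as the local encoding is also an assumption.

```python
# Hypothetical application-side helper; not part of the irc library.

def decode_line(raw: bytes, encodings=("utf-8", "cp1252")) -> str:
    """Decode a raw line by trying each encoding in order."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 assigns a code point to every byte value, so this cannot raise.
    return raw.decode("latin-1")

utf8_line = decode_line("café".encode("utf-8"))   # valid UTF-8, decoded as such
local_line = decode_line(b"caf\xe9")              # invalid UTF-8, decoded as cp1252
odd_line = decode_line(b"\x81")                   # undefined in cp1252, latin-1 fallback
```

The order of the encodings matters: UTF-8 goes first because arbitrary 8-bit text rarely happens to be valid UTF-8, whereas almost any byte string decodes "successfully" under an 8-bit encoding.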
    

    I've added a section to the README that documents these options.

    Does this interface provide the option you seek? If not, please re-open.

  2. puzzlet

    Thank you for the reply. It helped me a lot, but I've run into another problem, mainly because I'm using Python 3.

    The library mixes bytes and str in places, and when bytes is implicitly converted to str in Python 3, the result is the repr, such as "b'this'".
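
The implicit conversion can be reproduced with plain Python 3, without the library. This is a small sketch; the variable names and the PRIVMSG string are only for illustration.

```python
# Python 3: formatting bytes into a str uses the bytes repr, not the text.
name = b"this"

via_str = str(name)                      # "b'this'"
via_percent = "PRIVMSG %s" % name        # "PRIVMSG b'this'"
via_format = "PRIVMSG {}".format(name)   # "PRIVMSG b'this'"

# An explicit decode produces the intended text.
explicit = "PRIVMSG %s" % name.decode("utf-8")   # "PRIVMSG this"
```

No exception is raised in the implicit cases by default, so the malformed command is sent to the server silently.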

    We should explicitly choose one of the two string types, and I would recommend bytes. For example, RFC 1459 allows channel names to contain almost any sequence of bytes, so bytes is a suitable type. But once you do that, lines such as these become problematic:

    • In irc.client.is_channel(): string[0] in "#&+!"
    • In irc.client.ServerConnection.join(): "JOIN %s%s" % (channel, (key and (" " + key)))
    • NickMask(prefix) when a privmsg event has occurred
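
The reason these lines break is that indexing bytes yields an int in Python 3, and str formatting of bytes yields the repr. Below is a rough sketch of bytes-safe equivalents. The functions `is_channel_bytes`, `join_command` and `nick_of` are hypothetical stand-ins written for this comment, not the library's or the fork's actual code.

```python
# Hypothetical bytes-only counterparts of the three spots listed above.

first = b"#chan"[0]            # indexing bytes gives an int (35), not b"#"
try:
    first in "#&+!"            # int on the left of a str membership test
    membership_ok = True
except TypeError:
    membership_ok = False

def is_channel_bytes(name: bytes) -> bool:
    """Slice instead of index so the prefix stays a bytes object."""
    return name[:1] in (b"#", b"&", b"+", b"!")

def join_command(channel: bytes, key: bytes = b"") -> bytes:
    """Build the JOIN line by bytes concatenation instead of str formatting."""
    return b"JOIN " + channel + (b" " + key if key else b"")

def nick_of(prefix: bytes) -> bytes:
    """Take the nick from a b'nick!user@host' prefix using bytes separators."""
    return prefix.split(b"!", 1)[0]
```

Every literal that touches protocol data has to become a bytes literal, which is why the conversion ends up touching so much of the code.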

    So I'm trying to convert all the internal strings to bytes on my fork, in a fashion similar to what I did for irclib: https://github.com/puzzlet/python-irclib
