Add support for default_internal result transcoding

Issue #33 resolved
Michael Granger repo owner created an issue

Via Yehuda Katz in his blog article about encodings:

[I]t is still possible that some non-BINARY data sneaks over the boundary and into our Rails application from a non-UTF-8 source.

For this scenario, Ruby 1.9 provides an option called Encoding.default_internal, which allows the user to specify a preferred encoding for Strings. Ruby itself and Ruby’s standard libraries respect this option, so even if, for instance, it opens some IO encoded in ISO-8859-1, it will give the data to the Ruby program transcoded to the preferred encoding.

Libraries, such as database drivers, should also support this option, which means that even if the database is somehow set up to receive UTF-8 Strings, the driver should convert those Strings transparently to the preferred encoding before handing them to the program.

Rails can take advantage of this by setting the default_internal to UTF-8, which will then ensure that Strings from non-UTF-8 sources still make their way into Rails encoded as UTF-8.
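To illustrate the behaviour the quoted passage describes, here is a minimal sketch in plain Ruby (no database involved). The helper name `read_with_default_internal` is made up for this example; it temporarily sets Encoding.default_internal, reads a file that was written as ISO-8859-1, and restores the previous setting afterwards.

```ruby
require 'tempfile'

# Hypothetical helper for illustration: read a file whose bytes are in the
# `external` encoding while Encoding.default_internal is set to `internal`.
# The previous default_internal is restored before returning.
def read_with_default_internal(path, external, internal)
  previous = Encoding.default_internal
  Encoding.default_internal = internal
  # Only the external encoding is given; Ruby picks up default_internal
  # as the internal encoding and transcodes while reading.
  File.open(path, "r:#{external}", &:read)
ensure
  Encoding.default_internal = previous
end

Tempfile.create('latin1-sample') do |file|
  # "café" as ISO-8859-1 bytes (0xE9 is e-acute in that encoding).
  File.binwrite(file.path, "caf\xE9".b)

  text = read_with_default_internal(file.path, 'ISO-8859-1', Encoding::UTF_8)
  puts text.encoding
  puts text.bytes.inspect
end
```

The string handed to the program is tagged with the preferred encoding and its bytes have been converted, which is what a driver would be expected to do for values coming back from the database.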

I'll add this ASAP, and 0.9.1 will include support for it.

Comments (4)

  1. Michael Granger reporter
    • changed status to open

    There are two possible routes I see to implementing this:

    • Use Postgres's built-in automatic character-set conversion (i.e., call PQsetClientEncoding() on the connection if rb_default_internal_encoding() returns something other than NULL or Qnil)
    • or, do the conversion to the default_internal encoding when the value is fetched.

    I'm inclined to let Postgres do it. That way there won't be any chance of double-conversion, the SQL_ASCII/ASCII_8BIT situation is handled gracefully, and it follows the "error early" principle. The only possible downside I can see right now is that, if Postgres's encoding conversion table isn't as complete as Ruby's, a transcoding could fail in Postgres where it would succeed in other db drivers, or vice versa.
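For comparison, the second route (converting when the value is fetched) could be sketched in plain Ruby roughly as follows. This is not the driver's code; `transcode_fetched_value` is a hypothetical helper, and the raw bytes and client encoding are passed in by hand rather than read from a real connection.

```ruby
# Hypothetical helper for illustration: transcode one fetched value from the
# connection's client encoding to the preferred internal encoding.
def transcode_fetched_value(raw, client_encoding, internal = Encoding.default_internal)
  # Tag a copy of the raw bytes with the encoding the server sent them in.
  value = raw.dup.force_encoding(client_encoding)

  # No preferred encoding set: hand the value back as tagged.
  return value if internal.nil?

  # SQL_ASCII maps to ASCII-8BIT; the bytes have no known character set,
  # so they are left untouched.
  return value if client_encoding == Encoding::ASCII_8BIT

  # Raises Encoding::UndefinedConversionError (or
  # Encoding::InvalidByteSequenceError) at fetch time if the value
  # cannot be represented in the target encoding.
  value.encode(internal)
end

utf8_bytes = "caf\xC3\xA9".b
latin1 = transcode_fetched_value(utf8_bytes, Encoding::UTF_8, Encoding::ISO_8859_1)
puts latin1.encoding
puts latin1.bytes.inspect
```

Note that in this variant a failed conversion only surfaces when a particular value is fetched, whereas letting Postgres convert reports the problem on the server side, which is the "error early" point made above.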

  2. Si

    Ha! I was just reading that article and thinking of pg - then I saw your comment at the end of the post. Thanks for being on the ball with this! I'm being bitten by some UTF-8 issues and it's great to see this focus on getting things right for 1.9.2 / Rails 3.0.

  3. Michael Granger reporter

    Fixed (at least for the synchronous API) in 92cc211ef553.

    Since the connection is assumed to be managed by the caller in the async API, setting the encoding will need to be done manually as well (at least for now). I've added a comment to that effect, along with a snippet of how to set PostgreSQL's 'client_encoding' to the same as Encoding.default_internal.
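A rough sketch of what setting the encoding by hand might look like is shown here. It is not the snippet from the commit: `sync_client_encoding` and the `PG_ENCODING_NAMES` table are hypothetical, the table lists only a few names, and the helper accepts any object that responds to `set_client_encoding` so it can be tried without a live connection.

```ruby
# Hypothetical, deliberately incomplete mapping from Ruby encoding names to
# the names PostgreSQL uses for client_encoding.
PG_ENCODING_NAMES = {
  'UTF-8'      => 'UTF8',
  'ISO-8859-1' => 'LATIN1',
  'EUC-JP'     => 'EUC_JP'
}.freeze

# Hypothetical helper: point the connection's client_encoding at the same
# encoding as Encoding.default_internal. Returns the PostgreSQL encoding
# name that was requested, or nil if no internal encoding is set.
def sync_client_encoding(conn, internal = Encoding.default_internal)
  return nil if internal.nil?

  # Fall back to Ruby's own name for unmapped encodings (an assumption;
  # PostgreSQL may reject a name it does not recognise).
  pg_name = PG_ENCODING_NAMES.fetch(internal.name) { internal.name }
  conn.set_client_encoding(pg_name)
  pg_name
end

# With a real connection from the async API this would be called once the
# connection is established, for example (connection options are assumed):
#
#   conn = PG::Connection.connect_start(dbname: 'test')
#   # ... poll until the connection is ready ...
#   sync_client_encoding(conn)
```

Passing the connection in as an argument keeps the encoding decision with the caller, which matches the idea that the caller manages the connection in the async API.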
