"Invalid utf8 character string" warning on insert into binary-type columns with MySQL 5.6

Issue #3804 closed
Andrew Mason created an issue

I have encountered this issue trying to insert a binary UUID (i.e. uuid.uuid4().bytes) into both varbinary(16). I have also encountered this issue trying to insert binary data into a mediumblob column, so I suspect the issue is affecting all binary-type columns.

Note that MySQL 5.5 happily executes the insert, but MySQL 5.6 (and likely later versions) spew warnings.

This django user has submitted an issue against MySQL here. The purported solution is instead of doing

    INSERT INTO `user` (uuid)
    VALUES (%s)

to do

    INSERT INTO `user` (uuid)
    VALUES (_binary %s)

which will cause MySQL to treat the data as binary rather than utf8-encoded data.

The problem here is that MySQL is trying to validate the value being populated as utf8, and if it's not valid utf8, is throwing the warning. One might argue that this doesn't need to be done, since it's populating a binary column type, but since we're passing a string (and not casting it as a _binary), MySQL stands by it's warning.

I am using the python mysqldb for my DBAPI, SQLAlchemy 1.0.12, and (attempting to, at least, as noted above) MySQL 5.6.

Attached is a traceback where the issue occurs.

Comments (7)

  1. Mike Bayer repo owner

    hello -

    while this is a different MySQL driver, this issue looks very similar to https://bitbucket.org/zzzeek/sqlalchemy/issues/3291/problem-using-binary-type-with-foreign-key, which was upstream. The driver is aware that the data should be binary encoded because it is wrapped in a binary() object.

    So the likely resolution here is that this would be a driver bug upstream. But also, I can't yet accept this bug report for further evaluation until it provides required information:

    • driver in use (stack trace says MySQLdb but that can mean mysqlclient or mysql-python)

    • Python version

    • demonstration case (also known as an mcve). As it stands, I've no idea if you're using the ORM, the Core, using a SQL string statement, if SQLAlchemy is made aware of the datatypes or not, etc.

    If you can't provide these things, please re-post on the mailing list where the community can pitch in to produce a demonstration case. thanks!

  2. Ben Hoyt

    Just an FYI, I was having this same issue with SQLAlchemy==1.0.15 and mysqlclient==1.3.7 on Python 3.5, and fixed it by overriding the bytes encoder of the underlying DBAPI connection. Note that you also have to override encoders[str] because otherwise the latter calls the _binary bytes encoder. Here's the code:

    # Override the MySQLdb bytes encoder to prefix the literal with "_binary" to
    # avoid "Invalid utf8 character string" warnings from MySQL. The MySQLdb
    # library should do this, but it doesn't. See more info at:
    # https://bugs.mysql.com/bug.php?id=79317
    
    from MySQLdb import string_literal
    from sqlalchemy.pool import Pool
    import sqlalchemy.event
    
    def bytes_encoder(b, dummy=None):
        return b'_binary' + string_literal(b)
    
    def str_encoder(s, dummy=None):
        return string_literal(str(s).encode('utf8'))
    
    @sqlalchemy.event.listens_for(Pool, 'connect')
    def set_encoders(dbapi_conn, conn_record):
        dbapi_conn.encoders[bytes] = bytes_encoder
        dbapi_conn.encoders[bytearray] = bytes_encoder
        dbapi_conn.encoders[str] = str_encoder
    
  3. Ben Hoyt

    Sure, below is a tiny script that reproduces this. As mentioned, I'm using SQLAlchemy==1.0.15 and mysqlclient==1.3.7 on Python 3.5.2 on Linux, against MySQL server version 5.6.32-78.1.

    This is definitely a mysqlclient bug and not SQLAlchemy's problem as such (mysqlclient should do add the "_binary" prefix for bytes automatically), but I just posted this to show how I worked around the problem in our code. A fix was actually attempted in mysqlclient 1.3.8, but it had a pretty serious regression so they reverted it in 1.3.9. See https://github.com/PyMySQL/mysqlclient-python/pull/106

    Anyway, here's my test case:

    """Reproduces the "Invalid utf8 character string" warning when inserting a
    binary value that happens not to be valid UTF-8 (for example, a binary SHA-1).
    
    When run, this script will output a warning like the following:
    
    /opt/python/py35/lib/python3.5/site-packages/sqlalchemy/engine/default.py:450: Warning: Invalid utf8 character string: '8BE74F'
      cursor.execute(statement, parameters)
    
    Before running, you must create the "test" database with the following SQL:
    
        CREATE DATABASE test CHARACTER SET utf8;
    
    If, however, you add the set_encoders event listener above, binary strings are
    prefixed with "_binary" as MySQL likes them, and the warning goes away.
    """
    
    import warnings
    
    from sqlalchemy import Binary, Column, Integer, String
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker
    import sqlalchemy
    
    Base = declarative_base()
    
    class Thing(Base):
        __tablename__ = 'things'
    
        id = Column(Integer, primary_key=True)
        name = Column(String(20))
        sha1 = Column(Binary(20))
    
    engine = sqlalchemy.create_engine('mysql://root:password@localhost/test?charset=utf8')
    Base.metadata.create_all(engine)
    session_factory = sessionmaker(bind=engine)
    
    session = session_factory()
    new_thing = Thing(name='NAME', sha1=bytes.fromhex('118be74fffadb55d63843dc1175793ca3e7374a6'))
    session.add(new_thing)
    session.commit()
    session.close()
    
  4. Mike Bayer repo owner

    shrugs, I have 1.3.7 installed in my Py3.5 locally here and I don't get a warning. I am on Mariadb 10.1 though, not mysql.

  5. Chris Frey

    I also ran into this problem, but discovered that the mysqlclient-python "mysqldb" driver version 1.3.10 supports the connection option binary_prefix=True which will prepend _binary to bytearray data.

    It is disabled by default.

  6. Log in to comment