Python3.4 str input and password not encrypted as utf-8

Issue #20 resolved
Dražen Lučanin
created an issue

In Python 3.4.3, using scrypt version 0.7.1, if I do:

encoded = scrypt.encrypt('bla', 'bla') # so both parameters as Python3's unicode strings
encoded.decode('utf-8')

I get back:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-29-e6e7e367adc6> in <module>()
----> 1 encoded.decode('utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 16: invalid start byte

The docstring for scrypt.encrypt says:

Notes for Python 3:
  - `input` and `password` can be both str and bytes. If they are str
    instances, they will be encoded with utf-8
  - The result will be a bytes instance

So I would expect that I can decode it this way.

The only encoding that seems to succeed is latin1:

In [32]: encoded.decode('latin1')
Out[32]: "scrypt\x00\x12\x00\x00\x00\x08\x00\x00\x00\x02\x8b0\x06ôÎÌÌñ'4U$o\x1cS\x08¡³Àß:J\x1cDëYB-\x04UÌØB.S\x80\x1c\x1f\x0e}íñVO`t\x8bð\x82Ò\x7f)òúiDÉm¢SeDD\x9f\x93÷¼\x9c\x13RT»>FEÄ,P\x968\x98m'\x84ÝçGÎ\x9cç÷ö\x08µ\x84ÑZ\x17r9º!9Êø,¿°\x1egã5-&\x87"
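
For illustration, the same behaviour can be reproduced without scrypt installed. This is a minimal sketch that assumes only the first 20 bytes of the output shown above; the variable names are made up for the example. It shows that UTF-8 rejects byte 0x8b at position 16, while latin1 maps every byte value to a code point and therefore always decodes and round-trips:

```python
# First 20 bytes of the output shown above: the ASCII "scrypt" preamble
# followed by binary data. Used here as a stand-in for a real scrypt result.
blob = b"scrypt\x00\x12\x00\x00\x00\x08\x00\x00\x00\x02\x8b0\x06\xf4"

try:
    utf8_text = blob.decode("utf-8")
    error_position = None
except UnicodeDecodeError as exc:
    utf8_text = None
    error_position = exc.start  # index of the first undecodable byte

# latin1 assigns a character to each of the 256 byte values, so decoding
# never fails and encoding the result gives back the original bytes.
latin1_text = blob.decode("latin1")
round_trip_ok = latin1_text.encode("latin1") == blob
```

That latin1 succeeds does not mean the data is latin1 text; it only means latin1 never rejects a byte.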

Comments (3)

  1. Dražen Lučanin reporter

    As a workaround for my case, something like this lets me store the hashed password in the database as a unicode string.

    In [62]: data = scrypt.encrypt('a secret message', 'password', maxtime=0.1).decode('latin1')
    
    In [63]: data
    Out[63]: "scrypt\x00\r\x00\x00\x00\x08\x00\x00\x00\x01smÒI?Üß02©lNVn¬¿ªÔ\x19õ\x7fàcùj\x1aüë\x1f!\x9fPÞD¤³V¿Va[¦\x03©\\n=ÃÜñy¬3]5ûJ'Ãb¬ÕÎØ\x98Ð\x02åÓBÜ\x98\x86E'л\x07àtÐ:ænS;|\x0c,\x80\x8b¼µðâ³+\x97Ê\x06`!ÌÁ½ÑùRûY¤×6çÝgÝ'\x10ÛÅ\x85t\x94\x07\x01ó"
    
    In [64]: scrypt.decrypt(str.encode(data, 'latin1'), 'password', maxtime=0.1)
    Out[64]: 'a secret message'
    

    I'd rather have some control over the encoding, though, than have it implicitly decided for me. Even if I pass input and password as utf-8 encoded bytes, the result can still only be decoded with 'latin1'.

    scrypt.encrypt('a secret message'.encode('utf-8'), 'password'.encode('utf-8'), maxtime=0.1).decode('latin1')
    
  2. Dražen Lučanin reporter

    Actually, to avoid issues with Django or the DB eating up some characters, I now base64-encode the bytes returned.

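    import base64
    import string
    from random import choice

    import scrypt

    def generate_password(length=255):
        chars = string.ascii_letters + string.digits
        return ''.join(choice(chars) for _ in range(length))

    # Encrypt random data using the user's password as the key, then
    # base64-encode the bytes and store them as ASCII text.
    # `datalength` is assumed to be defined elsewhere.
    user.password = base64.b64encode(scrypt.encrypt(
        generate_password(datalength), user.password
    )).decode('ascii')
    user.save()

    # And later, to verify it (the function name is only illustrative)...

    def verify_password(user, guessed_password):
        try:
            scrypt.decrypt(base64.b64decode(user.password), guessed_password)
            return True
        except scrypt.error:
            return False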
    

    More info in this SO answer.

  3. Magnus Hallin repo owner

    This is actually expected behaviour - the returned data should be considered random (with the exception of the scrypt preamble) and not decodable using any regular string encoding. It is not Unicode!

    If you need to store it somewhere that's Unicode aware (e.g. varchar in a database), then you need to encode it to e.g. base64 as you say. Another solution would be to change the database column to e.g. Postgres' bytea, which just stores raw bytes.
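
A minimal sketch of the base64 approach, assuming the value goes into a text column. The helper names `to_text` and `from_text` are hypothetical, and `os.urandom` stands in for the bytes that `scrypt.encrypt` would return:

```python
import base64
import os

def to_text(raw: bytes) -> str:
    """Encode arbitrary bytes as ASCII-only text that is safe for a varchar column."""
    return base64.b64encode(raw).decode("ascii")

def from_text(text: str) -> bytes:
    """Recover the original bytes from the stored base64 text."""
    return base64.b64decode(text.encode("ascii"))

# Stand-in for scrypt output: the preamble plus random bytes (not real scrypt data).
raw = b"scrypt\x00" + os.urandom(64)

stored = to_text(raw)          # what would be written to the database
restored = from_text(stored)   # what would be passed to scrypt.decrypt
```

Base64 output is roughly a third larger than the raw bytes, so the column needs to be sized accordingly; a raw-bytes column such as bytea avoids that overhead.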
