shannon entropy lib code

Issue #48 closed
Former user created an issue

(Imported from Google Code)

Thomas.J.Waldmann wrote:

> What features would the enhancement add?

calculation of shannon entropy (bits/char [float] as well as whole-word total entropy [int])

> What parts of the project would this effect?

None.

It is just for sw developers so they can display the total entropy to users when they define a new password or use it in a password validator.

It's not a perfect or great solution (as there can be none, see docstring), but maybe sucks less and is more well-defined as primitive or made-up password quality metrics quick hacks usually invented out there for these purposes.

> Please provide any additional information below.

I have some code to contribute (if you are interested), see the codereview there:

https://codereview.appspot.com/8706044

Comments (6)

  1. Eli Collins repo owner
    • assigned issue to
    • changed status to open
    • changed type to enhancement
    • changed Milestone to 1.7

    (Imported from Google Code)

    [my apologies ahead of time for the length of this response]

    That'd fit in wonderfully with a password-generation module I'm working on for the next release (I should really get that branch cleaned up and pushed to the public passlib repo). As far as your code goes, it looks good, and I'm definitely interested in contributions. That said, I'd already been thinking about adding a password-strength estimator to that module, and my main problem is that like you stated, it's hard (if not impossible) to find a good solution, particularly because of the context-dependant nature of entropy calculations.

    I've been doing idle reading on the topic on and off for a while, and I feel like every algorithm I came across would grossly over- or under- estimate the strength of select passwords. This would in turn give users either a false sense of security, or cause them to ignore the warning entirely because they thought the measurement was wrong. If this was compounded by the application *forcing* it's assumptions on the user, it just made things worse.

    In particular: a direct Shannon entropy calculation is certainly the most transparent way of calculating password strength, implementable without the need for case studies and complex language- and charset- dependant metrics. But I'm uncomfortable particularly because of that property... I feel like it gives a false sense of security, since for example it would report "12345678" and "ydc4wZFQ" as being equally strong.

    Not that I've had or found any better ideas (and this paper on guessing entropy made me much less hopeful of any good prospects -- http://reusablesec.blogspot.com/2010/10/new-paper-on-password-security-metrics.html). But I wanted to lay out why I'm hesitant to add this (or any) strength measurement code, until I feel more confident of how good or bad it's recommendations will be :)

    Which does finally lead me to one question about your code -- how did you arrive at the "weak" and "strong" entropy levels? For my part, the main guideline I've been looking at is NIST 800-63 (http://csrc.nist.gov/publications/nistpubs/800-63/SP800-63V1_0_2.pdf), but that paper attaches so many other security assumptions I'm not sure if it's entropy recommendations are actually worthwhile for this case.

  2. Former user Account Deleted

    (Imported from Google Code)

    Thomas.J.Waldmann wrote:

    some notes:

    a) a guy who is much more deep into crypto than I am noted that I am NOT computing shannon entropy with the original code (because that always depends on the generating system, which we do not know enough about). this is why I updated the codereview accordingly.

    b) the current code can likely only tell when a password is weak or maybe-weak. it can't really tell when it is strong. so maybe i should rename strong_limit to maybe_weak_limit (that would be also more consistent with the weak_limit). due to that, it can not be used as only metrics for deciding about strong passwords, but maybe it can be used to reject some very weak passwords and warn about maybe-still-too-weak passwords. I think it fits for the goal to be better than just len(p) as metrics.

    c) those levels defined in the function header were just made up after some experimentation. i threw some pwgen passwords at it and adjusted so they aren't usually classified as maybe-still-to-weak.

  3. Eli Collins repo owner
    • changed status to open

    (Imported from Google Code)

    I hope you don't mind, I've gone ahead and merged your code into the passlib.pwd module. Happy to accept more patches if you're still revising it though. Everything should be in the main passlib hg repo (as well as my bitbucket mirror, which might be easier to fork -- https://bitbucket.org/ecollins/passlib).

    I'm hoping to get some more time by this weekend to work on the password strength estimation some more, I've got a few more algorithms I'm going to try to work in, and some related API changes.

    Thanks for the patch!

    - Eli

    re: a) My understanding was that what strength() calculates *is* a form of Shannon entropy: the "self information" entropy, measuring the minimum number of bits of information that need to be conveyed to transmit the string, without contextual knowledge of it's generator. But my knowledge of information theory is far far from perfect... I'm going to go study a good information theory textbook before the next passlib release, and make sure everything is correct :)

    re: b) I like the renaming away from "strong", it's definitely more accurate this way. I think the more general problem is that the current strength() test (as well as others I've seen) don't report an exact classification, but rather a potentially open-ended range with lower and upper bounds... e.g. "2" means "maybe-weak or better". I'm hoping that after implementing a few more measurement algorithms, a more generalized representation will pop out.

  4. Former user Account Deleted

    (Imported from Google Code)

    Thomas.J.Waldmann wrote:

    just a quick note about b): i wanted a discrete value with not too much possible values so it can be rendered e.g. on a web ui using colors or stars or whatever. if one needs a float, one can call the inner function.

  5. Eli Collins repo owner

    Getting this right is tricky, and I think zxcvbn, or one of it's python ports, is a better overall approach.

    So closing this issue (at least for now, unless some new motivation comes to light)

  6. Log in to comment