#27 Declined
Repository: indratalip (branch: default)
Repository: jython (branch: default)

Make SRE_STATE cache the code points for a given PyString

Author: Indra Talip

Description

Caching the computed code points from the PyString improves performance for applications that run (possibly different) regexes repeatedly on the same PyString object.

In particular, this drastically improves the performance of the HTMLParser module used by BeautifulSoup, bringing the parse time for http://www.fixprotocol.org/specifications/fields/5000-5999 down from 500+ seconds to ~6 seconds.

The SRE_STATE code point cache is built using a cache spec that is sourced from the registry and thus is configurable by Jython users.
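
The idea can be illustrated with a minimal Python sketch. This is not the patch itself: the class name `CodePointCache`, its `code_points` method, and the `conversions` counter are hypothetical, and a plain dict keyed by the string stands in for whatever cache SRE_STATE actually uses.

```python
class CodePointCache:
    """Hypothetical stand-in for the code point cache held by SRE_STATE."""

    def __init__(self):
        self._cache = {}
        self.conversions = 0  # counts how often the expensive conversion runs

    def code_points(self, text):
        """Return the code points of `text`, converting only on a cache miss."""
        cached = self._cache.get(text)
        if cached is None:
            # Cache miss: pay the conversion cost once and remember the result.
            self.conversions += 1
            cached = tuple(ord(ch) for ch in text)
            self._cache[text] = cached
        return cached
```

Running several different regexes against the same string would then call `code_points` repeatedly but convert only once, which is the effect described above.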


Comments (9)

  1. Jim Baker

    I really like this solution as an interim fix, as I mentioned on jython-users. It's extremely simple and can be readily reversed when we have a proper solution that removes the need for toCodePoints.

    So it looks good to me, once this is supplemented with a maximum weight and a weight that is measured as the length of the cached codepoints array.

  2. Indra Talip author

    I've added a commit that uses a weigher by default and specifies a maximum weight of 10M. The weigher can be disabled by removing maximumWeight from the registry setting, which lets folks use maximumSize instead if desired, or simply no weigher.

  3. Jim Baker

    After "merging" another PR, which created unnecessary history, it looks like "declining" is the best approach, so I'm doing that now. We will have this cleaned up shortly, now that we have a better sense of how to manage Bitbucket mirroring given that the actual repo lives at hg.python.org.
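
The maximum weight discussed in the comments above can be sketched roughly as follows, assuming each entry weighs as much as the length of its code points array. The class `WeightedCodePointCache` and its members are hypothetical, and the least-recently-used eviction order is an assumption made for the sketch, not something the thread specifies.

```python
from collections import OrderedDict


class WeightedCodePointCache:
    """Hypothetical cache bounded by total weight rather than entry count."""

    def __init__(self, maximum_weight):
        self.maximum_weight = maximum_weight
        self.total_weight = 0
        self._entries = OrderedDict()  # oldest entry first

    def get(self, text):
        if text in self._entries:
            # Cache hit: mark the entry as most recently used.
            self._entries.move_to_end(text)
            return self._entries[text]
        points = tuple(ord(ch) for ch in text)
        weight = len(points)  # the weigher: one unit per code point
        if weight > self.maximum_weight:
            # An entry heavier than the whole budget is never stored.
            return points
        self._entries[text] = points
        self.total_weight += weight
        # Evict oldest entries until the total fits the budget again.
        while self.total_weight > self.maximum_weight:
            _, evicted = self._entries.popitem(last=False)
            self.total_weight -= len(evicted)
        return points

    def __contains__(self, text):
        return text in self._entries
```

With a budget of 10, caching three 4-character strings in turn evicts the first one, so one very long string cannot pin an unbounded amount of memory the way a pure entry-count limit would allow.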