Issue #189 resolved

regular expression term support

Thomas Waldmann
created an issue

I know regex support doesn't fit very well with indexed search (it has the same issue as wildcard support, only more complex), but for some scenarios it would be nice to have.

Even if slow, it would be better than not supporting it at all - especially for projects using whoosh that had a slow non-indexed search before and rely on regex search in either their code or even in user content (like a wiki page that has a search macro call with a regex search term).

whoosh.query.Wildcard is doing a similar thing, so I guess it could be implemented in a similar way and even an implementation without any "prefix" optimization would help.

I guess it would also need something like a RegexPlugin for the qp, but I'm not sure what a query should look like so that it best fits into whoosh.

For moin 1.x we had "foo.bar" as normal search term and "regex:foo.bar" for regex search (regex: could be shortened as long as it did not get ambiguous, e.g. re:).

Comments (7)

  1. Matt Chaput repo owner
    • changed status to open

    Since Wildcard is already doing a regular expression search (using the regex object produced by the glob function), it might be easy to just add some kind of use_regex option to Wildcard.
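As a rough illustration of the "glob becomes a regex object" idea, here is a minimal sketch. It assumes the standard library's `fnmatch.translate` as a stand-in for the glob function mentioned above; Whoosh's own helper is not shown in this thread and may behave differently.

```python
import fnmatch
import re

# Stand-in for the glob-to-regex step: translate a wildcard pattern
# into a regular expression string, then compile it.
glob_rx = re.compile(fnmatch.translate("foo*bar"))

# The compiled object is then tested against candidate terms,
# exactly as a user-supplied regex would be.
candidates = ["foobar", "fooxbar", "foobarx", "bar"]
glob_hits = [t for t in candidates if glob_rx.match(t)]
```

Once the pattern is a compiled regex, it makes no difference to the matching loop whether it came from a glob or was written as a regex directly.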

    regex:foo.bar wouldn't work as a general thing since that's the syntax for fields, although it would be possible with a custom parser plugin. Actually... a general "pseudo field that actually runs a transform function" plugin would be really cool. In addition to that, though, I'd probably need to add a special syntax using a specific Regex parser plugin.

    Possible syntaxes:

    `foo.bar`
    //foo.bar//
    r"foo.bar"
    

    But yes, it will be slow :). Since Python's regex library doesn't have a method for introspecting the regex objects (e.g. to find any literal prefix), it won't be possible to narrow the search... every regex will have to be tested against every term in the field.
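The "test every term" approach can be sketched as follows. This is only an illustration: `expand_regex` and the plain list of terms are hypothetical stand-ins for the field's term dictionary, and whole-term matching via `fullmatch` is an assumption about the desired semantics.

```python
import re

def expand_regex(pattern, terms):
    """Return every term that the pattern matches in full.

    Nothing narrows the search: the cost is one regex test per term
    in the field, regardless of what the pattern looks like.
    """
    rx = re.compile(pattern)
    return [t for t in terms if rx.fullmatch(t)]

# Hypothetical term list standing in for the indexed terms of one field.
field_terms = ["bar", "foo", "foo.bar", "foobar", "fooxbar"]
matches = expand_regex(r"foo.bar", field_terms)
```

The matching terms would then be combined into the final query, much like an expanded wildcard.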

  2. Thomas Waldmann reporter

    use_regex: sure, it can be done with a flag, but then optimizing it from the outside (like detecting a possible Prefix by looking at the regex string) wouldn't be as clean as in a separate class.

    E.g. if someone used foo.* it is quite simple to see that "foo" contains nothing special, so the query can be optimized to a Prefix term.

    speed: it will likely still be faster than our simple iterate-over-every-object-in-backend-storage-and-throw-regex-at-it approach in moin 1.9, because a whoosh index likely has much less overhead.
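A minimal sketch of that kind of outside-in prefix detection, working only on the regex string. The helper names `literal_prefix` and `regex_terms` are hypothetical, and the rules are deliberately conservative: any alternation disables the optimization, and a character followed by an optional quantifier is not counted as part of the prefix.

```python
import re

SPECIAL = set(".^$*+?{}[]\\|()")

def literal_prefix(pattern):
    """Return a literal prefix that every match must start with, or ""."""
    if "|" in pattern:
        # With alternation, the leading characters belong to one branch only.
        return ""
    i = 0
    while i < len(pattern) and pattern[i] not in SPECIAL:
        i += 1
    # "*", "?" and "{" can make the preceding character optional
    # ("fo*" matches "f"), so drop that character from the prefix.
    if 0 < i < len(pattern) and pattern[i] in "*?{":
        i -= 1
    return pattern[:i]

def regex_terms(pattern, terms):
    """Skip terms without the prefix, then run the regex on the rest."""
    prefix = literal_prefix(pattern)
    rx = re.compile(pattern)
    return [t for t in terms if t.startswith(prefix) and rx.fullmatch(t)]

hits = regex_terms(r"foo.*", ["bar", "foo", "foobar", "fob"])
```

In a real index the prefix would presumably be used to seek into the sorted term list rather than to filter a Python list, but the narrowing idea is the same.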

  3. Matt Chaput repo owner

    Consider the regular expression

    abercrombie|fitch

    In this case you can't find a literal prefix by taking all the characters up to the first special character.
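To make the counterexample concrete, the sketch below compares a full scan with a scan narrowed by the naive "everything before the first special character" prefix. The term list is made up for illustration.

```python
import re

# Hypothetical indexed terms.
terms = ["abercrombie", "aberdeen", "fitch", "fitchburg"]
rx = re.compile(r"abercrombie|fitch")

# Correct result: test the regex against every term.
full_scan = [t for t in terms if rx.fullmatch(t)]

# Naive optimization: take the characters before the first special
# character ("|") as a required prefix and skip everything else.
naive_prefix = "abercrombie"
narrowed = [t for t in terms if t.startswith(naive_prefix) and rx.fullmatch(t)]

# The second branch of the alternation is lost by the narrowed scan.
missed = [t for t in full_scan if t not in narrowed]
```

Here the narrowed scan silently drops "fitch", so any prefix detection based on the regex string has to recognize alternation and fall back to the full scan.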
