journal filtering doesn't work for repository name with group

Issue #261 resolved
Katsunori FUJIWARA
created an issue

Journal filtering with a condition "repository:foo/bar" can't pick up entries for repository "foo/bar" (= repository "bar" belonging to group "foo").

AFAIK, the Whoosh library (at least 2.5.7) unintentionally treats "repository:foo/bar" as "repository == foo AND repository == bar".

You can reproduce this unexpected(?) behavior of the Whoosh library with the Python script below. It is the minimal part of Kallithea that builds a query from a filter condition.

from whoosh.fields import (
    TEXT, ID, STORED, NUMERIC,
    BOOLEAN, Schema, FieldType, DATETIME
)
from whoosh.qparser.dateparse import DateParserPlugin
from whoosh.qparser.default import QueryParser

# from kallithea/lib/indexers/__init__.py
JOURNAL_SCHEMA = Schema(
    username=TEXT(),
    date=DATETIME(),
    action=TEXT(),
    repository=TEXT(),
    ip=TEXT(),
)

# from _journal_filter()@kallithea/controllers/admin/admin.py
qp = QueryParser('repository', schema=JOURNAL_SCHEMA)
qp.add_plugin(DateParserPlugin())

# with a wildcard: treated as prefix matching
print(repr(qp.parse(u'repository:foo/bar*')))
# without a wildcard: treated as AND-ed conditions
print(repr(qp.parse(u'repository:foo/bar')))

BTW, if you have no group whose name ends with "foo", the suffix match "repository:*foo/bar" can serve as a quick workaround.
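
The caveat about group names can be illustrated without Whoosh. The sketch below uses Python's standard `fnmatch` module as a stand-in for wildcard matching; this is an assumption for illustration only, not how Whoosh evaluates wildcard queries, and the repository names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical repository names; "myfoo" is a group whose name ends with "foo".
repositories = ["foo/bar", "myfoo/bar", "foo/baz", "bar"]


def suffix_match(pattern, names):
    """Return the names matched by a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# The leading "*" also matches the empty string, so "foo/bar" itself is
# picked up, but so is any repository in a group ending with "foo".
matched = suffix_match("*foo/bar", repositories)
print(matched)
```

Under these assumptions the suffix pattern picks up "myfoo/bar" as well as "foo/bar", which is why the workaround is only safe when no such group exists.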

Comments (6)

  1. Katsunori FUJIWARA reporter

    Today I found the reason why "repository:foo/bar" is unintentionally treated as AND-ed conditions.

    • QueryParser.parse() uses WordNode to represent the "foo/bar" part of the query
    • WordNode causes the text to be tokenized before the query is evaluated
    • the "/" in "foo/bar" is treated as a delimiter
    • the split terms "foo" and "bar" are AND-ed, because the search term "foo bar" means "foo AND bar" by default
    • the use of WordNode for field values in a query is hard-coded in QueryParser.parse()

    (Passing debug=True to QueryParser.parse() shows the intermediate parsing steps.)

    Therefore, there is no quick way (e.g. passing a boolean flag) to avoid this issue, IMHO.
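
The steps listed above can be sketched in plain Python. This is only a rough model of the behavior described: the regular expression is an assumed stand-in for the analyzer of a TEXT field, and `tokenize` and `build_conditions` are hypothetical helpers, not Whoosh functions.

```python
import re

# Assumption: a simple word pattern standing in for the TEXT field analyzer.
WORD = re.compile(r"\w+")


def tokenize(value):
    """Split a field value into word tokens; "/" acts as a delimiter."""
    return WORD.findall(value)


def build_conditions(field, value):
    """Build one condition per token, joined with AND (the default operator)."""
    return " AND ".join("%s:%s" % (field, token) for token in tokenize(value))


conditions = build_conditions("repository", "foo/bar")
print(conditions)  # repository:foo AND repository:bar
```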

    On the other hand, I also found that the only difference between WordNode and TextNode (the base class of the former) is whether "tokenize" (tokenizing the specified text) and "removestops" (removing stop words) are enabled.

    Then I tried implementing a custom filter plugin that disables "tokenize" and "removestops" on each WordNode before the query is evaluated, as shown below (I followed the implementation of WhitespacePlugin).

    from whoosh.qparser import plugins, syntax
    
    class TextnizeFilter(plugins.Plugin):
        """Filter to make WordNode in syntax tree work like TextNode
        """
        def filters(self, parser):
            return [(self._textnize, 600)]
    
        def _textnize(self, parser, group):
            newgroup = group.empty_copy()
            for node in group:
                if isinstance(node, syntax.GroupNode):
                    # Textnize recursively
                    newgroup.append(self._textnize(parser, node))
                else:
                    if isinstance(node, syntax.WordNode):
                        # disabling these makes WordNode work like
                        # TextNode, because enabling these features
                        # is the only difference between WordNode
                        # and TextNode
                        node.tokenize = False
                        node.removestops = False
                    newgroup.append(node)
            return newgroup
    
    qp.add_plugin(TextnizeFilter())
    

    At least for "Journal filtering", nobody should expect a field value to be tokenized and treated as AND-ed conditions: the valid fields in the "Journal filtering" schema are "username", "date", "action", "repository", and "ip", and AND-ed conditions on any of these fields never match any journal entry.
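
Why AND-ed conditions on one of these fields can never match is easy to check with a toy model. The entries and the `match_all` helper below are hypothetical and only mimic equality tests on a single-valued field; they are not Kallithea's actual query code.

```python
# Hypothetical journal entries; each entry has exactly one value per field.
entries = [
    {"username": "alice", "repository": "foo/bar"},
    {"username": "bob", "repository": "foo"},
    {"username": "carol", "repository": "bar"},
]


def match_all(entries, field, terms):
    """Keep entries whose field equals every term (AND-ed equality)."""
    return [e for e in entries if all(e[field] == t for t in terms)]


# Tokenized terms: no single value equals both "foo" and "bar".
tokenized = match_all(entries, "repository", ["foo", "bar"])
# Untokenized term: the entry for "foo/bar" is found.
untokenized = match_all(entries, "repository", ["foo/bar"])
print(len(tokenized), len(untokenized))
```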

    And a QueryParser instance is re-created for each journal filtering request.

    Therefore, disabling "tokenize" and "removestops" with a custom plugin shouldn't cause any problem for "Journal filtering", IMHO.

    I confirmed that, with this plugin added, QueryParser treats "repository:foo/bar" as "repository == 'foo/bar'", as expected. But I would really like this solution to be reviewed by a Whoosh guru, because I'm a newbie with Whoosh.

  2. Mads Kiilerich

    Nice analysis!

    The help text briefly describes what the query language looks like. That makes it quite clear that it is a "query language" that gives exact results rather than a "natural language query" that tries to guess what the user meant. Thus I don't think that automatic stemming is a good thing. Disabling it as you do is good.

    We haven't touched whoosh much in Kallithea. I think you just became our guru :-) I think we should use your approach if it works for you and you prepare a PR with some test coverage.

    (I never grokked or used filtering much. It seems a bit magic and quoting and escaping doesn't work as I would expect it to.)

  3. Katsunori FUJIWARA reporter

    During additional investigation, I found another (and simpler) solution for this issue!

    diff -r 16234f629cfb -r 09709840bc18 kallithea/lib/indexers/__init__.py
    --- a/kallithea/lib/indexers/__init__.py
    +++ b/kallithea/lib/indexers/__init__.py
    @@ -79,10 +79,10 @@
    
     # used only to generate queries in journal
     JOURNAL_SCHEMA = Schema(
    -    username=TEXT(),
    +    username=ID(),
         date=DATETIME(),
         action=TEXT(),
    -    repository=TEXT(),
    +    repository=ID(),
         ip=TEXT(),
     )
    

    Using ID instead of TEXT solves this issue easily, because an ID field is not tokenized when the query is built from the search terms.

    BTW, an ID field also avoids removing stop words. This is useful if a user has a throwaway repository name such as "a", "do", or "does".

    You'll notice that an ID field also avoids lowercasing. But this doesn't violate the current case-insensitive search policy, because LOWER-ing in the actual SQL query is done by get_filterion() or so in kallithea/controllers/admin/admin.py.
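
The three differences between the field types (tokenization, stop-word removal, lowercasing) can be compared side by side in a small sketch. Everything here is an assumption made for illustration: `text_like` and `id_like` are hypothetical helpers that only imitate the described behavior, and the stop-word list is a made-up sample, not Whoosh's real list.

```python
import re

# Assumption: a tiny stop-word list for illustration only.
STOP_WORDS = frozenset(["a", "an", "the", "of"])


def text_like(value):
    """TEXT-style analysis: tokenize, lowercase, drop stop words."""
    tokens = [t.lower() for t in re.findall(r"\w+", value)]
    return [t for t in tokens if t not in STOP_WORDS]


def id_like(value):
    """ID-style analysis: keep the whole value as a single term."""
    return [value]


for name in ["foo/bar", "a", "Foo/Bar"]:
    print(name, text_like(name), id_like(name))
```

In this model a repository named "a" yields no search term at all under TEXT-style analysis, while the ID-style version keeps every name intact, including its case.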

    This replacement should be safe with an already existing DB instance, because:

    • the JOURNAL_SCHEMA definition in kallithea/lib/indexers/__init__.py is used only to parse search terms and to issue the actual query
    • the actual definition of the DBMS table "user_logs" itself is in kallithea/model/db.py (as the UserLog class)

    Using ID for "username" also avoids similar issues when searching with a "username:" prefixed condition.
