postgresql full-text search on `match` should accept multiple parameters

Issue #3078 resolved
jvanasco created an issue

I wanted to note a deficiency in the support for full-text search.

the various tsvector and tsquery commands accept 1 or 2 arguments

( string ) ( dictionary_type, string )

as in - http://www.postgresql.org/docs/9.1/static/functions-textsearch.html

while it possible to explicitly search with a to_tsvector()/to_tesquery that accepts a configuration argument ( as in https://bitbucket.org/zzzeek/sqlalchemy/issue/2800/unable-to-create-tsvector-functional-index ), it is not possible to sneak this parameter into the match command.

ie:

search_query = 'dogs & cats'
sqlalchemy.func.to_tsquery( text("'english'"), MyTable.mycolumn )\
.match( sqlalchemy.text("'english'"), search_query )

the target sql would be:

to_tsvector('english', mytable.mycolumn) @@ to_tsquery('english', 'dogs & cats')

The effects of this behavior:

  • loss of control in how the actual matching occurs
  • vectors may need to be computed in real-time, rendering indexes or precomputed values useless.

Comments (17)

  1. Mike Bayer repo owner

    OK well if you are PG specific you'd need to use tsvector() / tsquery() explicitly anyway, right? what's the alternative, having MATCH parse the string for operators? we'd need a full parser for that.

  2. jvanasco reporter

    in order to do an optimized search using an index, it would be a postgresql specific query already - one would need to use tsvector on the column.

    maybe there could be a PG specific ts_match function, that behaves like match, but just accepts 2 parameters.

    just to be clear -- the only change I'm requesting is for match to be able to emit:

     @@ to_tsquery('english', 'dogs & cats')
    

    as it now can only emit:

    @@ to_tsquery('dogs & cats')

    i'm not sure how this affects querying or planning with tsv columns; I haven't tried that yet.

    the difference against a tsvector gin index though are huge - a factor of 1000 between a sequential and index scan on a moderately small table of a few million rows.

    raw sql:

    -- 46.439 ms SELECT title FROM indexed WHERE to_tsvector('english', title) @@ to_tsquery('english', 'dog') ;

    -- 46954.893 ms SELECT title FROM indexed WHERE to_tsvector('english', title) @@ to_tsquery('dog') ;

  3. Mike Bayer repo owner

    im just looking at the docs for this. I guess you are saying:

    func.to_tsvector('fat cats ate rats').match('cat & rat')
    

    so....right now you are just saying...

    func.to_tsvector('english', 'fat cats ate rats').op('@@')(func.to_tsquery('english', 'cat & rat'))
    

    So not hard to do for now, looking at http://www.postgresql.org/docs/9.1/static/textsearch-controls.html I guess you just want postgresql_regconfig as a kwarg, see attached patch:

    from sqlalchemy.dialects import postgresql
    from sqlalchemy.sql import column, func
    
    expr = func.to_tsvector('english', 'fat cats ate rats').match('cat & rat', postgresql_regconfig='english')
    print expr.compile(dialect=postgresql.dialect())
    
  4. jvanasco reporter

    Yeah, that's exactly it. It's doable in sqlalchemy now with lots of func/op calls (anything is). but it's ugly. the kwarg is a nice approach.

    one question

    you implemented it in the patch as...

    binary.modifiers['postgresql_regconfig'],
    

    but looking elsewhere on ( at visit_notilike_op_binary, specifically ) the code has (essentially) self.render_literal_value(binary.modifiers["escape"], sqltypes.STRINGTYPE)

    are the binary.modifiers already cleaned, or should the kwarg be processed for security/other concerns like this :

    def visit_match_op_binary(self, binary, operator, **kw):
        if "postgresql_regconfig" in binary.modifiers :
            regconfig = self.render_literal_value( binary.modifiers['postgresql_regconfig'] )
            if regconfig :
                return "%s @@ to_tsquery('%s', %s)" % (
                            self.process(binary.left, **kw),
                            regconfig,
                            self.process(binary.right, **kw)
                        )
        return "%s @@ to_tsquery(%s)" % (
                        self.process(binary.left, **kw),
                        self.process(binary.right, **kw))
    
  5. Mike Bayer repo owner

    if you want to write me a few short tests, just add them into test/dialect/postgresql/test_query -> MatchTest as assert_compile() tests, send me a PR and we can put it in.

  6. Mike Bayer repo owner

    though maybe better a new series of tests in postgresql/test_compiler, the match tests there are spending too much time worrying about paramstyles

  7. Log in to comment