postgresql full-text search on `match` should accept multiple parameters

Issue #3078 resolved

jvanasco created an issue 2014-06-10

I wanted to note a deficiency in the support for full-text search.

the various tsvector and tsquery commands accept 1 or 2 arguments

( string ) ( dictionary_type, string )

as in - http://www.postgresql.org/docs/9.1/static/functions-textsearch.html

while it possible to explicitly search with a to_tsvector()/to_tesquery that accepts a configuration argument ( as in https://bitbucket.org/zzzeek/sqlalchemy/issue/2800/unable-to-create-tsvector-functional-index ), it is not possible to sneak this parameter into the match command.

ie:

search_query = 'dogs & cats'
sqlalchemy.func.to_tsquery( text("'english'"), MyTable.mycolumn )\
.match( sqlalchemy.text("'english'"), search_query )

the target sql would be:

to_tsvector('english', mytable.mycolumn) @@ to_tsquery('english', 'dogs & cats')

The effects of this behavior:

loss of control in how the actual matching occurs
vectors may need to be computed in real-time, rendering indexes or precomputed values useless.

3078.patch

Comments (17)

Mike Bayer repo owner
OK well if you are PG specific you'd need to use tsvector() / tsquery() explicitly anyway, right? what's the alternative, having MATCH parse the string for operators? we'd need a full parser for that.
- 2014-06-10T16:06:20+00:00
Mike Bayer repo owner
- changed milestone to 0.x.xx
- 2014-06-10T16:06:37+00:00
jvanasco reporter
in order to do an optimized search using an index, it would be a postgresql specific query already - one would need to use tsvector on the column.

maybe there could be a PG specific ts_match function, that behaves like match, but just accepts 2 parameters.

just to be clear -- the only change I'm requesting is for match to be able to emit:
```
 @@ to_tsquery('english', 'dogs & cats')
```
as it now can only emit:

@@ to_tsquery('dogs & cats')

i'm not sure how this affects querying or planning with tsv columns; I haven't tried that yet.

the difference against a tsvector gin index though are huge - a factor of 1000 between a sequential and index scan on a moderately small table of a few million rows.

raw sql:

-- 46.439 ms SELECT title FROM indexed WHERE to_tsvector('english', title) @@ to_tsquery('english', 'dog') ;

-- 46954.893 ms SELECT title FROM indexed WHERE to_tsvector('english', title) @@ to_tsquery('dog') ;
- 2014-06-10T16:37:01+00:00

Mike Bayer repo owner

im just looking at the docs for this. I guess you are saying:

func.to_tsvector('fat cats ate rats').match('cat & rat')

so....right now you are just saying...

func.to_tsvector('english', 'fat cats ate rats').op('@@')(func.to_tsquery('english', 'cat & rat'))

So not hard to do for now, looking at http://www.postgresql.org/docs/9.1/static/textsearch-controls.html I guess you just want postgresql_regconfig as a kwarg, see attached patch:

from sqlalchemy.dialects import postgresql
from sqlalchemy.sql import column, func

expr = func.to_tsvector('english', 'fat cats ate rats').match('cat & rat', postgresql_regconfig='english')
print expr.compile(dialect=postgresql.dialect())

2014-06-10T17:03:18+00:00

Mike Bayer repo owner
- attached 3078.patch
- 2014-06-10T17:03:53+00:00

jvanasco reporter

Yeah, that's exactly it. It's doable in sqlalchemy now with lots of func/op calls (anything is). but it's ugly. the kwarg is a nice approach.

one question

you implemented it in the patch as...

binary.modifiers['postgresql_regconfig'],

but looking elsewhere on ( at visit_notilike_op_binary, specifically ) the code has (essentially) self.render_literal_value(binary.modifiers["escape"], sqltypes.STRINGTYPE)

are the binary.modifiers already cleaned, or should the kwarg be processed for security/other concerns like this :

def visit_match_op_binary(self, binary, operator, **kw):
    if "postgresql_regconfig" in binary.modifiers :
        regconfig = self.render_literal_value( binary.modifiers['postgresql_regconfig'] )
        if regconfig :
            return "%s @@ to_tsquery('%s', %s)" % (
                        self.process(binary.left, **kw),
                        regconfig,
                        self.process(binary.right, **kw)
                    )
    return "%s @@ to_tsquery(%s)" % (
                    self.process(binary.left, **kw),
                    self.process(binary.right, **kw))

2014-06-10T17:17:32+00:00

Mike Bayer repo owner
we can do it that way, sure. since it comes out that way on the left side anyway.
- 2014-06-10T17:33:54+00:00
Mike Bayer repo owner
if you want to write me a few short tests, just add them into test/dialect/postgresql/test_query -> MatchTest as assert_compile() tests, send me a PR and we can put it in.
- 2014-06-10T17:35:22+00:00
Mike Bayer repo owner
though maybe better a new series of tests in postgresql/test_compiler, the match tests there are spending too much time worrying about paramstyles
- 2014-06-10T17:37:05+00:00
jvanasco reporter
I'll take a stab at it tonight.
- 2014-06-10T17:44:41+00:00
jvanasco reporter
https://bitbucket.org/jvanasco/sqlalchemy/commits/82080fcb5f50b2e4f946f3f1e3ae8768d720db5b

slightly modified version of your patch + tests

after doing this, i'm not sure if this is the right approach for the regconfig. i'm encoding it as a string literal, but it could be a bind param like in the calls to func()
- 2014-06-10T23:28:27+00:00
Mike Bayer repo owner
should be fine like that. tests look good
- 2014-06-10T23:44:43+00:00
Mike Bayer repo owner
you have to turn that into a pull request or ill never find it again
- 2014-06-12T14:42:11+00:00
Mike Bayer repo owner
- changed status to resolved
- add postgresql_regconfig argument to PG dialect for match() operator, implements PG's to_tsvector() pattern. fixes ~~#3078~~
→ <<cset 80dec35efbba>>
- 2014-07-08T22:23:06+00:00
Mike Bayer repo owner
- add postgresql_regconfig argument to PG dialect for match() operator, implements PG's to_tsquery('regconfig', 'arg') pattern. fixes ~~#3078~~
→ <<cset c996b76d5b90>>
- 2014-07-08T22:23:06+00:00
Mike Bayer repo owner
- add postgresql_regconfig argument to PG dialect for match() operator, implements PG's to_tsquery('regconfig', 'arg') pattern. fixes ~~#3078~~
→ <<cset 6bb75283f16e>>
- 2014-07-08T22:23:06+00:00
Mike Bayer repo owner
- changed milestone to 1.x.xx
- 2014-08-25T20:42:57+00:00
Log in to comment

Assignee: –

Type: enhancement

Priority: major

Status: resolved

Component: postgres

Milestone: 1.x.xx

Votes: 0

Watchers: 2