postgresql full-text search on `match` should accept multiple parameters
I wanted to note a deficiency in the support for full-text search.
the various tsvector and tsquery commands accept 1 or 2 arguments
( string ) ( dictionary_type, string )
as in - http://www.postgresql.org/docs/9.1/static/functions-textsearch.html
while it possible to explicitly search with a to_tsvector()/to_tesquery that accepts a configuration argument ( as in https://bitbucket.org/zzzeek/sqlalchemy/issue/2800/unable-to-create-tsvector-functional-index ), it is not possible to sneak this parameter into the match
command.
ie:
search_query = 'dogs & cats'
sqlalchemy.func.to_tsquery( text("'english'"), MyTable.mycolumn )\
.match( sqlalchemy.text("'english'"), search_query )
the target sql would be:
to_tsvector('english', mytable.mycolumn) @@ to_tsquery('english', 'dogs & cats')
The effects of this behavior:
- loss of control in how the actual matching occurs
- vectors may need to be computed in real-time, rendering indexes or precomputed values useless.
Comments (17)
-
repo owner -
repo owner - changed milestone to 0.x.xx
-
reporter in order to do an optimized search using an index, it would be a postgresql specific query already - one would need to use
tsvector
on the column.maybe there could be a PG specific
ts_match
function, that behaves likematch
, but just accepts 2 parameters.just to be clear -- the only change I'm requesting is for match to be able to emit:
@@ to_tsquery('english', 'dogs & cats')
as it now can only emit:
@@ to_tsquery('dogs & cats')
i'm not sure how this affects querying or planning with tsv columns; I haven't tried that yet.
the difference against a tsvector gin index though are huge - a factor of 1000 between a sequential and index scan on a moderately small table of a few million rows.
raw sql:
-- 46.439 ms SELECT title FROM indexed WHERE to_tsvector('english', title) @@ to_tsquery('english', 'dog') ;
-- 46954.893 ms SELECT title FROM indexed WHERE to_tsvector('english', title) @@ to_tsquery('dog') ;
-
repo owner im just looking at the docs for this. I guess you are saying:
func.to_tsvector('fat cats ate rats').match('cat & rat')
so....right now you are just saying...
func.to_tsvector('english', 'fat cats ate rats').op('@@')(func.to_tsquery('english', 'cat & rat'))
So not hard to do for now, looking at http://www.postgresql.org/docs/9.1/static/textsearch-controls.html I guess you just want postgresql_regconfig as a kwarg, see attached patch:
from sqlalchemy.dialects import postgresql from sqlalchemy.sql import column, func expr = func.to_tsvector('english', 'fat cats ate rats').match('cat & rat', postgresql_regconfig='english') print expr.compile(dialect=postgresql.dialect())
-
repo owner - attached 3078.patch
-
reporter Yeah, that's exactly it. It's doable in sqlalchemy now with lots of func/op calls (anything is). but it's ugly. the kwarg is a nice approach.
one question
you implemented it in the patch as...
binary.modifiers['postgresql_regconfig'],
but looking elsewhere on ( at
visit_notilike_op_binary
, specifically ) the code has (essentially)self.render_literal_value(binary.modifiers["escape"], sqltypes.STRINGTYPE)
are the binary.modifiers already cleaned, or should the kwarg be processed for security/other concerns like this :
def visit_match_op_binary(self, binary, operator, **kw): if "postgresql_regconfig" in binary.modifiers : regconfig = self.render_literal_value( binary.modifiers['postgresql_regconfig'] ) if regconfig : return "%s @@ to_tsquery('%s', %s)" % ( self.process(binary.left, **kw), regconfig, self.process(binary.right, **kw) ) return "%s @@ to_tsquery(%s)" % ( self.process(binary.left, **kw), self.process(binary.right, **kw))
-
repo owner we can do it that way, sure. since it comes out that way on the left side anyway.
-
repo owner if you want to write me a few short tests, just add them into test/dialect/postgresql/test_query -> MatchTest as assert_compile() tests, send me a PR and we can put it in.
-
repo owner though maybe better a new series of tests in postgresql/test_compiler, the match tests there are spending too much time worrying about paramstyles
-
reporter I'll take a stab at it tonight.
-
reporter https://bitbucket.org/jvanasco/sqlalchemy/commits/82080fcb5f50b2e4f946f3f1e3ae8768d720db5b
slightly modified version of your patch + tests
after doing this, i'm not sure if this is the right approach for the regconfig. i'm encoding it as a string literal, but it could be a bind param like in the calls to func()
-
repo owner should be fine like that. tests look good
-
repo owner you have to turn that into a pull request or ill never find it again
-
repo owner - changed status to resolved
- add postgresql_regconfig argument to PG dialect for match() operator,
implements PG's to_tsvector() pattern. fixes
#3078
→ <<cset 80dec35efbba>>
-
repo owner - add postgresql_regconfig argument to PG dialect for match() operator,
implements PG's to_tsquery('regconfig', 'arg') pattern. fixes
#3078
→ <<cset c996b76d5b90>>
- add postgresql_regconfig argument to PG dialect for match() operator,
implements PG's to_tsquery('regconfig', 'arg') pattern. fixes
-
repo owner - add postgresql_regconfig argument to PG dialect for match() operator,
implements PG's to_tsquery('regconfig', 'arg') pattern. fixes
#3078
→ <<cset 6bb75283f16e>>
- add postgresql_regconfig argument to PG dialect for match() operator,
implements PG's to_tsquery('regconfig', 'arg') pattern. fixes
-
repo owner - changed milestone to 1.x.xx
- Log in to comment
OK well if you are PG specific you'd need to use tsvector() / tsquery() explicitly anyway, right? what's the alternative, having MATCH parse the string for operators? we'd need a full parser for that.