Matt Chaput committed 373059e

Adding benchmark products to .hgignore, missed a doc in last commit

  • Parent commits 4009e7a
  • Branches default

Files changed (2)

File .hgignore

File docs/source/ngrams.rst

+Indexing and searching N-grams
+N-gram indexing is a powerful method for getting fast, "search as you type"
+functionality like the search box in iTunes. It is also useful for quick and
+effective indexing of languages without word breaks, such as Chinese and
+Japanese.
+An N-gram is a group of N characters: bigrams are groups of two characters,
+trigrams are groups of three characters, and so on.
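To make the definition concrete, here is a minimal pure-Python sketch (Python 3, not part of Whoosh) that slides a window of N characters across a string. The helper name `char_ngrams` is hypothetical and used only for illustration.

```python
def char_ngrams(text, n):
    """Return every run of n consecutive characters in text, left to right."""
    # Hypothetical helper for illustration; this is not a Whoosh function.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


bigrams = char_ngrams("whoosh", 2)      # groups of two characters
trigrams = char_ngrams("whoosh", 3)     # groups of three characters
cjk_bigrams = char_ngrams("東京都", 2)  # works without any word breaks

print(bigrams)      # ['wh', 'ho', 'oo', 'os', 'sh']
print(trigrams)     # ['who', 'hoo', 'oos', 'osh']
print(cjk_bigrams)  # ['東京', '京都']
```

A string shorter than N yields an empty list, since no full window fits.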
+Whoosh includes two methods for analyzing N-gram fields: an N-gram tokenizer,
+and a filter that breaks tokens into N-grams.
+:class:`whoosh.analysis.NgramTokenizer` tokenizes the entire field into N-grams.
+This suits Chinese, Japanese, and Korean text, where it is useful to index
+bigrams of characters rather than individual characters. Because the tokenizer
+does not split on whitespace, using it with Roman-alphabet languages produces
+tokens that contain spaces.
+>>> from whoosh.analysis import NgramTokenizer
+>>> ngt = NgramTokenizer(minsize=2, maxsize=4)
+>>> [token.text for token in ngt(u"hi there")]
+[u'hi', u'hi ', u'hi t', u'i ', u'i t', u'i th', u' t', u' th', u' the',
+u'th', u'the', u'ther', u'he', u'her', u'here', u'er', u'ere', u're']
+:class:`whoosh.analysis.NgramFilter` breaks individual tokens into N-grams as
+part of an analysis pipeline. This is more useful for languages with word
+separation.
+>>> from whoosh.analysis import StandardAnalyzer, NgramFilter
+>>> my_analyzer = StandardAnalyzer() | NgramFilter(minsize=3, maxsize=4)
+>>> [token.text for token in my_analyzer(u"rendering shaders")]
+[u'ren', u'rend', u'end', u'ende', u'nde', u'nder', u'der', u'deri', u'eri',
+u'erin', u'rin', u'ring', u'ing', u'sha', u'shad', u'had', u'hade', u'ade',
+u'ader', u'der', u'ders', u'ers']
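The difference between the two approaches can be sketched in plain Python 3 without Whoosh. This is an illustration only, not the library's implementation: `ngrams_in_range`, `field_ngrams`, and `word_ngrams` are hypothetical helpers, and `str.split()` is a crude stand-in for a real word tokenizer.

```python
def ngrams_in_range(text, minsize, maxsize):
    """Return all substrings of length minsize..maxsize, ordered by start offset."""
    grams = []
    for start in range(len(text) - minsize + 1):
        for size in range(minsize, maxsize + 1):
            end = start + size
            if end <= len(text):  # skip windows that run past the end
                grams.append(text[start:end])
    return grams


def field_ngrams(text, minsize, maxsize):
    """Tokenizer-style: treat the whole field as one string, spaces included."""
    return ngrams_in_range(text, minsize, maxsize)


def word_ngrams(text, minsize, maxsize):
    """Filter-style: split into words first, then take N-grams of each word."""
    grams = []
    for word in text.split():  # assumption: whitespace splitting is good enough here
        grams.extend(ngrams_in_range(word, minsize, maxsize))
    return grams


whole_field = field_ngrams("hi there", 2, 4)
per_word = word_ngrams("hi there", 2, 4)

print(any(" " in gram for gram in whole_field))    # True: grams span the space
print(any(" " in gram for gram in per_word))       # False: grams stay inside words
print(word_ngrams("rendering shaders", 3, 4)[:4])  # ['ren', 'rend', 'end', 'ende']
```

With sizes 3 to 4, `word_ngrams("rendering shaders", 3, 4)` yields the same 3- and 4-character grams as the listing above, while the whole-field variant would also emit grams such as `'ng s'` that cross the word boundary.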