Matt Chaput committed 373059e

Adding benchmark products to .hgignore, missed a doc in last commit

Files changed (2)

.hgignore

 
 bmark
 *testindex
+benchmark/enron_index
+benchmark/reuters_index
+benchmark/enron_cache.pickle
+benchmark/enron_mail_082109.tar.gz

docs/source/ngrams.rst

+==============================
+Indexing and searching N-grams
+==============================
+
+Overview
+========
+
+N-gram indexing is a powerful way to get fast, "search as you type"
+functionality like that in iTunes. It is also useful for quick and effective
+indexing of languages such as Chinese and Japanese, which are written without
+word breaks.
+
+An N-gram is a group of N consecutive characters: bigrams are groups of two
+characters, trigrams are groups of three characters, and so on.
+
+Whoosh includes two methods for analyzing N-gram fields: an N-gram tokenizer,
+and a filter that breaks tokens into N-grams.
+
+:class:`whoosh.analysis.NgramTokenizer` tokenizes the entire field into N-grams.
+This is most useful for Chinese, Japanese, and Korean text, where it helps to
+index bigrams of characters rather than individual characters. Because the
+tokenizer ignores word boundaries, using it on text that separates words with
+spaces produces tokens that contain spaces.
+
+>>> from whoosh.analysis import NgramTokenizer
+>>> ngt = NgramTokenizer(minsize=2, maxsize=4)
+>>> [token.text for token in ngt(u"hi there")]
+[u'hi', u'hi ', u'hi t', u'i ', u'i t', u'i th', u' t', u' th', u' the', u'th',
+u'the', u'ther', u'he', u'her', u'here', u'er', u'ere', u're']
+
+:class:`whoosh.analysis.NgramFilter` breaks individual tokens into N-grams as
+part of an analysis pipeline. This is more useful for languages with word
+separation.
+
+>>> from whoosh.analysis import StandardAnalyzer, NgramFilter
+>>> my_analyzer = StandardAnalyzer() | NgramFilter(minsize=3, maxsize=4)
+>>> [token.text for token in my_analyzer(u"rendering shaders")]
+[u'ren', u'rend', u'end', u'ende', u'nde', u'nder', u'der', u'deri', u'eri',
+u'erin', u'rin', u'ring', u'ing', u'sha', u'shad', u'had', u'hade', u'ade',
+u'ader', u'der', u'ders', u'ers']
+
+
+
+
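The difference between the two approaches described in the added doc can be sketched in plain Python, without Whoosh. This is only an illustration, not Whoosh's implementation: `char_ngrams` and `word_ngrams` are hypothetical helper names, and lowercasing plus whitespace splitting is an assumed stand-in for a real analyzer. Sizes 3 to 4 are used for the per-word case so the result lines up with the sample output in the doc.

```python
def char_ngrams(text, minsize, maxsize):
    """Return every substring of text with length minsize..maxsize.

    Hypothetical helper: treats the whole string as one unit, so
    spaces can end up inside the N-grams.
    """
    grams = []
    for start in range(len(text) - minsize + 1):
        for size in range(minsize, maxsize + 1):
            end = start + size
            if end <= len(text):
                grams.append(text[start:end])
    return grams


def word_ngrams(text, minsize, maxsize):
    """Lowercase, split on whitespace, then take N-grams of each word.

    Hypothetical helper: the split is a simplification of a real
    analysis pipeline; words shorter than minsize yield nothing.
    """
    grams = []
    for word in text.lower().split():
        grams.extend(char_ngrams(word, minsize, maxsize))
    return grams


# Whole-field N-grams: some grams contain the space between the words.
whole_field = char_ngrams("hi there", 2, 4)
print(whole_field)

# Per-word N-grams: no gram crosses a word boundary.
per_word = word_ngrams("rendering shaders", 3, 4)
print(per_word)
```

Splitting into words first keeps every N-gram inside a single word, which is why the per-word approach suits languages with word separation, while the whole-field approach suits text with no word breaks to split on.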