Commits

Matt Chaput committed 4009e7a

Checking in half-finished docs, minor changes.

  • Parent commits b8ecb66


Files changed (6)

File docs/source/analysis.rst

 
 An analyzer is a function or callable class (a class with a ``__call__`` method)
 that takes a unicode string and returns a generator of tokens. Usually a "token"
-is a word, for example the string "Mary had a little lamb" might yield the tokens
-"Mary", "had", "a", "little", and "lamb". However, tokens do not necessarily
-correspond to words. For example, you might tokenize Chinese text into individual
-characters or bi-grams. Tokens are the units of indexing, that is, they are what
-you are able to look up in the index.
+is a word, for example the string "Mary had a little lamb" might yield the
+tokens "Mary", "had", "a", "little", and "lamb". However, tokens do not
+necessarily correspond to words. For example, you might tokenize Chinese text
+into individual characters or bi-grams. Tokens are the units of indexing, that
+is, they are what you are able to look up in the index.
 
-An analyzer is basically just a wrapper for a tokenizer and zero or more filters.
-The analyzer's ``__call__`` method will pass its parameters to a tokenizer, and
-the tokenizer will usually be wrapped in a few filters.
+An analyzer is basically just a wrapper for a tokenizer and zero or more
+filters. The analyzer's ``__call__`` method will pass its parameters to a
+tokenizer, and the tokenizer will usually be wrapped in a few filters.
 
 A tokenizer is a callable that takes a unicode string and yields a series of
 analysis.Token objects.
 
-For example, the provided :class:`whoosh.analysis.RegexTokenizer` class implements
-a customizable, regular-expression-based tokenizer that extracts words and ignores
-whitespace and punctuation.
+For example, the provided :class:`whoosh.analysis.RegexTokenizer` class
+implements a customizable, regular-expression-based tokenizer that extracts
+words and ignores whitespace and punctuation.
 
 >>> from whoosh.analysis import RegexTokenizer
 >>> tokenizer = RegexTokenizer()
     u'i'
     u'want'
 
-An analyzer is just a means of combining a tokenizer and some filters into a single
-package.
+An analyzer is just a means of combining a tokenizer and some filters into a
+single package.
 
-You can implement an analyzer as a custom class or function, or compose tokenizers
-and filters together using the ``|`` character::
+You can implement an analyzer as a custom class or function, or compose
+tokenizers and filters together using the ``|`` character::
 
 	my_analyzer = RegexTokenizer() | LowercaseFilter() | StopFilter()
 	
 The first item must be a tokenizer and the rest must be filters (you can't put a
-filter first or a tokenizer after the first item). Note that is only works if
-at least the tokenizer is a subclass of ``whoosh.analysis.Composable``, as all
-the tokenizers and filters that ship with Whoosh are.
+filter first or a tokenizer after the first item). Note that this only works
+if at least the tokenizer is a subclass of ``whoosh.analysis.Composable``, as
+all the tokenizers and filters that ship with Whoosh are.
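+
+As a rough sketch (not from the Whoosh source), an analyzer written as a plain
+function might simply chain a tokenizer and some filters by hand::
+
+    from whoosh.analysis import RegexTokenizer, LowercaseFilter, StopFilter
+
+    def my_analyzer(value, **kwargs):
+        # Tokenize, then pass the token stream through each filter in turn
+        tokens = RegexTokenizer()(value, **kwargs)
+        tokens = LowercaseFilter()(tokens)
+        tokens = StopFilter()(tokens)
+        return tokens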
 
 See the :mod:`whoosh.analysis` module for information on the available analyzers,
 tokenizers, and filters shipped with Whoosh.
 When you create a field in a schema, you can specify your analyzer as a keyword
 argument to the field object::
 
-	schema = Schema(content=TEXT(analyzer = StemmingAnalyzer()))
+	schema = Schema(content=TEXT(analyzer=StemmingAnalyzer()))
 
 
 Advanced Analysis
 Token objects
 -------------
 
-The ``Token`` class has no methods. It is merely a place to record certain attributes.
-A ``Token`` object actually has two kinds of attributes: *settings* that record what
-kind of information the Token object does or should contain, and *information* about
-the current token.
+The ``Token`` class has no methods. It is merely a place to record certain
+attributes. A ``Token`` object actually has two kinds of attributes: *settings*
+that record what kind of information the Token object does or should contain,
+and *information* about the current token.
+
 
 Token setting attributes
 ------------------------
 
-A Token object should always have the following attributes. A tokenizer or filter
-can check these attributes to see what kind of information is available and/or what kind of information they should be setting on the Token object.
+A Token object should always have the following attributes. A tokenizer or
+filter can check these attributes to see what kind of information is available
+and/or what kind of information they should be setting on the Token object.
 
-These attributes are set by the tokenizer when it creates the Token(s), based on the
-parameters passed to it from the Analyzer.
+These attributes are set by the tokenizer when it creates the Token(s), based on
+the parameters passed to it from the Analyzer.
 
 Filters **should not** change the values of these attributes.
 
 ====== ================ =================================================== =========
 Type   Attribute name   Description                                         Default
 ====== ================ =================================================== =========
+str    mode             The mode in which the analyzer is being called,     ''
+                        e.g. 'index' during indexing or 'query' during
+                        query parsing
 bool   positions        Whether term positions are recorded in the token    False
 bool   chars            Whether term start and end character indices are    False
                         recorded in the token    
-bool    boosts          Whether per-term boosts are recorded in the token   False
-bool    removestops     Whether stop-words should be removed from the       True
+bool   boosts           Whether per-term boosts are recorded in the token   False
+bool   removestops      Whether stop-words should be removed from the       True
                         token stream
 ====== ================ =================================================== =========
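 
 For example, here is a rough sketch (not the actual Whoosh implementation) of a
 simplified stop-word filter that checks the ``removestops`` setting to decide
 whether to drop stopped tokens or merely mark them::
 
     from whoosh.analysis import Filter
 
     class SimpleStopFilter(Filter):
         """Hypothetical, simplified stop-word filter."""
         def __init__(self, stoplist=frozenset([u"a", u"an", u"the"])):
             self.stops = stoplist
 
         def __call__(self, tokens):
             for t in tokens:
                 if t.text in self.stops:
                     if t.removestops:
                         continue          # drop the token entirely
                     t.stopped = True      # keep it, but mark it as stopped
                 yield t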
 
+
 Token information attributes
 ----------------------------
 
-A Token object may have any of the following attributes. The text attribute should
-always be present. The original attribute may be set by a tokenizer. All other
-attributes should only be accessed or set based on the values of the "settings"
-attributes above.
+A Token object may have any of the following attributes. The text attribute
+should always be present. The original attribute may be set by a tokenizer. All
+other attributes should only be accessed or set based on the values of the
+"settings" attributes above.
 
 ======== ========== =================================================================
 Type     Name       Description
 ======== ========== =================================================================
 
 So why are most of the information attributes optional? Different field formats
-require different levels of information about each token. For example, the Frequency
-format only needs the token text. The Positions format records term positions, so it
-needs them on the Token. The Characters format records term positions and the start
-and end character indices of each term, so it needs them on the token, and so on.
+require different levels of information about each token. For example, the
+Frequency format only needs the token text. The Positions format records term
+positions, so it needs them on the Token. The Characters format records term
+positions and the start and end character indices of each term, so it needs them
+on the token, and so on.
 
-The Format object that represents the format of each field calls the analyzer for the
-field, and passes it parameters corresponding to the types of information it needs,
-e.g.::
+The Format object that represents the format of each field calls the analyzer
+for the field, and passes it parameters corresponding to the types of
+information it needs, e.g.::
 
     analyzer(unicode_string, positions=True)
 
 The analyzer can then pass that information to a tokenizer so the tokenizer
 initializes the required attributes on the Token object(s) it produces.
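 
 For example, the stock ``RegexTokenizer`` accepts a ``positions`` argument and
 records a ``pos`` attribute on each token it yields (a rough doctest-style
 sketch):
 
 >>> from whoosh.analysis import RegexTokenizer
 >>> rt = RegexTokenizer()
 >>> [(t.text, t.pos) for t in rt(u"Mary had a lamb", positions=True)]
 [(u'Mary', 0), (u'had', 1), (u'a', 2), (u'lamb', 3)]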
 
+
+Performing different analysis for indexing and query parsing
+------------------------------------------------------------
+
+Whoosh sets the ``mode`` setting attribute to indicate whether the analyzer is
+being called by the indexer (``mode='index'``) or the query parser
+(``mode='query'``). This is useful if there's a transformation that you only
+want to apply at indexing or query parsing::
+
+    class MyFilter(Filter):
+        def __call__(self, tokens):
+            for t in tokens:
+                if t.mode == 'query':
+                    ...
+                else:
+                    ...
+
+The :class:`whoosh.analysis.MultiFilter` filter class lets you specify different
+filters to use based on the mode setting::
+
+    intraword = MultiFilter(index=IntraWordFilter(mergewords=True, mergenums=True),
+                            query=IntraWordFilter(mergewords=False, mergenums=False))
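+
+For example (a sketch, not from the Whoosh documentation), such a
+``MultiFilter`` could be dropped into an analyzer chain using the ``|``
+composition syntax shown earlier::
+
+    from whoosh.analysis import (RegexTokenizer, LowercaseFilter,
+                                 IntraWordFilter, MultiFilter)
+
+    intraword = MultiFilter(index=IntraWordFilter(mergewords=True, mergenums=True),
+                            query=IntraWordFilter(mergewords=False, mergenums=False))
+    my_analyzer = RegexTokenizer(r"\S+") | intraword | LowercaseFilter()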
+
+
 Stop words
 ----------
 
-"Stop" words are words that are so common it's often counter-productive to index them,
-such as "and", "or", "if", etc. The provided analysis.StopFilter lets you filter out
-stop words, and includes a default list of common stop words.
+"Stop" words are words that are so common it's often counter-productive to index
+them, such as "and", "or", "if", etc. The provided analysis.StopFilter lets you
+filter out stop words, and includes a default list of common stop words.
 
 >>> from whoosh.analysis import StopFilter
 >>> stopper = StopFilter()
 However, this seemingly simple filter idea raises a couple of minor but slightly
 thorny issues: renumbering term positions and keeping or removing stopped words.
 
+
 Renumbering term positions
 --------------------------
 
-Remember that analyzers are sometimes asked to record the position of each token in
-the token stream:
+Remember that analyzers are sometimes asked to record the position of each token
+in the token stream:
 
 ============= ========== ========== ========== ==========
 Token.text    u'Mary'    u'had'     u'a'       u'lamb'
 Token.pos     0          1          2          3
 ============= ========== ========== ========== ==========
 
-So what happens to the ``pos`` attribute of the tokens if ``StopFilter`` removes the
-words ``had`` and ``a`` from the stream? Should it renumber the positions to pretend
-the "stopped" words never existed? I.e.:
+So what happens to the ``pos`` attribute of the tokens if ``StopFilter`` removes
+the words ``had`` and ``a`` from the stream? Should it renumber the positions to
+pretend the "stopped" words never existed? I.e.:
 
 ============= ========== ==========
 Token.text    u'Mary'    u'lamb'
 Token.pos     0          3
 ============= ========== ==========
 
-It turns out that different situations call for different solutions, so the provided
-``StopFilter`` class supports both of the above behaviors. Renumbering is the default,
-since that is usually the most useful and is necessary to support phrase searching.
-However, you can set a parameter in StopFilter's constructor to tell it not to renumber
-positions::
+It turns out that different situations call for different solutions, so the
+provided ``StopFilter`` class supports both of the above behaviors. Renumbering
+is the default, since that is usually the most useful and is necessary to
+support phrase searching. However, you can set a parameter in StopFilter's
+constructor to tell it not to renumber positions::
 
     stopper = StopFilter(renumber=False)
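 
 For example, a rough doctest-style sketch of the two behaviors, using an
 explicit stop list for clarity:
 
 >>> from whoosh.analysis import RegexTokenizer, StopFilter
 >>> rt = RegexTokenizer()
 >>> stopper = StopFilter(stoplist=["had", "a"])
 >>> [(t.text, t.pos) for t in stopper(rt(u"Mary had a lamb", positions=True))]
 [(u'Mary', 0), (u'lamb', 1)]
 >>> stopper = StopFilter(stoplist=["had", "a"], renumber=False)
 >>> [(t.text, t.pos) for t in stopper(rt(u"Mary had a lamb", positions=True))]
 [(u'Mary', 0), (u'lamb', 3)]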
 
+
 Removing or leaving stop words
 ------------------------------
 
-The point of using ``StopFilter`` is to remove stop words, right? Well, there are
-actually some situations where you might want to mark tokens as "stopped" but not remove
-them from the token stream.
+The point of using ``StopFilter`` is to remove stop words, right? Well, there
+are actually some situations where you might want to mark tokens as "stopped"
+but not remove them from the token stream.
 
-For example, if you were writing your own query parser, you could run the user's query
-through a field's analyzer to break it into tokens. In that case, you might want to know
-which words were "stopped" so you can provide helpful feedback to the end user (e.g.
-"The following words are too common to search for:").
+For example, if you were writing your own query parser, you could run the user's
+query through a field's analyzer to break it into tokens. In that case, you
+might want to know which words were "stopped" so you can provide helpful
+feedback to the end user (e.g. "The following words are too common to search
+for:").
 
-In other cases, you might want to leave stopped words in the stream for certain filtering
-steps (for example, you might have a step that looks at previous tokens, and want the
-stopped tokens to be part of the process), but then remove them later.
+In other cases, you might want to leave stopped words in the stream for certain
+filtering steps (for example, you might have a step that looks at previous
+tokens, and want the stopped tokens to be part of the process), but then remove
+them later.
 
-The ``analysis`` module provides a couple of tools for keeping and removing stop-words
-in the stream.
+The ``analysis`` module provides a couple of tools for keeping and removing
+stop-words in the stream.
 
-The ``removestops`` parameter passed to the analyzer's ``__call__`` method (and copied
-to the Token object as an attribute) specifies whether stop words should be removed from
-the stream or left in.
+The ``removestops`` parameter passed to the analyzer's ``__call__`` method (and
+copied to the Token object as an attribute) specifies whether stop words should
+be removed from the stream or left in.
 
 >>> from whoosh.analysis import StandardAnalyzer
 >>> analyzer = StandardAnalyzer()
 >>> [(t.text, t.stopped) for t in analyzer(u"This is a test", removestops=False)]
 [(u'this', True), (u'is', True), (u'a', True), (u'test', False)]
 
-The ``analysis.unstopped()`` filter function takes a token generator and yields only the
-tokens whose stopped attribute is False.
+The ``analysis.unstopped()`` filter function takes a token generator and yields
+only the tokens whose ``stopped`` attribute is ``False``.
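+
+For example, continuing the ``StandardAnalyzer`` doctest above (a rough sketch,
+assuming the same ``analyzer`` object):
+
+>>> from whoosh.analysis import unstopped
+>>> [t.text for t in unstopped(analyzer(u"This is a test", removestops=False))]
+[u'test']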
 
-Note: even if you leave stopped words in the stream in an analyzer you use for indexing,
-the indexer will ignore any tokens with the stopped attribute set to True.
+Note: even if you leave stopped words in the stream in an analyzer you use for
+indexing, the indexer will ignore any tokens where the ``stopped`` attribute is
+``True``.
+
 
 Implementation notes
 --------------------
 
-Because object creation is slow in Python, the stock tokenizers do not create a new
-analysis.Token object for each token. Instead, they create one Token object and yield
-it over and over. This is a nice performance shortcut but can lead to strange behavior
-if your code tries to remember tokens between loops of the generator.
+Because object creation is slow in Python, the stock tokenizers do not create a
+new analysis.Token object for each token. Instead, they create one Token object
+and yield it over and over. This is a nice performance shortcut but can lead to
+strange behavior if your code tries to remember tokens between loops of the
+generator.
 
-Because the analyzer only has one Token object, of which it keeps changing the attributes,
-if you keep a copy of the Token you get from a loop of the generator, it will be changed
-from under you. For example:
+Because the analyzer only has one Token object, of which it keeps changing the
+attributes, if you keep a copy of the Token you get from a loop of the
+generator, it will be changed from under you. For example:
 
 >>> list(tokenizer(u"Hello there my friend"))
 [Token(u"friend"), Token(u"friend"), Token(u"friend"), Token(u"friend")]
 That is, save the attributes, not the token object itself.
 
 If you implement your own tokenizer, filter, or analyzer as a class, you should
-implement an ``__eq__`` method. This is important to allow comparison of Schema objects.
+implement an ``__eq__`` method. This is important to allow comparison of Schema
+objects.
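+
+A minimal sketch (hypothetical, not from the Whoosh source) of what such an
+``__eq__`` method might look like on a custom filter::
+
+    from whoosh.analysis import Filter
+
+    class MyLowercaseFilter(Filter):
+        """Hypothetical filter with a single configuration attribute."""
+        def __init__(self, lower=True):
+            self.lower = lower
+
+        def __eq__(self, other):
+            # Equal only if the other object is the same class with the
+            # same settings
+            return (other is not None
+                    and self.__class__ is other.__class__
+                    and self.lower == other.lower)
+
+        def __call__(self, tokens):
+            for t in tokens:
+                if self.lower:
+                    t.text = t.text.lower()
+                yield t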
 
-The mixing of persistent "setting" and transient "information" attributes on the Token
-object is not especially elegant. If I ever have a better idea I might change it ;)
-Nothing requires that an Analyzer be implemented by calling a tokenizer and filters.
-Tokenizers and filters are simply a convenient way to structure the code. You're free to
-write an analyzer any way you want, as long as it implements ``__call__``.
+The mixing of persistent "setting" and transient "information" attributes on
+the Token object is not especially elegant. If I ever have a better idea I
+might change it ;)
+
+Nothing requires that an Analyzer be implemented by calling a tokenizer and
+filters. Tokenizers and filters are simply a convenient way to structure the
+code. You're free to write an analyzer any way you want, as long as it
+implements ``__call__``.
 
 
 

File docs/source/index.rst

     querylang
     query
     analysis
+    stemming
+    ngrams
     facets
     highlight
     keywords

File docs/source/spelling.rst

+==============================
 Using the Whoosh spell checker
 ==============================
 
+
 Overview
---------
+========
 
 Whoosh includes pure-Python spell-checking library functions that use the Whoosh
 search engine for back-end storage.
 
 
 Creating the spelling dictionary
---------------------------------
+================================
 
 You need to populate the spell-checking dictionary with (properly spelled) words
 to check against. There are a few strategies for doing this:
 
 
 Getting suggestions
---------------------
+====================
 
 Once you have words in the spelling dictionary, you can use the ``suggest()``
 method to check words::
 
 
 Word scores
------------
+===========
 
 Each word in the dictionary can have a "score" associated with it. When two or
 more suggestions have the same "edit distance" (number of differences) from the
 
 
 Spell checking Whoosh queries
------------------------------
+=============================
 
 If you want to spell check a user query, first parse the user's query into a
 ``whoosh.query.Query`` object tree, using the default parser or your own custom
 
 
 Updating the spelling dictionary
---------------------------------
+================================
 
 The spell checker is mainly intended to be "write-once, read-many". You can
 continually add words to the dictionary, but it is not possible to remove words

File docs/source/stemming.rst

+========================================
+Stemming, variations, and accent folding
+========================================
+
+The problem
+===========
+
+The indexed text will often contain words in a different form than the one the
+user searches for. For example, if the user searches for ``render``, we would
+like the search to match not only documents that contain ``render``, but also
+``renders``, ``rendering``, ``rendered``, etc.
+
+A related problem is one of accents. Names and loan words may contain accents in
+the original text but not in the user's query, or vice versa. For example, we
+want the user to be able to search for ``cafe`` and find documents containing
+``café``.
+
+The default analyzer for the :class:`whoosh.fields.TEXT` field does not do
+stemming or accent folding. In order to allow them, you need to set up the
+field with a custom analyzer that includes the appropriate filters, as
+described in the sections below.
+
+
+Stemming
+========
+
+Stemming is a heuristic process of removing suffixes (and sometimes prefixes)
+from words to arrive (hopefully, most of the time) at the base word. Whoosh
+includes several stemming algorithms such as Porter and Porter2, Paice Husk,
+and Lovins.
+
+>>> from whoosh.lang.porter import stem
+>>> stem("rendering")
+'render'
+
+The stemming filter applies the stemming function to the terms it indexes, and
+to words in user queries. So in principle, different forms of a word (such as
+``render``, ``renders``, and ``rendering``) are reduced to the same indexed
+term, and a query for any one form will match the others.
+
+The :class:`whoosh.analysis.StemFilter` lets you add a stemming filter to an
+analyzer chain.
+
+>>> from whoosh.analysis import RegexTokenizer, StemFilter
+>>> rext = RegexTokenizer()
+>>> stream = rext(u"fundamentally willows")
+>>> stemmer = StemFilter()
+>>> [token.text for token in stemmer(stream)]
+[u'fundament', u'willow']
+
+The :func:`whoosh.analysis.StemmingAnalyzer` is a pre-packaged analyzer that
+combines a tokenizer, lower-case filter, optional stop filter, and stem filter::
+
+    from whoosh import fields
+    from whoosh.analysis import StemmingAnalyzer
+
+    stem_ana = StemmingAnalyzer()
+    schema = fields.Schema(title=fields.TEXT(analyzer=stem_ana, stored=True),
+                           content=fields.TEXT(analyzer=stem_ana))
+
+Stemming has pros and cons.
+
+* It allows the user to find documents without worrying about word forms.
+
+* It reduces the size of the index, since it reduces the number of separate
+  terms indexed by "collapsing" multiple word forms into a single base word.
+
+* It's faster than using variations (see below).
+
+* The stemming algorithm can sometimes incorrectly conflate words or change
+  the meaning of a word by removing suffixes.
+
+* The stemmed forms are often not proper words, so the terms in the field
+  are not useful for things like creating a spelling dictionary.
+
+
+Variations
+==========
+
+Whereas stemming encodes the words in the index in a base form, when you use
+variations you instead index words "as is" and *at query time* expand words
+in the user query using a heuristic algorithm to generate morphological
+variations of the word.
+
+>>> from whoosh.lang.morph_en import variations
+>>> variations("rendered")
+set(['rendered', 'rendernesses', 'render', 'renderless', 'rendering',
+'renderness', 'renderes', 'renderer', 'renderements', 'rendereless',
+'renderenesses', 'rendere', 'renderment', 'renderest', 'renderement',
+'rendereful', 'renderers', 'renderful', 'renderings', 'renders', 'renderly',
+'renderely', 'rendereness', 'renderments'])
+
+Many of the generated variations for a given word will not be valid words, but
+it's fairly fast for Whoosh to check which variations are actually in the
+index and only search for those.
+
+The :class:`whoosh.query.Variations` query object lets you search for variations
+of a word. Whereas the normal :class:`whoosh.query.Term` object only searches
+for the given term, the ``Variations`` query acts like an ``Or`` query for the
+variations of the given word in the index. For example, the query::
+
+    query.Variations("content", "rendered")
+    
+...might act like this (depending on what words are in the index)::
+
+    query.Or([query.Term("content", "render"), query.Term("content", "rendered"),
+              query.Term("content", "renders"), query.Term("content", "rendering")])
+
+To have the query parser use :class:`whoosh.query.Variations` instead of
+:class:`whoosh.query.Term` for individual terms, use the ``termclass``
+keyword argument to the parser initialization method::
+
+    from whoosh import qparser, query
+    
+    qp = qparser.QueryParser("content", termclass=query.Variations)
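+
+Parsing then works as usual; single words in the user's query simply come out
+as ``Variations`` objects rather than ``Term`` objects (a rough sketch,
+continuing the example above)::
+
+    q = qp.parse(u"rendered")
+    # q is now (roughly) query.Variations("content", u"rendered")
+    # instead of query.Term("content", u"rendered")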
+
+Variations has pros and cons.
+
+* It allows the user to find documents without worrying about word forms.
+
+* The terms in the field are actual words, not stems, so you can use the
+  field's contents for other purposes such as spell checking queries.
+
+* It increases the size of the index relative to stemming, because different
+  word forms are indexed separately.
+  
+* It acts like an ``Or`` search for all the variations, which is slower than
+  searching for a single term.
+  
+
+Lemmatization
+=============
+
+Whereas stemming is a somewhat "brute force", mechanical attempt at reducing
+words to their base form using simple rules, lemmatization usually refers to
+more sophisticated methods of finding the base form ("lemma") of a word using
+language models, often involving analysis of the surrounding context and
+part-of-speech tagging.
+
+Whoosh does not include any lemmatization functions, but if you have separate
+lemmatizing code you could write a custom :class:`whoosh.analysis.Filter`
+to integrate it into a Whoosh analyzer.
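+
+For example, a minimal sketch of such a filter, assuming a hypothetical
+``lemmatize`` callable supplied by your own lemmatizing library::
+
+    from whoosh.analysis import RegexTokenizer, LowercaseFilter, Filter
+
+    class LemmaFilter(Filter):
+        """Hypothetical filter wrapping an external lemmatizer callable."""
+        def __init__(self, lemmatize):
+            # lemmatize: a callable mapping a word to its base form ("lemma")
+            self.lemmatize = lemmatize
+
+        def __call__(self, tokens):
+            for t in tokens:
+                t.text = self.lemmatize(t.text)
+                yield t
+
+    # For example, with your own lemmatizer callable:
+    # my_analyzer = RegexTokenizer() | LowercaseFilter() | LemmaFilter(my_lemmatizer)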
+
+
+Character folding
+=================
+
+You can set up an analyzer to treat, for example, ``á``, ``a``, ``å``, and ``â``
+as equivalent to improve recall. This is often very useful, allowing the user
+to, for example, type ``cafe`` or ``resume`` and find documents containing
+``café`` and ``resumé``.
+
+Character folding is especially useful for unicode characters that may appear
+in Asian language texts that should be treated as equivalent to their ASCII
+equivalent, such as "half-width" characters.
+
+Character folding is not always a panacea. See this article for caveats on where
+accent folding can break down.
+
+http://www.alistapart.com/articles/accent-folding-for-auto-complete/
+
+Whoosh includes several mechanisms for adding character folding to an analyzer.
+
+The :class:`whoosh.analysis.CharsetFilter` applies a character map to token
+text. For example, it will filter the tokens ``u'café', u'resumé', ...`` to
+``u'cafe', u'resume', ...``. This is usually the method you'll want to use
+unless you need to use a charset to tokenize terms::
+
+    from whoosh.analysis import CharsetFilter, StemmingAnalyzer
+    from whoosh import fields
+    from whoosh.support.charset import accent_map
+    
+    # For example, to add an accent-folding filter to a stemming analyzer:
+    my_analyzer = StemmingAnalyzer() | CharsetFilter(accent_map)
+    
+    # To use this analyzer in your schema:
+    my_schema = fields.Schema(content=fields.TEXT(analyzer=my_analyzer))
+
+The :class:`whoosh.analysis.CharsetTokenizer` uses a Sphinx charset table to
+both separate terms and perform character folding. This tokenizer is slower
+than the :class:`whoosh.analysis.RegexTokenizer` because it loops over each
+character in Python. If the language(s) you're indexing can be tokenized using
+regular expressions, it will be much faster to use ``RegexTokenizer`` and
+``CharsetFilter`` in combination instead of using ``CharsetTokenizer``.
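+
+A rough sketch of that combination (reusing the ``accent_map`` from
+``whoosh.support.charset`` shown above)::
+
+    from whoosh.analysis import RegexTokenizer, LowercaseFilter, CharsetFilter
+    from whoosh.support.charset import accent_map
+
+    my_analyzer = (RegexTokenizer() | LowercaseFilter()
+                   | CharsetFilter(accent_map))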
+
+The :mod:`whoosh.support.charset` module contains an accent folding map useful
+for most Western languages, as well as a much more extensive Sphinx charset
+table and a function to convert Sphinx charset tables into the character maps
+required by ``CharsetTokenizer`` and ``CharsetFilter``::
+    
+    # To create a filter using an enormous character map for most languages,
+    # generated from a Sphinx charset table
+    from whoosh.analysis import CharsetFilter, StemmingAnalyzer
+    from whoosh.support.charset import default_charset, charset_table_to_dict
+    charmap = charset_table_to_dict(default_charset)
+    my_analyzer = StemmingAnalyzer() | CharsetFilter(charmap)
+
+(The Sphinx charset table format is described at
+http://www.sphinxsearch.com/docs/current.html#conf-charset-table )
+
+

File src/whoosh/matching.py

     def id(self):
         return self.a.id()
     
-    def all_ids(self):
-        return iter(sorted(set(self.a.all_ids()) & set(self.b.all_ids())))
+    #def all_ids(self):
+    #    return iter(sorted(set(self.a.all_ids()) & set(self.b.all_ids())))
     
     def skip_to(self, id):
         if not self.is_active(): raise ReadTooFar

File src/whoosh/query.py

 
     def simplify(self, ixreader):
         existing = [Term(self.fieldname, word, boost=self.boost)
-                    for word in self._words(ixreader)]
+                    for word in set(self._words(ixreader))]
         if len(existing) == 1:
             return existing[0]
         elif existing: