Commits

Matt Chaput committed d250e2b

Initial commit of whoosh source.

  • Parent commits c73d84a

Files changed (20)

+
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+
+   END OF TERMS AND CONDITIONS
+
+   APPENDIX: How to apply the Apache License to your work.
+
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+
+   Copyright [yyyy] [name of copyright owner]
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.

File src/whoosh/__init__.py

+#===============================================================================
+# Copyright 2007 Matt Chaput
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#    http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#===============================================================================
+
+import logging, sys
+
+log = logging.getLogger("whoosh")
+log.addHandler(logging.StreamHandler(sys.stderr))

File src/whoosh/analysis.py

+#===============================================================================
+# Copyright 2007 Matt Chaput
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#    http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#===============================================================================
+
+import re
+
+try:
+    # Use C extensions for splitting and stemming from zopyx,
+    # if available
+    from zopyx.txng3 import stemmer, splitter
+    StemFilter = stemmer.Stemmer("english").stem
+    SimpleTokenizer = splitter.Splitter().split
+    
+except ImportError:
+    # Use pure-Python versions.
+    from whoosh.support.porter import stem
+    def StemFilter(ws):
+        for w in ws:
+            yield stem(w)
+    
+    def SimpleTokenizer(text):
+        exp = re.compile(r"\W|_")
+        for w in exp.split(text):
+            if w and len(w) > 0:
+                yield w.lower()
+
+_defaultStopWords = ["the", "to", "of", "a", "and", "is", "in", "this",
+                     "you", "for", "be", "on", "or", "will", "if", "can", "are",
+                     "that", "by", "with", "it", "as", "from", "an", "when",
+                     "not", "may", "use", "tbd"]
+
+class StopFilter(object):
+    def __init__(self, stop_words):
+        self.stops = set(stop_words)
+    
+    def __call__(self, ws):
+        for w in ws:
+            if len(w) > 2 and not w in self.stops:
+                yield w
+
+
+class NgramTokenizer(object):
+    def __init__(self, min, max):
+        self.min = min
+        self.max = max
+        
+        self.normalizeExp = re.compile(r"\W+")
+    
+    def __call__(self, text):
+        text = "".join([" ", self.normalizeExp.sub(" ", text).strip(), " "])
+        inLen = len(text)
+        
+        for gramSize in xrange(self.min, self.max + 1):
+            pos = 0
+            limit = inLen - gramSize + 1
+            
+            for pos in xrange(0, limit):
+                yield text[pos:pos + gramSize]
+
+
+def LowerCaseFilter(ws):
+    for w in ws:
+        yield w.lower()
+
+class Analyzer(object):
+    def filter(self, ws):
+        return ws
+    
+    def words(self, text):
+        return self.filter(SimpleTokenizer(text))
+
+class SimpleAnalyzer(Analyzer):
+    def words(self, text):
+        return SimpleTokenizer(text)
+        
+class StopAnalyzer(Analyzer):
+    def __init__(self, stop_words = _defaultStopWords):
+        self.stopper = StopFilter(stop_words)
+    
+    def filter(self, ws):
+        return self.stopper(ws)
+    
+class StemmingAnalyzer(Analyzer):
+    def __init__(self, stop_words = _defaultStopWords):
+        self.stopper = StopFilter(stop_words)
+    
+    def filter(self, ws):
+        return StemFilter(list(self.stopper(ws)))
+    
+class IDAnalyzer(Analyzer):
+    def __init__(self):
+        self.tokenizer = None
+        self.filters = []
+    
+    def words(self, text):
+        yield text
+
+class NgramAnalyzer(Analyzer):
+    def __init__(self, min = 3, max = 4):
+        self.tokenizer = NgramTokenizer(min, max)
+    
+    def words(self, text):
+        for w in self.filter(self.tokenizer(text)):
+            yield w
+    
+    def filter(self, ws):
+        return LowerCaseFilter(ws)
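
A minimal usage sketch of the analyzers above (illustration only, not part of the
commit; it assumes the package is importable as "whoosh" and uses made-up sample
text):

    from whoosh import analysis

    # SimpleTokenizer splits on non-word characters and lowercases, so a
    # SimpleAnalyzer yields plain lowercase tokens.
    for token in analysis.SimpleAnalyzer().words(u"Hello, World of Search"):
        print token                     # hello, world, of, search

    # StopAnalyzer also drops stop words and tokens of one or two characters.
    print list(analysis.StopAnalyzer().words(u"the quick brown fox"))

    # NgramAnalyzer yields lowercased character n-grams (3- and 4-grams here),
    # padded with a leading and trailing space by NgramTokenizer.
    print list(analysis.NgramAnalyzer(3, 4).words(u"Index"))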

File src/whoosh/fields.py

+#===============================================================================
+# Copyright 2007 Matt Chaput
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#    http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#===============================================================================
+
+from collections import defaultdict
+
+
+def has_positions(field):
+    return hasattr(field, "data_to_positions")
+
+
+class Field(object):
+    def __init__(self, name, analyzer, field_boost = 1.0, **options):
+        self.name = name
+        self.analyzer = analyzer
+        self.field_boost = field_boost
+        self.number = -1
+        self.options = options
+        
+    def __repr__(self):
+        return "%s(\"%s\", %s, boost = %s)" % (self.__class__.__name__,
+                                              self.name,
+                                              repr(self.analyzer),
+                                              self.field_boost)
+    
+    def __eq__(self, other):
+        if not hasattr(other, "name"): return False
+        if not hasattr(other, "analyzer"): return False
+        if not hasattr(other, "field_boost"): return False
+        return other.__class__ is self.__class__\
+            and other.name == self.name\
+            and other.analyzer == self.analyzer\
+            and other.field_boost == self.field_boost
+
+class IDField(Field):
+    def word_datas(self, value, **kwargs):
+        seen = set()
+        for w in self.analyzer.words(value):
+            seen.add(w)
+            
+        for w in seen:
+            yield (w, None)
+    
+    def write_postvalue(self, stream, data):
+        return
+    
+    def read_postvalue(self, stream):
+        return None
+    
+    def data_to_weight(self, data):
+        return self.field_boost
+    
+class FrequencyField(Field):
+    def word_datas(self, value, **kwargs):
+        seen = defaultdict(int)
+        for w in self.analyzer.words(value):
+            seen[w] += 1
+            
+        return seen.iteritems()
+
+    def write_postvalue(self, stream, data):
+        stream.write_varint(data)
+        
+    def read_postvalue(self, stream):
+        return stream.read_varint()
+    
+    def data_to_weight(self, data):
+        return data * self.field_boost
+
+class DocBoostField(FrequencyField):
+    def word_datas(self, value, doc_boost = 1.0, **kwargs):
+        seen = defaultdict(int)
+        for w in self.analyzer.words(value):
+            seen[w] += 1
+            
+        return [(w, (freq, doc_boost)) for w, freq in seen.iteritems()]
+    
+    def write_postvalue(self, stream, data):
+        stream.write_varint(data[0])
+        stream.write_8bitfloat(data[1]) # , self.options.get("limit", 8)
+        
+    def read_postvalue(self, stream):
+        return (stream.read_varint(), stream.read_8bitfloat()) # , self.options.get("limit", 8)
+    
+    def data_to_weight(self, data):
+        return data[0] * data[1] * self.field_boost
+
+class PositionField(Field):
+    def word_datas(self, value, **kwargs):
+        seen = defaultdict(list)
+        for pos, w in enumerate(self.analyzer.words(value)):
+            seen[w].append(pos)
+            
+        return seen.iteritems()
+    
+    def write_postvalue(self, stream, data):
+        pos_base = 0
+        stream.write_varint(len(data))
+        for pos in data:
+            stream.write_varint(pos - pos_base)
+            pos_base = pos
+            
+    def read_postvalue(self, stream):
+        pos_base = 0
+        pos_list = []
+        for i in xrange(stream.read_varint()): #@UnusedVariable
+            pos_base += stream.read_varint()
+            pos_list.append(pos_base)
+        return pos_list
+    
+    def data_to_weight(self, data):
+        return len(data) * self.field_boost
+    
+    def data_to_positions(self, data):
+        return data
+
+class PositionBoostField(PositionField):
+    def word_datas(self, value, boosts = {}, **kwargs):
+        seen = defaultdict(list)
+        for pos, w in enumerate(self.analyzer.words(value)):
+            seen[w].append((pos, boosts.get(pos, 1.0)))
+            
+        return seen.iteritems()
+    
+    def write_postvalue(self, stream, data):
+        pos_base = 0
+        stream.write_varint(len(data))
+        for pos, boost in data:
+            stream.write_varint(pos - pos_base)
+            stream.write_8bitfloat(boost) # , self.options.get("limit", 8)
+            pos_base = pos
+
+    def read_postvalue(self, stream):
+        freq = stream.read_varint()
+        pos_base = 0
+        pos_list = []
+        for i in xrange(freq): #@UnusedVariable
+            pos_base += stream.read_varint()
+            pos_list.append((pos_base, stream.read_8bitfloat())) # , self.options.get("limit", 8)
+        return (freq, pos_list)
+
+    def data_to_positions(self, data):
+        return [d[0] for d in data[1]]
+
+    def data_to_position_boosts(self, data):
+        return data[1]
+

File src/whoosh/highlight.py

+#===============================================================================
+# Copyright 2008 Matt Chaput
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#    http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#===============================================================================
+
+def find_hilights(tokens, term_set, words_before = 5, words_after = 5, word_limit = 20, fragment_limit = 4):
+    hits = 0
+    
+    queue = []
+    current = []
+    countdown = 0
+    
+    for t in tokens:
+        if t in term_set:
+            if countdown == 0:
+                current += queue
+                queue = []
+                hits += 1
+            
+            current.append((t, ))
+            countdown = words_after
+        
+        elif countdown > 0:
+            current.append(t)
+            countdown -= 1
+            
+            if len(current) > word_limit:
+                countdown = 0
+            
+            if countdown == 0:
+                yield current
+                current = []
+                
+                if fragment_limit is not None and hits > fragment_limit:
+                    break
+        else:
+            if len(queue) >= words_before:
+                queue = queue[1:]
+            queue.append(t)
+    
+    if countdown > 0:
+        yield current
+
+def format_fragment(frag):
+    result = ""
+    for t in frag:
+        if isinstance(t, tuple):
+            result += " <strong>%s</strong>" % t[0]
+        else:
+            result += " " + t
+    if result.startswith(" "):
+        result = result[1:]
+    return result
+
+def associate(text, analyzer):
+    for t in analyzer.tokenizer(text):
+        g = analyzer.filter(t)
+        if g:
+            yield (t, g)
+
+if __name__ == '__main__':
+    import time
+    import analysis
+    
+    f = open("c:/dev/src/houdini/help/documents/nodes/sop/copy.txt", "rb")
+    txt = f.read()
+    f.close()
+    
+    txt = txt[txt.rfind('"""'):]
+    
+    a = analysis.SimpleAnalyzer()
+    
+    for x in associate(txt, a):
+        print x
+    
+    t = time.clock()
+    for frag in find_hilights(a.words(txt), set(["sop"])):
+        pass
+        #print format_fragment(frag)
+    print time.clock() - t
+    

File src/whoosh/index.py

+#===============================================================================
+# Copyright 2007 Matt Chaput
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#    http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#===============================================================================
+
+import re
+from bisect import bisect_right
+
+import reading, writing
+from support.bitvector import BitVector
+
+
+_toc_filename = re.compile("_toc([0-9]+)")
+_segment_filename = re.compile("(_[0-9]+)\\.(dcs|dcx|pst|tix)")
+
+
+class OutOfDateError(Exception): pass
+
+
+def _last_generation(storage):
+    """
+    Utility function to find the most recent
+    generation number of the index. The index will use
+    this to start a new generation, and a reader can use
+    this to check if it's up to date.
+    """
+    
+    max = -1
+    for filename in storage.list():
+        m = _toc_filename.match(filename)
+        if m:
+            num = int(m.group(1))
+            if num > max: max = num
+    return max
+
+def create(storage, schema):
+    """
+    Creates an index in the specified storage object,
+    using the specified field schema.
+    """
+    
+    storage.clean()
+    write_index_file(storage, 0, [], schema, 0)
+    return Index(storage)
+
+def write_index_file(storage, generation, segments, schema, counter):
+    stream = storage.create_file("_toc%s" % generation)
+    stream.write_pickle((segments, schema, counter))
+    stream.close()
+
+def read_index_file(storage, generation):
+    stream = storage.open_file("_toc%s" % generation)
+    segments, schema, counter = stream.read_pickle()
+    stream.close()
+    return segments, schema, counter
+
+
+class Schema(object):
+    def __init__(self, *fields):
+        self.by_number = []
+        self.by_name = {}
+        
+        for field in fields:
+            self.add(field)
+    
+    def has_name(self, name):
+        return self.by_name.has_key(name)
+    
+    def has_field(self, field):
+        return self.has_name(field.name) and self.by_name[field.name] == field
+    
+    def add(self, field):
+        if self.by_name.has_key(field.name):
+            raise Exception("Schema already has a field named %s" % field.name)
+        
+        num = len(self.by_number)
+        field.number = num
+        self.by_number.append(field)
+        self.by_name[field.name] = field
+
+
+class Index(object):
+    def __init__(self, storage):
+        self.storage = storage
+        
+        self.generation = _last_generation(storage)
+        if self.generation >= 0:
+            self.reload()
+    
+    def field_by_name(self, name):
+        if name in self.schema.by_name:
+            return self.schema.by_name[name]
+        else:
+            raise ValueError("No field named '%s'" % name)
+    
+    def doc_count(self):
+        return sum([s.max_doc - s.deleted_count() for s in self.segments])
+    
+    def reader(self):
+        segs = self.segments
+        if len(segs) == 0: return None
+        if len(segs) == 1:
+            return reading.SegmentReader(self, segs[0])
+        else:
+            return reading.MultiSegmentReader(self, segs)
+    
+    def up_to_date(self):
+        return self.generation == _last_generation(self.storage)
+    
+    def next_segment_name(self):
+        self.counter += 1
+        return "_%s" % self.counter
+    
+    def reload(self):
+        segments, self.schema, self.counter = read_index_file(self.storage, self.generation)
+        self._set_segments(segments)
+    
+    def _set_segments(self, segments):
+        self.segments = segments
+        
+        self.doc_offsets = []
+        self.max_doc = 0
+        self.term_count_multiplied = 0.0
+        self.term_count_actual = 0
+        
+        for segment in self.segments:
+            self.doc_offsets.append(self.max_doc)
+            self.max_doc += segment.max_doc
+            self.term_count_multiplied += segment.term_count_multiplied
+            self.term_count_actual += segment.term_count_actual
+    
+    def _document_segment(self, docnum):
+        if len(self.doc_offsets) == 1: return 0
+        return bisect_right(self.doc_offsets, docnum) - 1
+    
+    def delete_document(self, docnum):
+        segmentnum = self._document_segment(docnum)
+        offset = self.doc_offsets[segmentnum]
+        segment = self.segments[segmentnum]
+        segment.delete_document(docnum - offset)
+    
+    def delete_by_term(self, fieldname, text):
+        r = self.reader()
+        tr = r.term_reader()
+        fieldnum = self.field_by_name(fieldname).number
+        try:
+            tr.find_term(fieldnum, text)
+            for docnum, data in tr.postings(): #@UnusedVariable
+                print "Deleting", docnum
+                self.delete_document(docnum)
+            return tr.doc_freq
+        except reading.TermNotFound:
+            return 0
+    
+    def has_deletions(self):
+        for segment in self.segments:
+            if segment.has_deletions(): return True
+        return False
+    
+    def optimize(self):
+        if len(self.segments) < 2 and not self.has_deletions():
+            return
+        w = writing.IndexWriter(self)
+        w.optimize()
+        w.close()
+    
+    def checkpoint(self):
+        if not self.up_to_date():
+            raise OutOfDateError
+        
+        self.generation += 1
+        write_index_file(self.storage, self.generation, self.segments, self.schema, self.counter)
+        self.clean_files()
+    
+    def clean_files(self):
+        storage = self.storage
+        current_segment_names = set([s.name for s in self.segments])
+        
+        for filename in storage.list():
+            m = _toc_filename.match(filename)
+            if m:
+                num = int(m.group(1))
+                if num != self.generation:
+                    storage.delete_file(filename)
+            else:
+                m = _segment_filename.match(filename)
+                if m:
+                    name = m.group(1)
+                    if name not in current_segment_names:
+                        storage.delete_file(filename)
+                else:
+                    storage.delete_file(filename)
+
+
+class Segment(object):
+    def __init__(self, name, max_doc, term_count_multiplied, term_count_actual, deleted = None):
+        self.name = name
+        self.max_doc = max_doc
+        self.term_count_multiplied = term_count_multiplied
+        self.term_count_actual = term_count_actual
+        self.deleted = deleted
+    
+    def __repr__(self):
+        return "%s(\"%s\")" % (self.__class__.__name__, self.name)
+    
+    def has_deletions(self):
+        return self.deleted_count() > 0
+    
+    def deleted_count(self):
+        if self.deleted is None: return 0
+        return self.deleted.count()
+    
+    def delete_document(self, docnum):
+        if self.deleted is None:
+            self.deleted = BitVector(self.max_doc)
+            
+        self.deleted.set(docnum)
+    
+    def is_deleted(self, docnum):
+        if self.deleted is None: return False
+        return self.deleted.get(docnum)
+    
+
+def dump_index(ix):
+    print "Index stored in", ix.storage
+    print "Index has %s segments:" % len(ix.segments)
+    for seg in ix.segments:
+        print "Segment", seg.name
+        reader = reading.SegmentReader(ix, seg)
+        
+        print "  Documents:"
+        docnum = 0
+        for tcm, tca, payload in reader.doc_reader():
+            d = "DEL" if reader.is_deleted(docnum) else "   "
+            print "    ", d, "tcm=", tcm, "tca=", tca, "payload=", repr(payload)
+            docnum += 1
+            
+        print "  Terms:"
+        tr = reader.term_reader()
+        by_number = ix.schema.by_number
+        for fieldnum, text, freq in tr:
+            print "    %s:%s" % (by_number[fieldnum].name, text), "freq=", freq
+            for docnum, data in tr.postings():
+                print "      docnum=", docnum, "data=", repr(data)
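
A rough sketch of how Schema and create() above are meant to be used (illustration
only, not part of the commit). The "storage" object is a placeholder: no concrete
storage class appears in this excerpt, so it is assumed to provide the clean(),
list(), create_file() and open_file() methods that index.py calls:

    from whoosh import analysis, fields, index

    # Field classes and analyzers come from fields.py and analysis.py above.
    schema = index.Schema(fields.IDField("path", analysis.IDAnalyzer()),
                          fields.FrequencyField("content", analysis.StemmingAnalyzer()))

    ix = index.create(storage, schema)   # writes an empty "_toc0" table of contents

    print ix.doc_count()                 # 0 in a brand-new index
    print ix.up_to_date()                # True until another writer checkpoints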

File src/whoosh/postpool.py

+
+#===============================================================================
+# Copyright 2007 Matt Chaput
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#    http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#===============================================================================
+
+import cPickle, struct
+from bisect import insort_left
+
+import structfile
+
+_intSize = struct.calcsize("!i")
+
+
+def encode_posting(fieldNum, text, doc, data):
+    return "".join([struct.pack("!i", fieldNum),
+                    text.encode("utf8"),
+                    chr(0),
+                    struct.pack("!i", doc),
+                    cPickle.dumps(data, -1)
+                    ])
+
+def decode_posting(posting):
+    pointer = 0
+    
+    field_num = struct.unpack("!i", posting[pointer:pointer + _intSize])[0]
+    pointer += _intSize
+    
+    zero = posting.find(chr(0), pointer)
+    text = posting[pointer:zero].decode("utf8")
+    pointer = zero + 1
+    
+    doc = struct.unpack("!i", posting[pointer:pointer + _intSize])[0]
+    pointer += _intSize
+    
+    data = cPickle.loads(posting[pointer:])
+    
+    return field_num, text, doc, data
+
+#===============================================================================
+
+class RunReader(object):
+    def __init__(self, stream, buffer_size):
+        self.stream = stream
+        self.buffer_size = buffer_size
+        self.buffer = []
+        self.pointer = 0
+        self.finished = False
+        
+    def _fill(self):
+        if self.finished:
+            return
+        
+        buffer_size = self.buffer_size
+        self.pointer = 0
+        buffer = self.buffer = []
+        s = 0
+        
+        while s < buffer_size:
+            try:
+                p = self.stream.read_string()
+                buffer.append(p)
+                s += len(p)
+            except structfile.EndOfFile:
+                break
+        
+    def next(self):
+        assert self.pointer <= len(self.buffer)
+        if self.pointer == len(self.buffer):
+            self._fill()
+        if len(self.buffer) == 0:
+            self.finished = True
+            return None
+        
+        r = self.buffer[self.pointer]
+        self.pointer += 1
+        return r
+
+
+class PostingPool(object):
+    def __init__(self, directory, limit = 4 * 1024 * 1024):
+        self.directory = directory
+        self.run_count = 0
+        self.limit = limit
+        self.size = 0
+        self.postings = []
+        self.finished = False
+    
+    def add_posting(self, field_num, text, doc, data):
+        if self.finished:
+            raise Exception("Can't add postings after you finish the pool")
+        
+        posting = encode_posting(field_num, text, doc, data)
+        
+        self.size += len(posting)
+        self.postings.append(posting)
+        
+        if self.size >= self.limit:
+            self.flush_run()
+    
+    def delete_runs(self):
+        for i in xrange(0, self.run_count):
+            self.directory.delete_file("_run%s" % i)
+    
+    def flush_run(self):
+        if self.size > 0:
+            runNum = self.run_count
+            self.run_count += 1
+            
+            self.postings.sort()
+            
+            run = self.directory.create_file("_run%s" % runNum)
+            for p in self.postings:
+                run.write_string(p)
+            run.close()
+            
+            self.postings = []
+            self.size = 0
+    
+    def __iter__(self):
+        # This method does an external merge to yield postings
+        # from the (n > 1) runs built up during indexing and
+        # merging.
+        
+        # Divide up the posting pool's memory limit between the
+        # number of runs plus an output buffer.
+        max_chunk_size = int(self.limit / (self.run_count + 1))
+        
+        run_readers = [RunReader(self.directory.open_file("_run%s" % i),
+                                 max_chunk_size)
+                      for i in xrange(0, self.run_count)]
+        
+        # Initialize a list of terms we're "current"ly
+        # looking at, by taking the first posting from
+        # each buffer.
+        #
+        # The format of the list is
+        # [("encoded_posting", reader_number), ...]
+        #
+        # The list is sorted, and the runs are already
+        # sorted, so the first term in this list should
+        # be the absolute "lowest" term.
+        
+        current = sorted([(r.next(), i)
+                          for i, r
+                          in enumerate(run_readers)])
+        
+        # The number of active readers (readers with more
+        # postings available), initially equal
+        # to the total number of readers/buffers.
+        
+        active = len(run_readers)
+        
+        # Initialize the output buffer, and a variable to
+        # keep track of the output buffer size. This buffer
+        # accumulates postings from the various buffers in
+        # proper sorted order.
+        
+        output = []
+        outputBufferSize = 0
+        
+        while active > 0:
+            # Get the first ("encoded_posting", reader_number)
+            # pair and add it to the output buffer.
+            
+            p, i = current[0]
+            output.append(p)
+            outputBufferSize += len(p)
+            current = current[1:]
+            
+            # If the output buffer is full, "flush" it by yielding
+            # the accumulated postings back to the parent writer
+            # and clearing the output buffer.
+            
+            if outputBufferSize > max_chunk_size:
+                for p in output:
+                    yield decode_posting(p)
+                output = []
+                outputBufferSize = 0
+            
+            # We need to replace the posting we just added to the output
+            # by getting the next posting from the same buffer.
+            
+            if run_readers[i] is not None:
+                # Take the first posting from buffer i and insert it into the
+                # "current" list in sorted order.
+                # The current list must always stay sorted, so the first item
+                # is always the lowest.
+                
+                p = run_readers[i].next()
+                if p:
+                    insort_left(current, (p, i))
+                else:
+                    run_readers[i] = None
+                    active -= 1
+        
+        # If there's still terms in the "current" list after all the
+        # readers are empty, dump them into the output buffer.
+        
+        if len(current) > 0:
+            output.extend([p for p, i in current])
+        
+        # If there's still postings in the output buffer, yield
+        # them all to the parent writer.
+        
+        if len(output) > 0:
+            for p in output:
+                yield decode_posting(p)
+            
+        # And we're done.
+    
+    def finish(self):
+        if self.finished:
+            raise Exception("Called finish on pool more than once")
+        
+        self.flush_run()
+        self.finished = True
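
How PostingPool is meant to be driven, per the merge logic above (illustration only,
not part of the commit). The "directory" object is a placeholder for a storage
object offering create_file(), open_file() and delete_file(), as used by
flush_run() and __iter__():

    from whoosh import postpool

    pool = postpool.PostingPool(directory, limit = 256 * 1024)

    # Postings accumulate in memory and are spilled to sorted "_runN" files
    # whenever the encoded size passes the limit.
    pool.add_posting(0, u"hello", 0, 1)
    pool.add_posting(0, u"world", 0, 1)
    pool.add_posting(1, u"hello", 1, 2)

    # finish() flushes the final run; iterating then merges every run back
    # into a single, globally sorted stream of decoded postings.
    pool.finish()
    for field_num, text, doc, data in pool:
        print field_num, text, doc, data

    pool.delete_runs()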

File src/whoosh/qparser.py

+"""
+== Search query parser ==
+
+This uses the excellent Pyparsing module 
+(http://pyparsing.sourceforge.net/) to parse search query strings
+into nodes from the query module.
+
+This parser handles:
+
+* 'and', 'or', 'not'
+* grouping with parentheses
+* quoted phrase searching
+* wildcards at the end of a search prefix (help*) or embedded in a term (he?p)
+* field-scoped terms and groups (fieldname:term)
+
+TO DO:
+    The parser currently works by FIRST allowing pyparsing to build an
+    abstract syntax tree (AST), and then walking the AST with the
+    eval* functions to replace the AST nodes with query.* objects.
+    This is inefficient and should be replaced by attaching pyparsing
+    parseAction methods on the rules to generate query.* objects
+    directly. However, this isn't straightforward, and I don't have
+    time to work on it now. -- MattChaput
+
+This parser is based on the searchparser example code available at:
+
+http://pyparsing.wikispaces.com/space/showimage/searchparser.py
+
+This code was made available by the authors under the following copyright
+and conditions:
+
+-----
+
+Copyright (c) 2006, Estrate, the Netherlands
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification,
+are permitted provided that the following conditions are met:
+
+* Redistributions of source code must retain the above copyright notice, this
+  list of conditions and the following disclaimer.
+* Redistributions in binary form must reproduce the above copyright notice,
+  this list of conditions and the following disclaimer in the documentation 
+  and/or other materials provided with the distribution.
+* Neither the name of Estrate nor the names of its contributors may be used
+  to endorse or promote products derived from this software without specific
+  prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
+ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 
+LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON 
+ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+CONTRIBUTORS:
+- Steven Mooij
+- Rudolph Froger
+- Paul McGuire
+"""
+
+import logging
+from whoosh.support.pyparsing import \
+CharsNotIn, Group, Combine, Suppress, Regex, OneOrMore, Forward, Word, alphanums, Keyword,\
+Empty, StringEnd
+
+import query
+
+def _makeParser():
+    #wordToken = Word(self.wordChars)
+    wordToken = Word(alphanums)
+    
+    # A plain old word.
+    plainWord = Group(wordToken).setResultsName("Word")
+    
+    # A word ending in a star (e.g. 'render*'), indicating that
+    # the search should do prefix expansion.
+    prefixWord = Group(Combine(wordToken + Suppress('*'))).setResultsName("Prefix")
+    
+    # A wildcard word containing * or ?.
+    wildcard = Group(Regex(r"\w*(?:[\?\*]\w*)+")).setResultsName("Wildcard")
+    
+    # A word in general is either a plain word or a prefix word.
+    generalWord = prefixWord | wildcard | plainWord
+    
+    # A quoted phrase can only contain plain words.
+    quotedPhrase = Group(Suppress('"') + OneOrMore(plainWord) + Suppress('"')).setResultsName("Quotes")
+    
+    expression = Forward()
+    
+    # Parentheses can enclose (group) any expression
+    parenthetical = Group((Suppress("(") + expression + Suppress(")"))).setResultsName("Group")
+
+    # The user can flag that a parenthetical group,
+    # quoted phrase, or word should be searched in a
+    # particular field by prepending 'fn:', where fn is
+    # the name of the field.
+    fieldableUnit = parenthetical | quotedPhrase | generalWord
+    fieldedUnit = Group(Word(alphanums) + Suppress(':') + fieldableUnit).setResultsName("Field")
+    
+    # Units of content
+    unit = fieldedUnit | fieldableUnit
+
+    # A unit may be "not"-ed.
+    operatorNot = Group(Suppress(Keyword("not", caseless=True)) + unit).setResultsName("Not")
+    generalUnit = operatorNot | unit
+
+    andToken = Keyword("and", caseless=True)
+    orToken = Keyword("or", caseless=True)
+    
+    operatorAnd = Group(generalUnit + Suppress(andToken) + expression).setResultsName("And")
+    operatorOr = Group(generalUnit + Suppress(orToken) + expression).setResultsName("Or")
+
+    expression << (OneOrMore(operatorAnd | operatorOr | generalUnit) | Empty())
+    
+    toplevel = Group(expression).setResultsName("Toplevel") + StringEnd()
+    
+    return toplevel.parseString
+
+
+class QueryParser(object):
+    def __init__(self, analyzer, default_field, conjunction = query.And):
+        self.parser = _makeParser()
+        self.conjunction = conjunction
+        self.analyzer = analyzer
+        self.default_field = default_field
+        self.original_words = set()
+        
+    def parse(self, input):
+        ast = self.parser(input)[0]
+        return self.eval(ast, self.default_field)
+    
+    def eval(self, node, fieldname):
+        name = node.getName()
+        return self.__getattribute__(name)(node, fieldname)
+        
+    def Toplevel(self, node, fieldname):
+        return self.conjunction([self.eval(s, fieldname) for s in node])
+
+    def Prefix(self, node, fieldname):
+        return query.Prefix(fieldname, node[0])
+    
+    def Wildcard(self, node, fieldname):
+        return query.Wildcard(fieldname, node[0])
+    
+    def And(self, node, fieldname):
+        return query.And([self.eval(s, fieldname) for s in node])
+    
+    def Or(self, node, fieldname):
+        return query.Or([self.eval(s, fieldname) for s in node])
+    
+    def Word(self, node, fieldname):
+        return query.Term(fieldname, node[0])
+    
+    def Group(self, node, fieldname):
+        return self.conjunction([self.eval(s, fieldname) for s in node])
+    
+    def Not(self, node, fieldname):
+        return query.Not(self.eval(node[0], fieldname))
+    
+    def Field(self, node, fieldname):
+        return self.eval(node[1], node[0])
+    
+    def Quotes(self, node, fieldname):
+        return query.Phrase(fieldname, [n[0] for n in node])
+
+
+if __name__ == '__main__':
+    import analysis
+    ana = analysis.StemmingAnalyzer()
+    
+    qp = QueryParser(ana, "content")
+    
+    b = qp.parse("a?bs*")
+    print b
+    print b.normalize()
+    print
+    
+    b = qp.parse("(a AND b) OR c NOT test")
+    print unicode(b)
+    print unicode(b.normalize())
+    print
+    
+    b = qp.parse(u'hello "there my" friend')
+    print b
+    print b.normalize()
+    print unicode(b)
+    print
+    
+    b = qp.parse(u"NOT funny")
+    print b
+    print b.normalize()
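
For reference, a sketch of the query tree the parser above should produce for a
fielded, grouped query (illustration only, not part of the commit; the exact
reprs depend on query.py):

    from whoosh import analysis, qparser

    qp = qparser.QueryParser(analysis.SimpleAnalyzer(), "content")

    # Expected to evaluate to roughly:
    #   Or([Prefix("content", "render"),
    #       And([Term("title", "cheese"), Not(Term("title", "stilton"))])])
    q = qp.parse(u"render* OR title:(cheese NOT stilton)")
    print unicode(q.normalize())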

File src/whoosh/query.py

+#===============================================================================
+# Copyright 2007 Matt Chaput
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#    http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#===============================================================================
+
+import re
+from collections import defaultdict
+
+import reading
+from fields import has_positions
+
+
+class QueryError(Exception): pass
+
+
+class Query(object):
+    def get_field_num(self, schema, fieldname):
+        try:
+            return schema.by_name[fieldname].number
+        except KeyError:
+            raise QueryError("Unknown field '%s'" % fieldname)
+    
+    def get_field(self, schema, fieldname):
+        try:
+            field = schema.by_name[fieldname]
+            field_num = field.number
+            return field, field_num
+        except KeyError:
+            raise QueryError("Unknown field '%s'" % fieldname)
+
+class SimpleQuery(Query):
+    def __init__(self, fieldname, text, boost = 1.0):
+        self.fieldname = fieldname
+        self.text = text
+        self.boost = boost
+    
+    def normalize(self):
+        return self
+    
+    def __repr__(self):
+        return "%s(%s, %s, boost=%f)" % (self.__class__.__name__,
+                                         repr(self.fieldname), repr(self.text),
+                                         self.boost)
+
+    def run(self, reader, exclude_docs = None):
+        raise NotImplementedError
+
+class Term(SimpleQuery):
+    def __unicode__(self):
+        return "%s:%s" % (self.fieldname, self.text)
+    
+    def run(self, reader, exclude_docs = set()):
+        field_num = self.get_field_num(reader.schema, self.fieldname)
+        try:
+            reader.find_term(field_num, self.text)
+            results = reader.weights(exclude_docs = exclude_docs, boost = self.boost)
+            return dict(results)
+            
+        except reading.TermNotFound:
+            return {}
+
+class Prefix(SimpleQuery):
+    def __unicode__(self):
+        return "%s:%s*" % (self.fieldname, self.text)
+    
+    def run(self, reader, exclude_docs = set()):
+        results = defaultdict(float)
+        
+        field_num = self.get_field_num(reader.schema, self.fieldname)
+        prefix = self.text
+        boost = self.boost
+        
+        reader.seek_term(field_num, prefix)
+        while reader.field_num == field_num and reader.text.startswith(prefix):
+            for docnum, weight in reader.weights(exclude_docs = exclude_docs):
+                results[docnum] += weight
+            reader.next()
+        
+        if boost != 1.0:
+            for docnum in results.iterkeys():
+                results[docnum] *= boost
+                
+        return results
+
+class Wildcard(SimpleQuery):
+    def __init__(self, fieldname, text, boost = 1.0):
+        super(Wildcard, self).__init__(fieldname, text, boost)
+        # "?" matches a single character, "*" matches any run of characters.
+        self.expression = re.compile(self.text.replace(".", "\\.").replace("*", ".*").replace("?", "."))
+    
+        qm = text.find("?")
+        st = text.find("*")
+        if qm < 0 and st < 0:
+            self.prefix = ""
+        elif qm < 0:
+            self.prefix = text[:st]
+        elif st < 0:
+            self.prefix = text[:qm]
+        else:
+            self.prefix = text[:min(st, qm)]
+    
+    def __repr__(self):
+        return "%s(%s, %s, boost=%f)" % (self.__class__.__name__,
+                                         repr(self.fieldname), repr(self.text),
+                                         self.boost)
+    
+    def __unicode__(self):
+        return "%s:%s" % (self.fieldname, self.text)
+    
+    def run(self, reader, exclude_docs = set()):
+        results = defaultdict(float)
+        
+        prefix = self.prefix
+        field, field_num = self.get_field(reader.schema, self.fieldname)
+        exp = self.expression
+        boost = self.boost
+        
+        reader.seek_term(field_num, prefix)
+        try:
+            while reader.field_num == field_num and reader.text.startswith(prefix) and exp.match(reader.text):
+                for docnum, weight in reader.weights(exclude_docs = exclude_docs):
+                    results[docnum] += weight
+                reader.next()
+        except StopIteration:
+            return {}
+        
+        if boost != 1.0:
+            for docnum in results.iterkeys():
+                results[docnum] *= boost
+                
+        return results
+
+class TermRange(Query):
+    def __init__(self, fieldname, start, end, boost = 1.0):
+        self.fieldname = fieldname
+        self.start = start
+        self.end = end
+        self.boost = boost
+    
+    def __repr__(self):
+        return '%s(%s, %s, %s, boost=%f)' % (self.__class__.__name__,
+                                             repr(self.fieldname),
+                                             repr(self.start), repr(self.end),
+                                             self.boost)
+    
+    def __unicode__(self):
+        return u"%s:%s..%s" % (self.fieldname, self.start, self.end)
+    
+    def normalize(self):
+        return self
+    
+    def run(self, reader, exclude_docs = set()):
+        results = defaultdict(float)
+        field, field_num = self.get_field(reader.schema, self.fieldname)
+        boost = self.boost
+        
+        reader.seek_term(field_num, self.start)
+        try:
+            while reader.field_num == field_num and reader.text <= self.end:
+                for docnum, weight in reader.weights(exclude_docs = exclude_docs):
+                    results[docnum] += weight
+                reader.next()
+        except StopIteration:
+            return {}
+            
+        if boost != 1.0:
+            for docnum in results.iterkeys():
+                results[docnum] *= boost
+                
+        return results
+
+class CompoundQuery(Query):
+    def __init__(self, subqueries, notqueries = None, boost = 1.0):
+        if notqueries is not None:
+            self.notqueries = notqueries
+            self.subqueries = subqueries
+        else:
+            subqs = []
+            notqs = []
+            for q in subqueries:
+                if isinstance(q, Not):
+                    notqs.append(q)
+                else:
+                    subqs.append(q)
+            
+            self.subqueries = subqs
+            self.notqueries = notqs
+        self.boost = boost
+        
+    def __repr__(self):
+        return '%s(%s, notqueries=%s, boost=%f)' % (self.__class__.__name__,
+                                                    repr(self.subqueries),
+                                                    repr(self.notqueries),
+                                                    self.boost)
+
+    def _uni(self, op):
+        r = u"("
+        r += op.join([unicode(s) for s in self.subqueries])
+        if len(self.notqueries) > 0:
+            r += " " + " ".join([unicode(s) for s in self.notqueries])
+        r += u")"
+        return r
+
+    def normalize(self):
+        if len(self.subqueries) == 1 and len(self.notqueries) == 0:
+            return self.subqueries[0].normalize()
+        
+        subqs = []
+        for s in self.subqueries:
+            s = s.normalize()
+            if isinstance(s, self.__class__):
+                subqs += s.subqueries
+            else:
+                subqs.append(s)
+                
+        notqs = []
+        for s in self.notqueries:
+            s = s.normalize()
+            notqs.append(s)
+        
+        return self.__class__(subqs, notqueries = notqs, boost = self.boost)
+
+class And(CompoundQuery):
+    def __unicode__(self):
+        return self._uni(" AND ")
+    
+    def run(self, reader, exclude_docs = set()):
+        if len(self.subqueries) == 0: return {}
+        
+        results = None
+        
+        for query in self.notqueries:
+            # Use a non-mutating union so the shared default set isn't modified in place.
+            exclude_docs = exclude_docs | set(query.run(reader, exclude_docs = exclude_docs).iterkeys())
+        
+        for query in self.subqueries:
+            r = query.run(reader, exclude_docs = exclude_docs)
+            if len(r) == 0:
+                return {}
+            
+            # Initialize
+            if results is None:
+                results = dict([(docnum, value)
+                                for docnum, value in r.iteritems()
+                                if docnum not in exclude_docs])
+            # Subsequent loops
+            else:
+                if len(results) < len(r):
+                    a = results
+                    b = r
+                else:
+                    a = r
+                    b = results
+                
+                for docnum in a.keys():
+                    if docnum in b:
+                        a[docnum] += b[docnum]
+                    else:
+                        del a[docnum]
+                results = a
+                
+            if len(results) == 0:
+                return results
+        
+        if self.boost != 1.0:
+            boost = self.boost
+            for docnum in results.iterkeys():
+                results[docnum] *= boost
+        
+        return results
+    
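+# To illustrate the intersection step in And.run: if the first subquery
+# returns {1: 0.5, 2: 0.5, 3: 0.5} and the second returns {2: 1.0, 4: 1.0},
+# the loop walks the smaller dict, keeps only doc 2 (present in both), and
+# sums its weights, leaving {2: 1.5} before any boost is applied.
+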
+class Or(CompoundQuery):
+    def __unicode__(self):
+        return self._uni(" OR ")
+    
+    def run(self, reader, exclude_docs = set()):
+        if len(self.subqueries) == 0: return {}
+        
+        results = defaultdict(float)
+        boost = self.boost
+        
+        # As in And.run, union into a fresh set rather than mutating the
+        # shared default exclude_docs set in place.
+        for query in self.notqueries:
+            exclude_docs = exclude_docs | set(query.run(reader, exclude_docs = exclude_docs).iterkeys())
+        
+        for query in self.subqueries:
+            r = query.run(reader, exclude_docs = exclude_docs)
+            for docnum in r:
+                if docnum not in exclude_docs:
+                    results[docnum] += r[docnum] * boost
+                    
+        return results
+
+class Not(Term):
+    def __init__(self, query, boost = 1.0):
+        self.query = query
+        self.boost = boost
+        
+    def __repr__(self):
+        return "%s(%s, boost=%f)" % (self.__class__.__name__,
+                                     repr(self.query),
+                                     self.boost)
+    
+    def __unicode__(self):
+        return "NOT " + unicode(self.query)
+    
+    def normalize(self):
+        return self
+    
+    def run(self, reader, exclude_docs = None):
+        return self.query.run(reader)
+
+#class Combination(Query):
+#    def __init__(self, required, optional, forbidden, boost = 1.0):
+#        self.required = required
+#        self.optional = optional
+#        self.forbidden = forbidden
+#        self.boost = boost
+#        
+#    def __repr__(self):
+#        return "%s(%s, %s, %s, boost=%f)" % (self.__class__.__name__,
+#                                             self.required,
+#                                             self.optional,
+#                                             self.forbidden,
+#                                             self.boost)
+#    
+#    def run(self, reader, exclude_docs = None):
+#        if not exclude_docs:
+#            exclude_docs = set()
+#        if self.forbidden:
+#            exclude_docs |= set(self.forbidden.run(reader, exclude_docs = exclude_docs).keys())
+#        
+#        boost = self.boost
+#        
+#        if self.required:
+#            results = self.required.run(reader, exclude_docs = exclude_docs)
+#            if boost != 1.0:
+#                for docnum in results.iterkeys():
+#                    results[docnum] *= boost
+#        
+#        if self.optional:
+#            for docnum, value in self.optional.run(reader, exclude_docs = exclude_docs):
+#                if (not self.required or docnum in results) and docnum not in exclude_docs:
+#                    results[docnum] += value * boost
+#        
+#        return results
+
+class Phrase(Query):
+    def __init__(self, fieldname, words, boost = 1.0, slop = 1):
+        for w in words:
+            if not isinstance(w, unicode):
+                raise ValueError("'%s' is not unicode" % w)
+        
+        self.fieldname = fieldname
+        self.words = words
+        self.boost = boost
+        self.slop = slop
+        
+    def __repr__(self):
+        return "%s(%s, %s, boost=%f)" % (self.__class__.__name__,
+                                         repr(self.fieldname),
+                                         repr(self.words),
+                                         self.boost)
+        
+    def __unicode__(self):
+        return u'%s:"%s"' % (self.fieldname, " ".join(self.words))
+    
+    def normalize(self):
+        return self
+    
+    def run(self, reader, exclude_docs = set()):
+        if len(self.words) == 0: return {}
+        
+        field, field_num = self.get_field(reader.schema, self.fieldname)
+        if not has_positions(field):
+            raise QueryError("'%s' field does not support phrase searching" % self.fieldname)
+        
+        current = {}
+        for w in self.words:
+            try:
+                reader.find_term(field_num, w)
+                
+                if current == {}:
+                    for docnum, positions in reader.positions():
+                        if docnum not in exclude_docs:
+                            current[docnum] = positions
+                else:
+                    for docnum, positions in reader.positions():
+                        if docnum in current:
+                            new_poses = []
+                            for pos in current[docnum]:
+                                if pos + 1 in positions:
+                                    new_poses.append(pos + 1)
+                            
+                            if len(new_poses) > 0:
+                                current[docnum] = new_poses
+                            else:
+                                del current[docnum]
+                
+            except reading.TermNotFound:
+                return {}
+
+        if current:
+            return dict([(docnum, len(positions) * self.boost)
+                         for docnum, positions in current.iteritems()])
+        else:
+            return {}
+
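+# A rough end-to-end sketch of how these query classes fit together (the
+# field name is made up, t1 and t2 stand for Term queries as defined earlier
+# in this module, and reader is a TermReader from whoosh.reading):
+#
+#     q = And([t1, Or([t2, Phrase(u"content", [u"hello", u"there"])])])
+#     results = q.normalize().run(reader)  # dict mapping docnum -> score
+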
+
+
+
+
+
+
+
+
+
+
+
+        
+
+

File src/whoosh/reading.py

+#===============================================================================
+# Copyright 2007 Matt Chaput
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#    http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#===============================================================================
+
+from bisect import bisect_left, insort_left
+
+import structfile
+
+_unsignedlong_size = structfile._unsignedlong_size
+_int_size = structfile._int_size
+
+class TermNotFound(Exception): pass
+
+class SegmentReader(object):
+    def __init__(self, index, segment):
+        self.storage = index.storage
+        self.schema = index.schema
+        self.segment = segment
+        
+        self.doc_index = self.storage.open_file(segment.name + ".dcx")
+        self.doc_file = self.storage.open_file(segment.name + ".dcs")
+        self.term_index = self.storage.open_file(segment.name + ".tix")
+        self.post_file = self.storage.open_file(segment.name + ".pst")
+    
+    def close(self):
+        self.doc_index.close()
+        self.doc_file.close()
+        self.term_index.close()
+        self.post_file.close()
+    
+    def has_deletions(self):
+        return self.segment.has_deletions()
+    
+    def is_deleted(self, docnum):
+        return self.segment.is_deleted(docnum)
+    
+    def doc_reader(self):
+        return DocReader(self.doc_index, self.doc_file)
+    
+    def term_reader(self):
+        return TermReader(self.segment, self.schema, self.term_index, self.post_file)
+    
+    def run_query(self, q):
+        return q.run(self.term_reader())
+
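+# A rough usage sketch (assuming an open index object with the .storage and
+# .schema attributes used by __init__ above, and a segment descriptor):
+#
+#     reader = SegmentReader(ix, segment)
+#     scores = reader.run_query(some_query)  # {docnum: score, ...}
+#     reader.close()
+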
+class MultiSegmentReader(SegmentReader):
+    def __init__(self, index, segments):
+        self.index = index
+        self.segments = segments
+        self.readers = None
+        
+    def close(self):
+        if self.readers:
+            for r in self.readers: r.close()
+    
+    def _get_readers(self):
+        if not self.readers:
+            self.readers = [SegmentReader(self.index, s) for s in self.segments]
+        return self.readers
+    
+    def doc_reader(self):
+        return MultiDocReader([s.doc_reader() for s in self._get_readers()])
+    
+    def term_reader(self):
+        return MultiTermReader([s.term_reader() for s in self._get_readers()],
+                               self.index.doc_offsets)                           
+
+
+class DocReader(object):
+    def __init__(self, doc_index, doc_file):
+        self.doc_index = doc_index
+        self.doc_file = doc_file
+        
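+    # Each document record begins with a control byte: 0 means a single int
+    # holds both the multiplied and actual term counts; anything else means a
+    # float (multiplied count) followed by an int (actual count). A pickled
+    # payload follows in either case.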
+    def next(self):
+        control = self.doc_file.read_byte()
+        if control == 0:
+            self.term_count_multiplied = self.term_count_actual = self.doc_file.read_int()
+        else:
+            self.term_count_multiplied = self.doc_file.read_float()
+            self.term_count_actual = self.doc_file.read_int()
+        
+        self.payload = self.doc_file.read_pickle()
+        return (self.term_count_multiplied, self.term_count_actual, self.payload)
+    
+    def reset(self):
+        self.doc_file.seek(0)
+        
+    def __iter__(self):
+        self.reset()
+        try:
+            while True:
+                yield self.next()
+        except structfile.EndOfFile:
+            raise StopIteration
+        
+    def __getitem__(self, docnum):
+        self.doc_index.seek(docnum * _unsignedlong_size)
+        self.doc_file.seek(self.doc_index.read_ulong())
+        return self.next()
+    
+class MultiDocReader(object):
+    def __init__(self, doc_readers):
+        self.doc_readers = doc_readers
+        self.term_count_multiplied = None
+        self.term_count_actual = None
+        self.payload = None
+        self.current = 0
+        self.reset()
+    
+    def reset(self):
+        for r in self.doc_readers:
+            r.reset()
+        self.current = 0
+    
+    def next(self):
+        if self.current >= len(self.doc_readers):
+            return
+        
+        try:
+            self.term_count_multiplied, self.term_count_actual, self.payload = self.doc_readers[self.current].next()
+            return (self.term_count_multiplied, self.term_count_actual, self.payload)
+        except structfile.EndOfFile:
+            self.current += 1
+            if self.current >= len(self.doc_readers):
+                return
+            return self.next()
+
+class TermReader(object):
+    def __init__(self, segment, schema, term_index, post_file):
+        self.segment = segment
+        self.schema = schema
+        self.term_index = term_index
+        self.post_file = post_file
+        
+        self.version = None
+        self.postfile_offset = None
+        
+        self.index_skips()
+        self.reset()
+    
+    def reset(self):
+        term_index = self.term_index
+        term_index.seek(0)
+        
+        self.version = term_index.read_int()
+        assert self.version == -100
+        
+        term_index.read_int() # Reserved
+        term_index.read_int() # Reserved
+        term_index.read_int() # Reserved
+        term_index.read_int() # Reserved
+        
+        self.state = 0 # 0 = on skip pointer, 1 = in block, -1 = last block
+        self.next_block = 0
+    
+    def index_skips(self):
+        self.reset()
+        term_index = self.term_index
+        
+        skiplist = None #[(0, '', term_index.tell())]
+        
+        while True:
+            here = term_index.tell()
+            pointer = term_index.read_ulong()
+            
+            text = term_index.read_string()
+            field_num = term_index.read_varint()
+            
+            if skiplist is None: skiplist = [(0, '', here)]
+            skiplist.append((field_num, text, here))
+            
+            if pointer == 0:
+                break
+            term_index.seek(pointer)
+        
+        self.skiplist = skiplist
+    
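+    # index_skips() records a (field_num, text, file position) entry at each
+    # skip pointer in the term index; seek_term() bisects into that skip list
+    # and then scans forward term by term until it reaches (or passes) the
+    # requested term.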
+    def find_term(self, field_num, text):
+        try:
+            #print "find_term:", field_num, text
+            self.seek_term(field_num, text)
+            #print "       at:", self.field_num, self.text 
+            if not(self.field_num == field_num and self.text == text):
+                raise TermNotFound
+        except structfile.EndOfFile:
+            raise TermNotFound
+    
+    def seek_term(self, field_num, text):
+        skipindex = bisect_left(self.skiplist, (field_num, text)) - 1
+        assert skipindex >= 0
+        self.term_index.seek(self.skiplist[skipindex][2])
+        self.state = 0
+        self.next()
+        
+        if not (self.field_num == field_num and self.text == text):
+            while self.field_num < field_num or (self.field_num == field_num and self.text < text):
+                self.next()
+    
+    def __iter__(self):
+        try:
+            while True:
+                yield self.next()
+        except structfile.EndOfFile:
+            raise StopIteration
+    
+    def next(self):
+        term_index = self.term_index
+        
+        if self.state == 1 and term_index.tell() == self.next_block:
+            self.state = 0
+            
+        if self.state == 0:
+            self.next_block = term_index.read_ulong()
+            if self.next_block == 0:
+                self.state = -1
+            else:
+                self.state = 1
+                
+        self.text = term_index.read_string().decode("utf8")
+        self.field_num = term_index.read_varint()
+        self.current_field = self.schema.by_number[self.field_num]
+        
+        self.doc_freq = term_index.read_varint()
+        self.postfile_offset = term_index.read_ulong()
+    
+        return (self.field_num, self.text, self.doc_freq)
+    
+    def postings(self, exclude_docs = set()):
+        is_deleted = self.segment.is_deleted
+        
+        post_file = self.post_file
+        post_file.seek(self.postfile_offset)
+        
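+        # Doc numbers are delta-encoded in the postings: each entry stores
+        # the gap from the previous doc number, so accumulate as we read.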
+        docnum = 0
+        for i in xrange(0, self.doc_freq): #@UnusedVariable
+            delta = post_file.read_varint()
+            docnum += delta
+            
+            data = self.current_field.read_postvalue(post_file)
+            
+            if not is_deleted(docnum) and not docnum in exclude_docs:
+                yield docnum, data
+    
+    def weights(self, exclude_docs = set(), boost = 1.0):
+        field = self.current_field
+        for docnum, data in self.postings(exclude_docs = exclude_docs):
+            yield (docnum, field.data_to_weight(data) * boost)
+
+    def positions(self, exclude_docs = set(), boost = 1.0):
+        field = self.current_field
+        for docnum, data in self.postings(exclude_docs = exclude_docs):
+            yield (docnum, field.data_to_positions(data))
+
+class MultiTermReader(TermReader):
+    def __init__(self, term_readers, doc_offsets):
+        self.term_readers = term_readers
+        self.doc_offsets = doc_offsets
+        # All sub-readers were built from the same index, so borrow the
+        # schema from the first one for field lookups in next().
+        self.schema = term_readers[0].schema
+        self.reset()
+        
+    def reset(self):
+        self.waitlist = None
+        self.current_readers = None
+        
+        for r in self.term_readers:
+            r.reset()
+    
+    def index_skips(self):
+        raise NotImplementedError
+    
+    def seek_term(self, field_num, text):
+        for r in self.term_readers:
+            r.reset()
+            r.seek_term(field_num, text)
+            
+    def __iter__(self):
+        while True:
+            # This isn't really an infinite loop... it will
+            # eventually pass through a StopIteration exception
+            # from the next() method.
+            
+            yield self.next()
+            
+    def next(self):
+        # This method does a merge sort of the terms coming off
+        # the sub-readers.
+
+        # The waiting list is a list of the "head" term from each
+        # sub-reader, so we can sort them. On the first call to
+        # next(), this is None.
+        waitlist = self.waitlist
+        
+        # On first run, we need to fill in the waiting list. We do
+        # this by making the code that follows, which replenishes
+        # the list, replenish from ALL readers.
+        if waitlist is None:
+            waitlist = []
+            self.current_readers = self.term_readers[:]
+        
+        # Replace the terms taken in the last iteration with new
+        # terms. On the first call to next(), thanks to the code
+        # above, this initializes the waiting list.
+        
+        if self.current_readers:
+            for r in self.current_readers:
+                try:
+                    field_num, text, doc_count = r.next()
+                    insort_left(waitlist, (field_num, text, doc_count, r))
+                except structfile.EndOfFile:
+                    pass # This sub-reader is exhausted; drop it from the merge
+        
+        # If every sub-reader is exhausted, the merge is finished.
+        if not waitlist:
+            raise StopIteration
+        
+        # Take the lowest term from the head of the waiting list.
+        current = waitlist[0]
+        
+        # Set this reader's attributes to those of the term
+        
+        field_num, text, doc_count = current[0], current[1], current[2]
+        self.field_num = field_num
+        self.text = text
+        self.current_field = self.schema.by_number[field_num]
+        
+        # We need to calculate the doc_count (doc frequency) by
+        # adding up the doc counts from each reader with this
+        # term. (If several readers include the same term, each
+        # copy of the term should be at the head of the waiting list
+        # right now).
+        
+        right = 1
+        while right < len(waitlist) and waitlist[right][0] == field_num and waitlist[right][1] == text:
+            doc_count += waitlist[right][2]
+            right += 1
+        
+        self.doc_count = doc_count
+        
+        # Remember the readers that have the "current" term.
+        # We'll need to iterate through them to get postings.
+        # And on the next call to next(), we'll need to get
+        # new terms from them to replenish the waiting list.
+        
+        self.current_readers = [r for _fnum, _text, _dcount, r
+                                in waitlist[:right]]
+        
+        # Now that the readers are recorded, we can remove
+        # the (one or more copies of the) current term from the
+        # waiting list.
+        self.waitlist = waitlist[right:]
+        
+        return (field_num, text, doc_count)
+        
+    def postings(self, exclude_docs = set()):
+        # doc_offsets is aligned with term_readers, so look up each current
+        # reader's offset by its position in the full term_readers list.
+        for r in self.current_readers:
+            offset = self.doc_offsets[self.term_readers.index(r)]
+            for docnum, data in r.postings():
+                docnum += offset
+                if docnum in exclude_docs:
+                    continue
+                self.docnum = docnum
+                self.data = data
+                yield docnum, data
+                
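+# A small worked example of the merge in MultiTermReader.next() (a sketch,
+# not taken from a real index): if two sub-readers both have (1, u"apple")
+# as their next term, with doc counts 2 and 3, both tuples sit at the head
+# of the waiting list; next() reports the term once with doc_count 2 + 3 = 5
+# and remembers both readers so that postings() can chain their postings,
+# offsetting each reader's doc numbers by its entry in doc_offsets.
+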
+    
+
+
+
+