Commits

Shoji KUMAGAI committed e90e10c

2010.10.11 shkumagai Split a file to some files.

Comments (0)

Files changed (6)

source/chapter01/a_first_take_at_building_an_inverted_index.rst

+.. _a_first_take_an_building_an_inverted_index:
+
+A first take at building an inverted index
+==========================================
+
+.. To gain the speed benefits of indexing at retrieval time, we have to build the index in
+   advance. The major steps in this are:
+
+1. Collect the documents to be indexed:
+Friends, Romans, countrymen. So let it be with Caesar ...
+
+2. Tokenize the text, turning each document into a list of tokens:
+Friends
+Romans
+countrymen
+So
+
+3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms:	. . .
+friend
+roman
+countryman
+so
+
+4. Index the documents that each term occurs in by creating an inverted in- dex, consisting of a dictionary and postings.
+
+.. INVERTED INDEX
+.. DICTIONARY VOCABULARY LEXICON
+.. POSTING POSTINGS LIST
+.. POSTINGS
+
+Brutus → [1][2][4][11][31][45][173][174]
+Caesar → [1][2][4][5][6][16][57][132] ...
+Calpurnia → [2][31][54][101]
+
+Dictionary	Postings 
+
+◮ Figure 1.3	The two parts of an inverted index. The dictionary is commonly kept
+in memory, with pointers to each postings list, which is stored on disk.
+
+.. DOCID
+.. SORTING
+.. DOCUMENT FREQUENCY
+
+.. We will define and discuss the earlier stages of processing, that is, steps 1–3, in
+   Section 2.2 (page 22). Until then you can think of tokens and normalized tokens as also
+   loosely equivalent to words. Here, we assume that the first 3 steps have already been
+   done, and we examine building a basic inverted index by sort-based indexing.
+
+.. Within a document collection, we assume that each document has a unique serial number,
+   known as the document identifier (docID). During index construction, we can simply assign
+   successive integers to each new document when it is first encountered.
+   The input to indexing is a list of normalized tokens for each document, which we can
+   equally think of as a list of pairs of term and docID, as in Figure 1.4. The core indexing
+   step is sorting this list so that the terms are alphabetical, giving us the representation
+   in the middle column of Figure 1.4. Multiple occurrences of the same term from the same
+   document are then merged. [5]_ Instances of the same term are then grouped, and the result
+   is split into a dictionary and postings, as shown in the right column of Figure 1.4.
+   Since a term generally occurs in a number of documents, this data organization already
+   reduces the storage requirements of the index. The dictionary also records some statistics,
+   such as the number of documents which contain each term (the document frequency, which is
+   here also the length of each postings list). This information is not vital for a basic
+   Boolean search engine, but it allows us to improve the efficiency of the search engine at
+   query time, and it is a statistic later used in many ranked retrieval models. The postings
+   are secondarily sorted by docID. This provides the basis for efficient query processing.
+   This inverted index structure is es- sentially without rivals as the most efficient
+   structure for supporting ad hoc text search.
+
+.. In the resulting index, we pay for storage of both the dictionary and the postings lists.
+   The latter are much larger, but the dictionary is commonly kept in memory, while postings
+   lists are normally kept on disk, so the size of each is important, and in Chapter 5 we will
+   examine how each can be optimized for storage and access efficiency. What data structure
+   should be used for a postings list? A fixed length array would be wasteful as some words
+   occur in many documents, and others in very few. For an in-memory postings list, two good
+   alternatives are singly linked lists or variable length arrays. Singly linked lists allow
+   cheap insertion of documents into postings lists (following updates, such as when recrawling
+   the web for updated documents), and naturally extend to more advanced indexing strategies
+   such as skip lists (Section 2.3), which require additional pointers. Variable length arrays
+   win in space requirements by avoiding the overhead for pointers and in time requirements
+   because their use of contiguous memory increases speed on modern processors with memory
+   caches. Extra pointers can in practice be encoded into the lists as offsets.
+   If updates are relatively infrequent, variable length arrays will be more compact and faster
+   to traverse. We can also use a hybrid scheme with a linked list of fixed length arrays for
+   each term. When postings lists are stored on disk, they are stored (perhaps compressed)
+   as a contiguous run of postings without explicit pointers (as in Figure 1.3), so as to
+   minimize the size of the postings list and the number of disk seeks to read a postings list
+   into memory.
+
+
+.. Doc 1
+   I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.
+   Doc 2
+   So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:
+
+.. Figure 1.4 Building an index by sorting and grouping. The sequence of terms in each document,
+   tagged by their documentID (left) is sorted alphabetically (mid- dle). Instances of the same
+   term are then grouped by word and then by documentID. The terms and documentIDs are then
+   separated out (right). The dictionary stores the terms, and has a pointer to the postings list
+   for each term. It commonly also stores other summary information such as, here, the document
+   frequency of each term. We use this information for improving query time efficiency and,
+   later, for weighting in ranked retrieval models. Each postings list stores the list of
+   documents in which a term occurs, and may store other information such as the term frequency
+   (the frequency of each term in each document) or the position(s) of the term in each document.
+
+
+.. ?
+.. Exercise 1.1  [⋆]
+.. Draw the inverted index that would be built for the following document collection.
+   (See Figure 1.3 for an example.)
+
+   Doc 1 new home sales top forecasts
+   Doc 2 home sales rise in july
+   Doc 3 increase in home sales in july
+   Doc 4 july new home sales rise
+
+.. Exercise 1.2  [⋆]
+.. Consider these documents:
+
+   Doc 1 breakthrough drug for schizophrenia
+   Doc 2 new schizophrenia drug
+   Doc 3 new approach for treatment of schizophrenia
+   Doc 4 new hopes for schizophrenia patients
+
+
+a. Drawtheterm-documentincidencematrixforthisdocumentcollection.
+b. Drawtheinvertedindexrepresentationforthiscollection,asinFigure1.3(page7).
+
+
+.. Figure 1.5  Intersecting the postings lists for Brutus and Calpurnia from Figure 1.3.
+
+.. Exercise 1.3  [⋆]
+.. For the document collection shown in Exercise 1.2, what are the returned results for
+   these queries:
+   a. schizophrenia AND drug
+   b. for AND NOT(drug OR approach)
+
+.. [5] Unix users can note that these steps are similar to use of the sort and then uniq commands.
+
+.. END

source/chapter01/an_example_information_retrieval_problem.rst

 .. _an_example_information_retrieval_problem:
 
-
+An example information retrieval problem
 ========================================
 
 .. A fat book which many people own is Shakespeare’s Collected Works. Suppose you wanted
    They might be individual memos or chapters of a book (see Section 2.1.2 (page 20) for
    further discussion). We will refer to the group of documents over which we perform retrieval
    as the (document) collection. It is sometimes also referred to as a corpus (a body of texts).
-   Suppose each document is about 1000 words long (2–3 book pages). If
+   Suppose each document is about 1000 words long (2–3 book pages). If we assume an average of
+   6 bytes per word including spaces and punctuation, then this is a document collection about
+   6 GB in size. Typically, there might be about M = 500,000 distinct terms in these documents.
+   There is nothing special about the numbers we have chosen, and they might vary by an order
+   of magnitude or more, but they give us some idea of the dimensions of the kinds of problems
+   we need to handle. We will discuss and model these size assumptions in Section 5.1 (page 86).
 
-Antony and Cleopatra
-Julius Caesar
-Antony 1 1 0 0 0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia 0 1 0 0 0 0 Cleopatra 1 0 0 0 0 0 mercy 1 0 1 1 1 1 worser 1 0 1 1 1 0 ...
 .. AD HOC RETRIEVAL
 .. INFORMATION NEED QUERY
 .. RELEVANCE
 .. PRECISION
 .. RECALL
 
-Antony and Cleopatra, Act III, Scene ii
-Agrippa [Aside to Domitius Enobarbus]:	Why, Enobarbus, When Antony found Julius Caesar dead,
-He cried almost to roaring; and he wept When at Philippi he found Brutus slain.
+.. Figure 1.2 Results from Shakespeare for the query Brutus AND Caesar AND NOT Calpurnia.
 
-Hamlet, Act III, Scene ii
-Lord Polonius: I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.
 
-◮ Figure 1.2	Results from Shakespeare for the query Brutus AND Caesar AND NOT
-Calpurnia.
 
-.. we assume an average of 6 bytes per word including spaces and punctuation, then this is a document collection about 6 GB in size. Typically, there might be about M = 500,000 distinct terms in these documents. There is nothing special about the numbers we have chosen, and they might vary by an order of magnitude or more, but they give us some idea of the dimensions of the kinds of problems we need to handle. We will discuss and model these size assumptions in Section 5.1 (page 86).
+.. Our goal is to develop a system to address the ad hoc retrieval task. This is the most
+   standard IR task. In it, a system aims to provide documents from within the collection
+   that are relevant to an arbitrary user information need, communicated to the system by
+   means of a one-off, user-initiated query. An information need is the topic about which
+   the user desires to know more, and is differentiated from a query, which is what the user
+   conveys to the com- puter in an attempt to communicate the information need. A document
+   is relevant if it is one that the user perceives as containing information of value with
+   respect to their personal information need. Our example above was rather artificial in
+   that the information need was defined in terms of particular words, whereas usually a user
+   is interested in a topic like “pipeline leaks” and would like to find relevant documents
+   regardless of whether they precisely use those words or express the concept with other
+   words such as pipeline rupture. To assess the effectiveness of an IR system (i.e.,
+   the quality of its search results), a user will usually want to know two key statistics
+   about the system’s returned results for a query:
 
-.. Our goal is to develop a system to address the ad hoc retrieval task. This is the most standard IR task. In it, a system aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query. An information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the com- puter in an attempt to communicate the information need. A document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need. Our example above was rather artificial in that the information need was defined in terms of par- ticular words, whereas usually a user is interested in a topic like “pipeline leaks” and would like to find relevant documents regardless of whether they precisely use those words or express the concept with other words such as pipeline rupture. To assess the effectiveness of an IR system (i.e., the quality of its search results), a user will usually want to know two key statistics about the system’s returned results for a query:
+  *Precision:* What fraction of the returned results are relevant to the information need?
 
-  Precision: What fraction of the returned results are relevant to the informa- tion need?
-  Recall: What fraction of the relevant documents in the collection were re- turned by the system?
+  *Recall:* What fraction of the relevant documents in the collection were returned
+  by the system?
 
-.. Detailed discussion of relevance and evaluation measures including preci- sion and recall is found in Chapter 8.
+.. Detailed discussion of relevance and evaluation measures including precision and recall
+   is found in Chapter 8.
 
-.. We now cannot build a term-document matrix in a naive way. A 500K × 1M matrix has half-a-trillion 0’s and 1’s – too many to fit in a computer’s memory. But the crucial observation is that the matrix is extremely sparse, that is, it has few non-zero entries. Because each document is 1000 words long, the matrix has no more than one billion 1’s, so a minimum of 99.8% of the cells are zero. A much better representation is to record only the things that do occur, that is, the 1 positions.
+.. We now cannot build a term-document matrix in a naive way. A 500K × 1M matrix has
+   half-a-trillion 0’s and 1’s – too many to fit in a computer’s memory. But the crucial
+   observation is that the matrix is extremely sparse, that is, it has few non-zero entries.
+   Because each document is 1000 words long, the matrix has no more than one billion 1’s,
+   so a minimum of 99.8% of the cells are zero. A much better representation is to record
+   only the things that do occur, that is, the 1 positions.
 
-.. This idea is central to the first major concept in information retrieval, the inverted index. The name is actually redundant: an index always maps back from terms to the parts of a document where they occur. Nevertheless, in- verted index, or sometimes inverted file, has become the standard term in infor- mation retrieval. [3]_ The basic idea of an inverted index is shown in Figure 1.3. We keep a dictionary of terms (sometimes also referred to as a vocabulary or lexicon; in this book, we use dictionary for the data structure and vocabulary for the set of terms). Then for each term, we have a list that records which documents the term occurs in. Each item in the list – which records that a term appeared in a document (and, later, often, the positions in the docu- ment) – is conventionally called a posting. [4]_	The list is then called a postings list (or inverted list), and all the postings lists taken together are referred to as the postings. The dictionary in Figure 1.3 has been sorted alphabetically and each postings list is sorted by document ID. We will see why this is useful in Section 1.3, below, but later we will also consider alternatives to doing this (Section 7.1.5).
+.. This idea is central to the first major concept in information retrieval, the inverted
+   index. The name is actually redundant: an index always maps back from terms to the parts
+   of a document where they occur. Nevertheless, inverted index, or sometimes inverted file,
+   has become the standard term in information retrieval. [3]_ The basic idea of an inverted
+   index is shown in Figure 1.3. We keep a dictionary of terms (sometimes also referred to
+   as a vocabulary or lexicon; in this book, we use dictionary for the data structure and
+   vocabulary for the set of terms). Then for each term, we have a list that records which
+   documents the term occurs in. Each item in the list – which records that a term appeared
+   in a document (and, later, often, the positions in the document) – is conventionally
+   called a posting. [4]_ The list is then called a postings list (or inverted list), and
+   all the postings lists taken together are referred to as the postings. The dictionary in
+   Figure 1.3 has been sorted alphabetically and each postings list is sorted by document ID.
+   We will see why this is useful in Section 1.3, below, but later we will also consider
+   alternatives to doing this (Section 7.1.5).
 
+.. [2] Formally, we take the transpose of the matrix to be able to get the terms as column
+       vectors.
 
-A first take at building an inverted index
-==========================================
+.. [3] Some information retrieval researchers prefer the term inverted file, but expressions
+       like in dex construction and index compression are much more common than inverted file
+       construction and inverted file compression. For consistency, we use (inverted) index
+       throughout this book. 
 
-.. To gain the speed benefits of indexing at retrieval time, we have to build the index in advance. The major steps in this are:
+.. [4] In a (non-positional) inverted index, a posting is just a document ID, but it is
+       inherently associated with a term, via the postings list it is placed on; sometimes
+       we will also talk of a (term, docID) pair as a posting.
 
-1. Collect the documents to be indexed:
-Friends, Romans, countrymen. So let it be with Caesar ...
-
-2. Tokenize the text, turning each document into a list of tokens:
-Friends
-Romans
-countrymen
-So
-
-3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms:	. . .
-friend
-roman
-countryman
-so
-
-4. Index the documents that each term occurs in by creating an inverted in- dex, consisting of a dictionary and postings.
-
-.. INVERTED INDEX
-.. DICTIONARY VOCABULARY LEXICON
-.. POSTING POSTINGS LIST
-.. POSTINGS
-
-Brutus → [1][2][4][11][31][45][173][174]
-Caesar → [1][2][4][5][6][16][57][132] ...
-Calpurnia → [2][31][54][101]
-
-Dictionary	Postings 
-
-◮ Figure 1.3	The two parts of an inverted index. The dictionary is commonly kept
-in memory, with pointers to each postings list, which is stored on disk.
-
-.. DOCID
-.. SORTING
-.. DOCUMENT FREQUENCY
-
-.. We will define and discuss the earlier stages of processing, that is, steps 1–3, in Section 2.2 (page 22). Until then you can think of tokens and normalized tokens as also loosely equivalent to words. Here, we assume that the first 3 steps have already been done, and we examine building a basic inverted index by sort-based indexing.
-
-.. Within a document collection, we assume that each document has a unique serial number, known as the document identifier (docID). During index con- struction, we can simply assign successive integers to each new document when it is first encountered. The input to indexing is a list of normalized tokens for each document, which we can equally think of as a list of pairs of term and docID, as in Figure 1.4. The core indexing step is sorting this list so that the terms are alphabetical, giving us the representation in the middle column of Figure 1.4. Multiple occurrences of the same term from the same document are then merged. [5]_	Instances of the same term are then grouped, and the result is split into a dictionary and postings, as shown in the right column of Figure 1.4. Since a term generally occurs in a number of docu- ments, this data organization already reduces the storage requirements of the index. The dictionary also records some statistics, such as the number of documents which contain each term (the document frequency, which is here also the length of each postings list). This information is not vital for a ba- sic Boolean search engine, but it allows us to improve the efficiency of the
-
-
-Doc 1
-I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.
-Doc 2
-So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:
-term docID
-term docID
-I	1 did	1 enact	1 julius	1 caesar	1 I 1 was	1 killed	1 i’ 1 the	1 capitol	1 brutus 1 killed 1 me 1=⇒i’ so 2 it
-2 2 1 2 1 1 2 2 1 1 1 1 1
-ambitious be brutus brutus capitol caesar caesar caesar did
-term
-doc. freq.	→ → 1→
-postings lists
-2
-2
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-2
-1
-2
-2
-1
-2
-2
-1
-2
-ambitious 1
-be
-2
-brutus 2	→	→ capitol 1	→
-2
-caesar 2 →
-1→
-enact 1	→
-hath1→
-1→
-1→
-→
-did
-enact hath I I
-julius killed killed let me noble so the the told you was was with
-1=⇒ 1 →
-I
-i’
-it
-2 julius1 →
-let	2 it	2 be	2 with 2 caesar	2 the	2 noble 2 brutus	2 hath 2 told	2 you 2 caesar	2 was 2 ambitious	2
-◮ Figure 1.4 in each document, tagged by their documentID (left) is sorted alphabetically (mid- dle). Instances of the same term are then grouped by word and then by documentID. The terms and documentIDs are then separated out (right). The dictionary stores the terms, and has a pointer to the postings list for each term. It commonly also stores other summary information such as, here, the document frequency of each term. We use this information for improving query time efficiency and, later, for weighting in ranked retrieval models. Each postings list stores the list of documents in which a term occurs, and may store other information such as the term frequency (the frequency of each term in each document) or the position(s) of the term in each document.
-Online edition (c)␣2009 Cambridge UP
-1 1 1	1→ 2 1 2 2 1 2 told1 → 2 2 1 2 2
-killed 1	→
-let
-me
-1→ noble 1	→ so1 →
-2
-the2 →→
-you1 → was2 →→ with1 →
-2
-Building an index by sorting and grouping. The sequence of terms
-page:
-1.2	A first take at building an inverted index	9
-search engine at query time, and it is a statistic later used in many ranked re- trieval models. The postings are secondarily sorted by docID. This provides the basis for efficient query processing. This inverted index structure is es- sentially without rivals as the most efficient structure for supporting ad hoc text search.
-In the resulting index, we pay for storage of both the dictionary and the postings lists. The latter are much larger, but the dictionary is commonly kept in memory, while postings lists are normally kept on disk, so the size of each is important, and in Chapter 5 we will examine how each can be optimized for storage and access efficiency. What data structure should be used for a postings list? A fixed length array would be wasteful as some words occur in many documents, and others in very few. For an in-memory postings list, two good alternatives are singly linked lists or variable length arrays. Singly linked lists allow cheap insertion of documents into postings lists (following updates, such as when recrawling the web for updated doc- uments), and naturally extend to more advanced indexing strategies such as skip lists (Section 2.3), which require additional pointers. Variable length ar- rays win in space requirements by avoiding the overhead for pointers and in time requirements because their use of contiguous memory increases speed on modern processors with memory caches. Extra pointers can in practice be encoded into the lists as offsets. If updates are relatively infrequent, variable length arrays will be more compact and faster to traverse. We can also use a hybrid scheme with a linked list of fixed length arrays for each term. When postings lists are stored on disk, they are stored (perhaps compressed) as a contiguous run of postings without explicit pointers (as in Figure 1.3), so as to minimize the size of the postings list and the number of disk seeks to read a postings list into memory.
-? Exercise 1.1	[⋆]
-Draw the inverted index that would be built for the following document collection. (See Figure 1.3 for an example.)
-Doc 1 Doc 2 Doc 3 Doc 4
-new home sales top forecasts home sales rise in july increase in home sales in july july new home sales rise
-Exercise 1.2	[⋆] Consider these documents:
-Doc 1 Doc 2 Doc 3 Doc 4
-breakthrough drug for schizophrenia new schizophrenia drug new approach for treatment of schizophrenia new hopes for schizophrenia patients
-a. Drawtheterm-documentincidencematrixforthisdocumentcollection.
-Online edition (c)␣2009 Cambridge UP
-page:
-10
-1	Boolean retrieval
-−→ → → → → → → → −→	→	→	→
-1
-2
-4
-11
-31
-45
-173
-174
-Brutus Calpurnia
-Intersection ◮ Figure 1.5	Intersecting the postings lists for Brutus and Calpurnia from Figure 1.3.
-b. Drawtheinvertedindexrepresentationforthiscollection,asinFigure1.3(page7). Exercise 1.3	[⋆]
-For the document collection shown in Exercise 1.2, what are the returned results for these queries:
-a. schizophrenia AND drug b. for AND NOT(drug OR approach)
-Processing Boolean queries
-How do we process a query using an inverted index and the basic Boolean retrieval model? Consider processing the simple conjunctive query:
-Brutus AND Calpurnia over the inverted index partially shown in Figure 1.3 (page 7). We: 1. Locate Brutus in the Dictionary 2. Retrieve its postings 3. Locate Calpurnia in the Dictionary 4. Retrieve its postings 5. Intersect the two postings lists, as shown in Figure 1.5.
-The intersection operation is the crucial one: we need to efficiently intersect postings lists so as to be able to quickly find documents that contain both terms. (This operation is sometimes referred to as merging postings lists: this slightly counterintuitive name reflects using the term merge algorithm for a general family of algorithms that combine multiple sorted lists by inter- leaved advancing of pointers through each; here we are merging the lists with a logical AND operation.)
-There is a simple and effective method of intersecting postings lists using the merge algorithm (see Figure 1.6): we maintain pointers into both lists
-2
-31
-54
-101
-2
-31
-=⇒	→
-1.3
-SIMPLE CONJUNCTIVE QUERIES (1.1)
-POSTINGS LIST INTERSECTION
-POSTINGS MERGE
-Online edition (c)␣2009 Cambridge UP
-page:
-(1.2)
-QUERY OPTIMIZATION
-(1.3)
-1.3	Processing Boolean queries	11 INTERSECT(p1, p2)
-answer←⟨⟩ while p1 ̸= NIL and p2 ̸= NIL do if docID(p1) = docID(p2)
-then ADD(answer, docID(p1)) p1 ← next(p1)
-p2 ← next(p2) else if docID(p1) < docID(p2)
-then p1 ← next(p1)
-1 2 3 4 5 6 7 8 9
-10 ◮ Figure 1.6	Algorithm for the intersection of two postings lists p1 and p2.
-and walk through the two postings lists simultaneously, in time linear in the total number of postings entries. At each step, we compare the docID pointed to by both pointers. If they are the same, we put that docID in the results list, and advance both pointers. Otherwise we advance the pointer pointing to the smaller docID. If the lengths of the postings lists are x and y, the intersection takes O(x + y) operations. Formally, the complexity of querying is Θ(N), where N is the number of documents in the collection.6 Our indexing methods gain us just a constant, not a difference in Θ time complexity compared to a linear scan, but in practice the constant is huge. To use this algorithm, it is crucial that postings be sorted by a single global ordering. Using a numeric sort by docID is one simple way to achieve this.
-We can extend the intersection operation to process more complicated queries like:
-(Brutus OR Caesar) AND NOT Calpurnia
-Query optimization is the process of selecting how to organize the work of an- swering a query so that the least total amount of work needs to be done by the system. A major element of this for Boolean queries is the order in which postings lists are accessed. What is the best order for query processing? Con- sider a query that is an AND of t terms, for instance:
-Brutus AND Caesar AND Calpurnia For each of the t terms, we need to get its postings, then AND them together.
-The standard heuristic is to process terms in order of increasing document
-6. The notation Θ(·) is used to express an asymptotically tight bound on the complexity of an algorithm. Informally, this is often written as O(·), but this notation really expresses an asymptotic upper bound, which need not be tight (Cormen et al. 1990).
-else p2 ← next(p2) return answer
-Online edition (c)␣2009 Cambridge UP
-page:
-12
-1
-Boolean retrieval
-(1.4)
-(1.5)
-◮ Figure 1.7	Algorithm for conjunctive queries that returns the set of documents containing each term in the input list of terms.
-frequency: if we start by intersecting the two smallest postings lists, then all intermediate results must be no bigger than the smallest postings list, and we are therefore likely to do the least amount of total work. So, for the postings lists in Figure 1.3 (page 7), we execute the above query as:
-(Calpurnia AND Brutus) AND Caesar
-This is a first justification for keeping the frequency of terms in the dictionary: it allows us to make this ordering decision based on in-memory data before accessing any postings list.
-Consider now the optimization of more general queries, such as:
-(madding OR crowd) AND (ignoble OR strife) AND (killed OR slain)
-As before, we will get the frequencies for all terms, and we can then (con- servatively) estimate the size of each OR by the sum of the frequencies of its disjuncts. We can then process the query in increasing order of the size of each disjunctive term.
-For arbitrary Boolean queries, we have to evaluate and temporarily store the answers for intermediate expressions in a complex expression. However, in many circumstances, either because of the nature of the query language, or just because this is the most common type of query that users submit, a query is purely conjunctive. In this case, rather than viewing merging post- ings lists as a function with two inputs and a distinct output, it is more ef- ficient to intersect each retrieved postings list with the current intermediate result in memory, where we initialize the intermediate result by loading the postings list of the least frequent term. This algorithm is shown in Figure 1.7. The intersection operation is then asymmetric: the intermediate results list is in memory while the list it is being intersected with is being read from disk. Moreover the intermediate results list is always at least as short as the other list, and in many cases it is orders of magnitude shorter. The postings
-INTERSECT(⟨t1, . . . , tn⟩) 1	terms ← SORTBYINCREASINGFREQUENCY(⟨t1, . . . , tn⟩) 2	result ← postings( f irst(terms)) 3	terms ← rest(terms) 4	while terms ̸= NIL and result ̸= NIL 5	do result ← INTERSECT(result, postings( f irst(terms))) 6	terms ← rest(terms) 7	return result
-Online edition (c)␣2009 Cambridge UP
-page:
-1.3	Processing Boolean queries	13
-intersection can still be done by the algorithm in Figure 1.6, but when the difference between the list lengths is very large, opportunities to use alter- native techniques open up. The intersection can be calculated in place by destructively modifying or marking invalid items in the intermediate results list. Or the intersection can be done as a sequence of binary searches in the long postings lists for each posting in the intermediate results list. Another possibility is to store the long postings list as a hashtable, so that membership of an intermediate result item can be calculated in constant rather than linear or log time. However, such alternative techniques are difficult to combine with postings list compression of the sort discussed in Chapter 5. Moreover, standard postings list intersection operations remain necessary when both terms of a query are very common.
-? Exercise 1.4	[⋆]
-For the queries below, can we still run through the intersection in time O(x + y), where x and y are the lengths of the postings lists for Brutus and Caesar? If not, what can we achieve?
-a. Brutus AND NOT Caesar b. Brutus OR NOT Caesar
-Exercise 1.5	[⋆] Extend the postings merge algorithm to arbitrary Boolean query formulas. What is
-its time complexity? For instance, consider: c. (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
-Can we always merge in linear time? Linear in what? Can we do better than this?
-Exercise 1.6	[⋆⋆] We can use distributive laws for AND and OR to rewrite queries.
-a. ShowhowtorewritethequeryinExercise1.5intodisjunctivenormalformusing the distributive laws.
-b. Would the resulting query be more or less efficiently evaluated than the original form of this query?
-c. Is this result true in general or does it depend on the words and the contents of the document collection?
-Exercise 1.7	[⋆] Recommend a query processing order for
-d. (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes) given the following postings list sizes:
-Online edition (c)␣2009 Cambridge UP
-page:
-14
-1
-Boolean retrieval
-[⋆]
-1.4
-RANKED RETRIEVAL MODEL FREE TEXT QUERIES
-how could we use the frequency of countrymen in evaluating the best query evaluation order? In particular, propose a way of handling negation in determining the order of query processing.
-Exercise 1.9	[⋆⋆] For a conjunctive query, is processing postings lists in order of size guaranteed to be
-optimal? Explain why it is, or give an example where it isn’t. Exercise 1.10	[⋆⋆]
-Write out a postings merge algorithm, in the style of Figure 1.6 (page 11), for an x OR y query.
-Exercise 1.11	[⋆⋆]
-How should the Boolean query x AND NOT y be handled? Why is naive evaluation of this query normally very expensive? Write out a postings merge algorithm that evaluates this query efficiently.
-The extended Boolean model versus ranked retrieval
-The Boolean retrieval model contrasts with ranked retrieval models such as the vector space model (Section 6.3), in which users largely use free text queries, that is, just typing one or more words rather than using a precise language with operators for building up query expressions, and the system decides which documents best satisfy the query. Despite decades of academic re- search on the advantages of ranked retrieval, systems implementing the Boo- lean retrieval model were the main or only search option provided by large commercial information providers for three decades until the early 1990s (ap- proximately the date of arrival of the World Wide Web). However, these systems did not have just the basic Boolean operations (AND, OR, and NOT) which we have presented so far. A strict Boolean expression over terms with an unordered results set is too limited for many of the information needs that people have, and these systems implemented extended Boolean retrieval models by incorporating additional operators such as term proximity oper- ators. A proximity operator is a way of specifying that two terms in a query
-PROXIMITY OPERATOR
-Term
-eyes kaleidoscope marmalade skies tangerine trees
-Exercise 1.8
-Postings size
-213312 87009 107913 271658 46653 316812
-If the query is: e. friends AND romans AND (NOT countrymen)
-Online edition (c)␣2009 Cambridge UP
-page:
-1.4	The extended Boolean model versus ranked retrieval	15
-must occur close to each other in a document, where closeness may be mea- sured by limiting the allowed number of intervening words or by reference to a structural unit such as a sentence or paragraph.
-✎ Example 1.1: Commercial Boolean searching: Westlaw.	Westlaw (http://www.westlaw.com/) is the largest commercial legal search service (in terms of the number of paying sub- scribers), with over half a million subscribers performing millions of searches a day over tens of terabytes of text data. The service was started in 1975. In 2005, Boolean
-search (called “Terms and Connectors” by Westlaw) was still the default, and used by a large percentage of users, although ranked free text querying (called “Natural Language” by Westlaw) was added in 1992. Here are some example Boolean queries on Westlaw:
-Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company. Query: "trade secret" /s disclos! /s prevent /s employe!
-Information need: Requirements for disabled people to be able to access a work- place. Query: disab! /p access! /s work-site work-place (employment /3 place)
-Information need: Cases about a host’s responsibility for drunk guests. Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest
-Note the long, precise queries and the use of proximity operators, both uncommon in web search. Submitted queries average about ten words in length. Unlike web search conventions, a space between words represents disjunction (the tightest bind- ing operator), & is AND and /s, /p, and /k ask for matches in the same sentence, same paragraph or within k words respectively. Double quotes give a phrase search (consecutive words); see Section 2.4 (page 39). The exclamation mark (!) gives a trail- ing wildcard query (see Section 3.2, page 51); thus liab! matches all words starting with liab. Additionally work-site matches any of worksite, work-site or work site; see Section 2.2.1 (page 22). Typical expert queries are usually carefully defined and incre- mentally developed until they obtain what look to be good results to the user.
-Many users, particularly professionals, prefer Boolean query models. Boolean queries are precise: a document either matches the query or it does not. This of- fers the user greater control and transparency over what is retrieved. And some do- mains, such as legal materials, allow an effective means of document ranking within a Boolean model: Westlaw returns documents in reverse chronological order, which is in practice quite effective. In 2007, the majority of law librarians still seem to rec- ommend terms and connectors for high recall searches, and the majority of legal users think they are getting greater control by using them. However, this does not mean that Boolean queries are more effective for professional searchers. Indeed, ex- perimenting on a Westlaw subcollection, Turtle (1994) found that free text queries produced better results than Boolean queries prepared by Westlaw’s own reference librarians for the majority of the information needs in his experiments. A general problem with Boolean search is that using AND operators tends to produce high pre- cision but low recall searches, while using OR operators gives low precision but high recall searches, and it is difficult or impossible to find a satisfactory middle ground.
-In this chapter, we have looked at the structure and construction of a basic
-Online edition (c)␣2009 Cambridge UP
-page:
-16
-1	Boolean retrieval
-inverted index, comprising a dictionary and postings lists. We introduced the Boolean retrieval model, and examined how to do efficient retrieval via linear time merges and simple query optimization. In Chapters 2–7 we will consider in detail richer query models and the sort of augmented index struc- tures that are needed to handle them efficiently. Here we just mention a few of the main additional things we would like to be able to do:
-1. We would like to better determine the set of terms in the dictionary and to provide retrieval that is tolerant to spelling mistakes and inconsistent choice of words.
-2. Itisoftenusefultosearchforcompoundsorphrasesthatdenoteaconcept such as “operating system”. As the Westlaw examples show, we might also wish	to	do	proximity	queries	such	as	Gates	N E A R	Microsoft.	To	answer such queries, the index has to be augmented to capture the proximities of terms in documents.
-3. A Boolean model only records term presence or absence, but often we would like to accumulate evidence, giving more weight to documents that have a term several times as opposed to ones that contain it only once. To be able to do this we need term frequency information (the number of times a term occurs in a document) in postings lists.
-4. Boolean queries just retrieve a set of matching documents, but commonly we wish to have an effective method to order (or “rank”) the returned results. This requires having a mechanism for determining a document score which encapsulates how good a match a document is for a query.
-With these additional ideas, we will have seen most of the basic technol- ogy that supports ad hoc searching over unstructured information. Ad hoc searching over documents has recently conquered the world, powering not only web search engines but the kind of unstructured search that lies behind the large eCommerce websites. Although the main web search engines differ by emphasizing free text querying, most of the basic issues and technologies of indexing and querying remain the same, as we will see in later chapters. Moreover, over time, web search engines have added at least partial imple- mentations of some of the most popular operators from extended Boolean models: phrase search is especially popular and most have a very partial implementation of Boolean operators. Nevertheless, while these options are liked by expert searchers, they are little used by most people and are not the main focus in work on trying to improve web search engine performance.
-TERM FREQUENCY
-? Exercise 1.12	[⋆] Write a query using Westlaw syntax which would find any of the words professor,
-teacher, or lecturer in the same sentence as a form of the verb explain.
-Online edition (c)␣2009 Cambridge UP
-page:
-1.5	References and further reading	17
-Exercise 1.13	[⋆]
-Try using the Boolean search features on a couple of major web search engines. For instance, choose a word, such as burglar, and submit the queries (i) burglar, (ii) burglar AND burglar, and (iii) burglar OR burglar. Look at the estimated number of results and top hits. Do they make sense in terms of Boolean logic? Often they haven’t for major search engines. Can you make sense of what is going on? What about if you try different words? For example, query for (i) knight, (ii) conquer, and then (iii) knight OR conquer. What bound should the number of results from the first two queries place on the third query? Is this bound observed?
-1.5	References and further reading
-The practical pursuit of computerized information retrieval began in the late 1940s (Cleverdon 1991, Liddy 2005). A great increase in the production of scientific literature, much in the form of less formal technical reports rather than traditional journal articles, coupled with the availability of computers, led to interest in automatic document retrieval. However, in those days, doc- ument retrieval was always based on author, title, and keywords; full-text search came much later.
-The article of Bush (1945) provided lasting inspiration for the new field:
-“Consider a future device for individual use, which is a sort of mech- anized private file and library. It needs a name, and, to coin one at random, ‘memex’ will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mech- anized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.”
-The term Information Retrieval was coined by Calvin Mooers in 1948/1950 (Mooers 1950).
-In 1958, much newspaper attention was paid to demonstrations at a con- ference (see Taube and Wooster 1958) of IBM “auto-indexing” machines, based primarily on the work of H. P. Luhn. Commercial interest quickly gravitated towards Boolean retrieval systems, but the early years saw a heady debate over various disparate technologies for retrieval systems. For example Moo- ers (1961) dissented:
-“It is a common fallacy, underwritten at this date by the investment of several million dollars in a variety of retrieval hardware, that the al- gebra of George Boole (1847) is the appropriate formalism for retrieval system design. This view is as widely and uncritically accepted as it is wrong.”
-The observation of AND vs. OR giving you opposite extremes in a precision/ recall tradeoff, but not the middle ground comes from (Lee and Fox 1988).
-Online edition (c)␣2009 Cambridge UP
-page:
-18
-REGULAR EXPRESSIONS
-:1	Boolean retrieval
-The book (Witten et al. 1999) is the standard reference for an in-depth com- parison of the space and time efficiency of the inverted index versus other possible data structures; a more succinct and up-to-date presentation ap- pears in Zobel and Moffat (2006). We further discuss several approaches in Chapter 5.
-Friedl (2006) covers the practical usage of regular expressions for searching. The underlying computer science appears in (Hopcroft et al. 2000).
-Online edition (c)␣2009 Cambridge UP
-
-.. [2] Formally, we take the transpose of the matrix to be able to get the terms as column vectors.
-.. [3] Some information retrieval researchers prefer the term inverted file, but expressions like in- dex construction and index compression are much more common than inverted file construction and inverted file compression. For consistency, we use (inverted) index throughout this book. 
-.. [4] In a (non-positional) inverted index, a posting is just a document ID, but it is inherently associated with a term, via the postings list it is placed on; sometimes we will also talk of a (term, docID) pair as a posting.
-.. [5] Unix users can note that these steps are similar to use of the sort and then uniq commands.
+.. END

source/chapter01/boolean_retrieval.rst

    :maxdepth: 1
 
    an_example_information_retrieval_problem
-..   a_first_take_at_building_an_inverted_index
-..   processing_boolean_queries
-..   the_extended_boolean_model_versus_ranked_retrieval
-..   references_and_further_reading
+   a_first_take_at_building_an_inverted_index
+   processing_boolean_queries
+   the_extended_boolean_model_versus_ranked_retrieval
+   references_and_further_reading
 
 
 .. In modern parlance, the word “search” has tended to replace “(information) retrieval”; 
 .. [1] 近代的な用語では、"検索 (search)" という単語が "(情報)検索" に置き換わる傾向にあります。\
        "検索" という語はかなり曖昧ですが、私たちは文脈の中で2つを同じ意味で使用します。
 
-.. __END__
+.. END

source/chapter01/processing_boolean_queries.rst

+.. _processing_boolean_queries:
+
+Processing Boolean queries
+==========================
+
+.. How do we process a query using an inverted index and the basic Boolean retrieval model?
+   Consider processing the simple conjunctive query:
+
+ (1.1) Brutus AND Calpurnia
+
+.. over the inverted index partially shown in Figure 1.3 (page 7). We:
+
+1. Locate Brutus in the Dictionary
+2. Retrieve its postings
+3. Locate Calpurnia in the Dictionary
+4. Retrieve its postings
+5. Intersect the two postings lists, as shown in Figure 1.5.
+
+.. The intersection operation is the crucial one: we need to efficiently intersect postings
+   lists so as to be able to quickly find documents that contain both terms. (This operation
+   is sometimes referred to as merging postings lists: this slightly counterintuitive name
+   reflects using the term merge algorithm for a general family of algorithms that combine
+   multiple sorted lists by interleaved advancing of pointers through each; here we are
+   merging the lists with a logical AND operation.)
+
+.. There is a simple and effective method of intersecting postings lists using the merge
+   algorithm (see Figure 1.6): we maintain pointers into both lists and walk through the two
+   postings lists simultaneously, in time linear in the total number of postings entries.
+   At each step, we compare the docID pointed to by both pointers. If they are the same,
+   we put that docID in the results list, and advance both pointers. Otherwise we advance
+   the pointer pointing to the smaller docID. If the lengths of the postings lists are x and
+   y, the intersection takes O(x + y) operations. Formally, the complexity of querying is Θ(N),
+   where N is the number of documents in the collection.6 Our indexing methods gain us just a
+   constant, not a difference in Θ time complexity compared to a linear scan, but in practice
+   the constant is huge. To use this algorithm, it is crucial that postings be sorted by a single
+   global ordering. Using a numeric sort by docID is one simple way to achieve this.
+
+.. SIMPLE CONJUNCTIVE QUERIES
+.. POSTINGS LIST INTERSECTION
+.. POSTINGS MERGE
+.. QUERY OPTIMIZATION
+
+.. Figure 1.6 Algorithm for the intersection of two postings lists p1 and p2.
+
+.. We can extend the intersection operation to process more complicated queries like:
+
+ (1.2) (Brutus OR Caesar) AND NOT Calpurnia
+
+.. Query optimization is the process of selecting how to organize the work of answering
+   a query so that the least total amount of work needs to be done by the system.
+   A major element of this for Boolean queries is the order in which postings lists are
+   accessed. What is the best order for query processing? Con- sider a query that is an AND of
+   t terms, for instance:
+
+(1.3) Brutus AND Caesar AND Calpurnia
+
+.. For each of the t terms, we need to get its postings, then AND them together.
+   The standard heuristic is to process terms in order of increasing document frequency:
+   if we start by intersecting the two smallest postings lists, then all intermediate
+   results must be no bigger than the smallest postings list, and we are therefore likely
+   to do the least amount of total work. So, for the postings lists in Figure 1.3 (page 7),
+   we execute the above query as:
+
+(1.4) (Calpurnia AND Brutus) AND Caesar
+
+.. Figure 1.7  Algorithm for conjunctive queries that returns the set of documents containing
+   each term in the input list of terms.
+
+.. This is a first justification for keeping the frequency of terms in the dictionary: it allows
+   us to make this ordering decision based on in-memory data before accessing any postings list.
+   Consider now the optimization of more general queries, such as:
+
+(1.5) (madding OR crowd) AND (ignoble OR strife) AND (killed OR slain)
+
+.. As before, we will get the frequencies for all terms, and we can then (conservatively)
+   estimate the size of each OR by the sum of the frequencies of its disjuncts. We can then
+   process the query in increasing order of the size of each disjunctive term.
+
+.. For arbitrary Boolean queries, we have to evaluate and temporarily store the answers for
+   intermediate expressions in a complex expression. However, in many circumstances, either
+   because of the nature of the query language, or just because this is the most common type of
+   query that users submit, a query is purely conjunctive. In this case, rather than viewing
+   merging postings lists as a function with two inputs and a distinct output, it is more
+   efficient to intersect each retrieved postings list with the current intermediate result in
+   memory, where we initialize the intermediate result by loading the postings list of the least
+   frequent term. This algorithm is shown in Figure 1.7. The intersection operation is then
+   asymmetric: the intermediate results list is in memory while the list it is being intersected
+   with is being read from disk. Moreover the intermediate results list is always at least
+   as short as the other list, and in many cases it is orders of magnitude shorter. The postings
+   intersection can still be done by the algorithm in Figure 1.6, but when the difference between
+   the list lengths is very large, opportunities to use alternative techniques open up.
+   The intersection can be calculated in place by destructively modifying or marking invalid
+   items in the intermediate results list. Or the intersection can be done as a sequence of
+   binary searches in the long postings lists for each posting in the intermediate results list.
+   Another possibility is to store the long postings list as a hashtable, so that membership of
+   an intermediate result item can be calculated in constant rather than linear or log time.
+   However, such alternative techniques are difficult to combine with postings list compression
+   of the sort discussed in Chapter 5. Moreover, standard postings list intersection operations
+   remain necessary when both terms of a query are very common.
+
+
+.. ?
+.. Exercise 1.4  [⋆]
+.. For the queries below, can we still run through the intersection in time O(x + y), where x and
+   y are the lengths of the postings lists for Brutus and Caesar? If not, what can we achieve?
+
+a. Brutus AND NOT Caesar
+b. Brutus OR NOT Caesar
+
+.. Exercise 1.5  [⋆]
+.. Extend the postings merge algorithm to arbitrary Boolean query formulas. What is
+   its time complexity? For instance, consider: 
+
+c. (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
+
+.. Can we always merge in linear time? Linear in what? Can we do better than this?
+
+.. Exercise 1.6  [⋆⋆]
+.. We can use distributive laws for AND and OR to rewrite queries.
+
+a. ShowhowtorewritethequeryinExercise1.5intodisjunctivenormalformusing the distributive laws.
+b. Would the resulting query be more or less efficiently evaluated than the original form of
+   this query?
+c. Is this result true in general or does it depend on the words and the contents of the document
+   collection?
+
+.. Exercise 1.7  [⋆]
+.. Recommend a query processing order for
+
+d. (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes) 
+
+.. given the following postings list sizes:
+
+   Term          Postings size
+   eyes                 213312
+   kaleidoscope          87009
+   marmalade            107913
+   skies                271658
+   tangerine             46653
+   trees                316812
+
+.. Exercise 1.8
+.. If the query is:
+
+e. friends AND romans AND (NOT countrymen)
+
+.. how could we use the frequency of countrymen in evaluating the best query evaluation order?
+   In particular, propose a way of handling negation in determining the order of query processing.
+
+.. Exercise 1.9  [⋆⋆]
+.. For a conjunctive query, is processing postings lists in order of size guaranteed to be
+   optimal? Explain why it is, or give an example where it isn’t.
+
+.. Exercise 1.10  [⋆⋆]
+.. Write out a postings merge algorithm, in the style of Figure 1.6 (page 11), for an x OR y query.
+
+.. Exercise 1.11  [⋆⋆]
+.. How should the Boolean query x AND NOT y be handled? Why is naive evaluation of this query
+   normally very expensive? Write out a postings merge algorithm that evaluates this query
+   efficiently.
+
+.. [6] The notation Θ(·) is used to express an asymptotically tight bound on the complexity of
+   an algorithm. Informally, this is often written as O(·), but this notation really expresses
+   an asymptotic upper bound, which need not be tight (Cormen et al. 1990).
+
+.. END

source/chapter01/references_and_further_reading.rst

+.. _references_and_further_reading:
+
+References and further reading
+==============================
+
+.. The practical pursuit of computerized information retrieval began in the late 1940s
+   (Cleverdon 1991, Liddy 2005). A great increase in the production of scientific literature,
+   much in the form of less formal technical reports rather than traditional journal articles,
+   coupled with the availability of computers, led to interest in automatic document retrieval.
+   However, in those days, doc- ument retrieval was always based on author, title, and keywords;
+   full-text search came much later.
+
+.. The article of Bush (1945) provided lasting inspiration for the new field:
+
+  “Consider a future device for individual use, which is a sort of mechanized private file and
+  library. It needs a name, and, to coin one at random, ‘memex’ will do. A memex is a device
+  in which an individual stores all his books, records, and communications, and which is
+  mechanized so that it may be consulted with exceeding speed and flexibility. It is an
+  enlarged intimate supplement to his memory.”
+
+.. The term Information Retrieval was coined by Calvin Mooers in 1948/1950 (Mooers 1950).
+
+.. In 1958, much newspaper attention was paid to demonstrations at a conference (see Taube
+   and Wooster 1958) of IBM “auto-indexing” machines, based primarily on the work of H. P. Luhn.
+   Commercial interest quickly gravitated towards Boolean retrieval systems, but the early years
+   saw a heady debate over various disparate technologies for retrieval systems. For example
+   Mooers (1961) dissented:
+
+  “It is a common fallacy, underwritten at this date by the investment of several million
+  dollars in a variety of retrieval hardware, that the algebra of George Boole (1847) is
+  the appropriate formalism for retrieval system design. This view is as widely and
+  uncritically accepted as it is wrong.”
+
+.. REGULAR EXPRESSIONS
+
+.. The observation of AND vs. OR giving you opposite extremes in a precision/ recall tradeoff,
+   but not the middle ground comes from (Lee and Fox 1988).
+   The book (Witten et al. 1999) is the standard reference for an in-depth comparison of
+   the space and time efficiency of the inverted index versus other possible data structures;
+   a more succinct and up-to-date presentation ap- pears in Zobel and Moffat (2006).
+   We further discuss several approaches in Chapter 5.
+
+.. Friedl (2006) covers the practical usage of regular expressions for searching.
+   The underlying computer science appears in (Hopcroft et al. 2000).
+
+
+.. END

source/chapter01/the_extended_boolean_model_versus_ranked_retrieval.rst

+.. _the_extended_boolean_model_versus_ranked_retrieval:
+
+The extended Boolean model versus ranked retrieval
+==================================================
+
+.. RANKED RETRIEVAL MODEL FREE TEXT QUERIES
+.. PROXIMITY OPERATOR
+
+.. The Boolean retrieval model contrasts with ranked retrieval models such as the vector
+   space model (Section 6.3), in which users largely use free text queries, that is, just
+   typing one or more words rather than using a precise language with operators for building
+   up query expressions, and the system decides which documents best satisfy the query.
+   Despite decades of academic research on the advantages of ranked retrieval, systems
+   implementing the Boolean retrieval model were the main or only search option provided by
+   large commercial information providers for three decades until the early 1990s
+   (approximately the date of arrival of the World Wide Web). However, these systems did not
+   have just the basic Boolean operations (AND, OR, and NOT) which we have presented so far.
+   A strict Boolean expression over terms with an unordered results set is too limited for
+   many of the information needs that people have, and these systems implemented extended
+   Boolean retrieval models by incorporating additional operators such as term proximity
+   operators. A proximity operator is a way of specifying that two terms in a query must occur
+   close to each other in a document, where closeness may be measured by limiting the allowed
+   number of intervening words or by reference to a structural unit such as a sentence or
+   paragraph.
+
+.. Example 1.1: Commercial Boolean searching: Westlaw.
+.. Westlaw (http://www.westlaw.com/) is the largest commercial legal search service (in terms
+   of the number of paying sub- scribers), with over half a million subscribers performing
+   millions of searches a day over tens of terabytes of text data. The service was started
+   in 1975.
+   In 2005, Boolean search (called “Terms and Connectors” by Westlaw) was still the default,
+   and used by a large percentage of users, although ranked free text querying (called
+   “Natural Language” by Westlaw) was added in 1992. Here are some example Boolean queries on
+   Westlaw:
+
+    *Information need:* Information on the legal theories involved in preventing the
+    disclosure of trade secrets by employees formerly employed by a competing company.
+    *Query:* "trade secret" /s disclos! /s prevent /s employe!
+
+    *Information need:* Requirements for disabled people to be able to access a workplace.
+    *Query:* disab! /p access! /s work-site work-place (employment /3 place)
+
+    *Information need:* Cases about a host’s responsibility for drunk guests.
+    *Query:* host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest
+
+.. Note the long, precise queries and the use of proximity operators, both uncommon in web
+   search. Submitted queries average about ten words in length. Unlike web search conventions,
+   a space between words represents disjunction (the tightest binding operator), & is AND and
+   /s, /p, and /k ask for matches in the same sentence, same paragraph or within k words
+   respectively. Double quotes give a phrase search (consecutive words); see Section 2.4 (page 39).
+   The exclamation mark (!) gives a trail- ing wildcard query (see Section 3.2, page 51);
+   thus liab! matches all words starting with liab. Additionally work-site matches any of worksite,
+   work-site or work site; see Section 2.2.1 (page 22). Typical expert queries are usually
+   carefully defined and incre- mentally developed until they obtain what look to be good results
+   to the user.
+
+.. Many users, particularly professionals, prefer Boolean query models. Boolean queries are
+   precise: a document either matches the query or it does not. This of- fers the user greater
+   control and transparency over what is retrieved. And some domains, such as legal materials,
+   allow an effective means of document ranking within a Boolean model: Westlaw returns documents
+   in reverse chronological order, which is in practice quite effective. In 2007, the majority of
+   law librarians still seem to rec- ommend terms and connectors for high recall searches,
+   and the majority of legal users think they are getting greater control by using them.
+   However, this does not mean that Boolean queries are more effective for professional searchers.
+   Indeed, experimenting on a Westlaw subcollection, Turtle (1994) found that free text queries
+   produced better results than Boolean queries prepared by Westlaw’s own reference librarians
+   for the majority of the information needs in his experiments. A general problem with
+   Boolean search is that using AND operators tends to produce high precision but low recall
+   searches, while using OR operators gives low precision but high recall searches, and it is
+   difficult or impossible to find a satisfactory middle ground.
+
+.. In this chapter, we have looked at the structure and construction of a basic inverted index,
+   comprising a dictionary and postings lists. We introduced the Boolean retrieval model, and
+   examined how to do efficient retrieval via linear time merges and simple query optimization.
+   In Chapters 2–7 we will consider in detail richer query models and the sort of augmented
+   index structures that are needed to handle them efficiently. Here we just mention a few of
+   the main additional things we would like to be able to do:
+
+1. We would like to better determine the set of terms in the dictionary and to provide retrieval
+   that is tolerant to spelling mistakes and inconsistent choice of words.
+
+2. It is often useful to search for compounds or phrases that denote a concept such as
+   “operating system”. As the Westlaw examples show, we might also wish to do proximity queries
+   such as Gates NEAR Microsoft. To answer such queries, the index has to be augmented to
+   capture the proximities of terms in documents.
+
+3. A Boolean model only records term presence or absence, but often we would like to accumulate
+   evidence, giving more weight to documents that have a term several times as opposed to ones
+   that contain it only once. To be able to do this we need term frequency information (the number
+   of times a term occurs in a document) in postings lists.
+
+4. Boolean queries just retrieve a set of matching documents, but commonly we wish to have
+   an effective method to order (or “rank”) the returned results. This requires having a
+   mechanism for determining a document score which encapsulates how good a match a document
+   is for a query.
+
+.. With these additional ideas, we will have seen most of the basic technol- ogy that supports
+   ad hoc searching over unstructured information. Ad hoc searching over documents has recently
+   conquered the world, powering not only web search engines but the kind of unstructured
+   search that lies behind the large eCommerce websites. Although the main web search engines
+   differ by emphasizing free text querying, most of the basic issues and technologies of
+   indexing and querying remain the same, as we will see in later chapters. Moreover, over time,
+   web search engines have added at least partial implementations of some of the most popular
+   operators from extended Boolean models: phrase search is especially popular and most have
+   a very partial implementation of Boolean operators. Nevertheless, while these options are
+   liked by expert searchers, they are little used by most people and are not the main focus
+   in work on trying to improve web search engine performance.
+
+.. TERM FREQUENCY
+
+.. ?
+.. Exercise 1.12  [⋆]
+.. Write a query using Westlaw syntax which would find any of the words professor,
+   teacher, or lecturer in the same sentence as a form of the verb explain.
+
+.. Exercise 1.13  [⋆]
+.. Try using the Boolean search features on a couple of major web search engines.
+   For instance, choose a word, such as burglar, and submit the queries (i) burglar,
+   (ii) burglar AND burglar, and (iii) burglar OR burglar. Look at the estimated number of
+   results and top hits. Do they make sense in terms of Boolean logic? Often they haven’t
+   for major search engines. Can you make sense of what is going on? What about if you try
+   different words? For example, query for (i) knight, (ii) conquer, and then (iii) knight
+   OR conquer. What bound should the number of results from the first two queries place on
+   the third query? Is this bound observed?
+
+.. __END__