* Intro
searchr is a program that searches a collection of documents for the keywords you provide.

It has two main functions: indexer(...) and retriever(...)

The indexer(...) function takes 3 arguments:
        1. the test collection directory (where you store the document collection);
        2. the name of the stop list (one ships with the program, called "english.stop");
        3. the output directory for the index files (any directory name you like).

The retriever(...) function takes at least 3 arguments:
        the first is the directory containing the index files,
        the second is the number of documents to be returned,
        the rest are the keywords.
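
To make the division of labour between the two functions concrete, here is a minimal sketch of the general technique (an inverted index with stop-word filtering, queried by keyword). It is not searchr's actual code: the names build_index and top_documents, the in-memory dictionary of documents, and the raw term-frequency scoring are all assumptions made for illustration. searchr itself reads documents from a directory and writes index files to disk.

```python
from collections import Counter, defaultdict

def build_index(docs, stop_words):
    """Map each non-stop-word term to a Counter of {doc_id: term frequency}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            if term not in stop_words:
                index[term][doc_id] += 1
    return index

def top_documents(index, n, *keywords):
    """Return up to n doc ids ranked by summed keyword frequency.

    Raw term frequency is an assumed stand-in for whatever scoring
    searchr really uses; ties are broken by doc id.
    """
    scores = Counter()
    for keyword in keywords:
        for doc_id, freq in index.get(keyword.lower(), {}).items():
            scores[doc_id] += freq
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:n]]

# Hypothetical three-document collection, for illustration only.
docs = {
    "a.txt": "the lore of the digit",
    "b.txt": "pigment and lore and lore",
    "c.txt": "the quick brown fox",
}
index = build_index(docs, stop_words={"the", "of", "and"})
result = top_documents(index, 2, "lore", "pigment")
print(result)
```

The argument order of top_documents mirrors the retriever described above: the index first, then the number of documents, then any number of keywords.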




* How to run the program
From the command line, go to the directory that contains searchr.py, then enter:
$ python -i searchr.py

This leaves you in the Python shell with all of the functions loaded and ready to use.


** How to invoke the indexer function
Once you are in the Python interpreter, enter:
>>> indexer(<test_collection_dir>, <stoplist>, <output_dir>)

where
<test_collection_dir> is the path to the test collection directory,
<stoplist> is the name of the stop list (one is provided for you, called english.stop),
<output_dir> is the path to the directory where you want to store the index files.

All 3 arguments must be strings, and any argument that is a directory must end with a trailing slash.

For example, if the test collection directory is test_collection/sci.spsace/,
the stop list is english.stop, and the desired output directory is postings/,
then invoke the indexer with:
>>> indexer('test_collection/sci.spsace/', 'english.stop', 'postings/')

NOTE: a directory path must end with a trailing slash: "postings/", not "postings".
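
If you build directory paths in your own code before passing them in, a small helper can guard against a missing slash. This is a sketch only: with_trailing_slash is a hypothetical name and is not part of searchr.

```python
def with_trailing_slash(path):
    """Return path unchanged if it already ends with '/', else append one."""
    return path if path.endswith('/') else path + '/'

# Hypothetical usage: normalise the output directory before calling indexer.
output_dir = with_trailing_slash('postings')
print(output_dir)
```

The normalised value can then be passed wherever a directory argument is expected.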


** How to invoke the retriever
The retriever is invoked the same way as the indexer. For example, if the postings (index files) are
in the 'postings/' directory, you want the top 3 relevant documents returned, and you
have 3 keywords (lore, digit and pigment), then in the Python shell enter:
>>> retriever('postings/', 3, 'lore', 'digit', 'pigment')
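
Because the keywords are passed as separate positional arguments, a list of keywords you already hold can be unpacked at the call site with Python's * operator. The sketch below uses retriever_stub, a hypothetical stand-in that only echoes its arguments, so that it runs without searchr loaded. In the searchr shell you would call retriever itself in the same way.

```python
def retriever_stub(index_dir, n, *keywords):
    """Hypothetical stand-in for retriever: echoes the arguments it received."""
    return index_dir, n, keywords

keywords = ['lore', 'digit', 'pigment']

# Equivalent to retriever_stub('postings/', 3, 'lore', 'digit', 'pigment')
call = retriever_stub('postings/', 3, *keywords)
print(call)
```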


** Examples that come with the script
An example comes with the Python script. To run it, uncomment lines 197 and 270 in the script,
and then run the script again.
