- First install setuptools for Python to get easy_install: sudo apt-get install python-setuptools
- Then install pyparsing: easy_install pyparsing
- Unzip the source code in a directory of your choice.
- Make sure that all mapper and reducer scripts located in DocIRHadoop/InvertIndex are executable (for example, with chmod +x).
- Open a terminal.
- Add the parent of the DocIRHadoop directory to the PYTHONPATH environment variable: export PYTHONPATH=/path/to/parent/of/DocIRHadoop:$PYTHONPATH
- Go to the parent directory of DocIRHadoop.
- In a second terminal, go to the Hadoop directory and start Hadoop (bin/start-all.sh).
- In the first terminal type: python DocIRHadoop/run.py
- At the first prompt, enter the full path to the english-documents directory and press Enter.
- Enter the name of the destination directory in HDFS. Press Enter.
- You will now see a lot of output about the execution of DocIRHadoop, mostly from the MapReduce jobs.
- When the inverted indexing jobs finish, you enter the Search section.
- Here you type your queries and see the job execution info for each one.
- You then get the result of the query.
- To exit, press Ctrl+D.
- You can now stop Hadoop in the second terminal (bin/stop-all.sh).
- The inverted index is a positional inverted index. To create it, the documents must be read in a compressed format so that Hadoop feeds one document per mapper.
- Boolean queries support the operators and, or, not and parentheses ( ).
- The results show only the document names.
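
To illustrate the last three notes, here is a minimal in-memory sketch of a positional inverted index and of boolean query evaluation with and, or, not and parentheses. It is not the DocIRHadoop implementation: it assumes Python 3, runs without Hadoop, and uses a small hand-written parser instead of pyparsing. The function names (build_positional_index, search) and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {document name: [positions of the term]}."""
    index = defaultdict(dict)
    for name, text in docs.items():
        for position, term in enumerate(re.findall(r"\w+", text.lower())):
            index[term].setdefault(name, []).append(position)
    return index


def search(query, index, all_docs):
    """Evaluate a boolean query and return the matching document names.

    Precedence, from lowest to highest: or, and, not, parentheses.
    """
    tokens = re.findall(r"\(|\)|\w+", query.lower())
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_or():
        result = parse_and()
        while peek() == "or":
            advance()
            result = result | parse_and()
        return result

    def parse_and():
        result = parse_not()
        while peek() == "and":
            advance()
            result = result & parse_not()
        return result

    def parse_not():
        if peek() == "not":
            advance()
            # "not" is the complement with respect to all indexed documents.
            return all_docs - parse_not()
        return parse_atom()

    def parse_atom():
        token = peek()
        if token is None or token in (")", "and", "or"):
            raise ValueError("unexpected token: %r" % (token,))
        advance()
        if token == "(":
            result = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return result
        # A plain term matches every document that contains it.
        return set(index.get(token, {}))

    result = parse_or()
    if peek() is not None:
        raise ValueError("unexpected token: %r" % (peek(),))
    return sorted(result)


# Hypothetical sample documents, keyed by document name.
docs = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog",
    "c.txt": "quick dog quick",
}
index = build_positional_index(docs)
print(index["quick"])                                  # {'a.txt': [1], 'c.txt': [0, 2]}
print(search("quick and not dog", index, set(docs)))   # ['a.txt']
```

As in the notes above, the query result contains only document names. The positions stored in the index are not used by these boolean operators; they would matter for phrase or proximity matching.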