Web crawler and indexer
**Project for Information Retrieval course**
**By: Anton Nechaev, Yuri Prezument**
Uses the Porter2 stemming algorithm; see porter2.py.
The crawler uses breadth-first search (BFS). The seed page is added to the queue first. Each
page taken from the queue is scanned, and if it contains at least one of the query terms, every
anchor link on the page is added to the queue. The process repeats for every page in the queue
until the page limit (50 by default) is reached or the queue is empty.
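The traversal described above can be sketched roughly as follows. This is an illustration, not the code in crawler.py: `crawl` and `fetch` are hypothetical names, `fetch(url)` is assumed to return a `(text, links)` pair or `None`, the limit is assumed to count pages scanned, and the `seen` set (to avoid queuing a URL twice) is an added assumption.

```python
from collections import deque


def crawl(seed, query_terms, fetch, max_pages=50):
    """Breadth-first crawl starting from seed; returns URLs in visit order.

    fetch(url) is a hypothetical callable returning (text, links) for a page,
    or None if the page cannot be retrieved.
    """
    queue = deque([seed])
    seen = set([seed])   # assumption: never queue the same URL twice
    visited = []         # pages scanned so far, in BFS order
    terms = [t.lower() for t in query_terms]

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        visited.append(url)
        words = set(text.lower().split())
        # Follow links only from pages that mention at least one query term.
        if any(t in words for t in terms):
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return visited
```

Passing `fetch` as an argument keeps the queue logic separate from downloading and parsing, so it can be exercised with an in-memory dictionary of pages.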
The crawler completely ignores robots.txt files.
Pages with a MIME type other than text or XML are ignored to avoid downloading large non-text
files such as PDFs or archives.
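One way such a filter might look is shown below. The function name `is_text_like` is hypothetical, and the exact rule (accept `text/*` plus types ending in `/xml` or `+xml`) is an assumption about what "text or XML" means here, not the check used in crawler.py.

```python
def is_text_like(content_type):
    """Return True if a Content-Type header value looks like text or XML."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";")[0].strip().lower()
    # Assumed rule: any text/* type, or an XML type such as application/xml
    # or application/xhtml+xml.
    return (mime.startswith("text/")
            or mime.endswith("/xml")
            or mime.endswith("+xml"))
```

Checking the header before reading the response body is what avoids downloading the large file in the first place.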
The results (the pages visited and the inverted index) are written to an HTML file.
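For readers unfamiliar with the term, an inverted index maps each term to the pages that contain it. The sketch below is a minimal illustration under stated assumptions: `build_index` is a hypothetical name, pages are given as plain text, tokens are runs of lowercase letters and digits, and the `stem` argument is a placeholder for a stemmer such as Porter2 (it defaults to the identity function and does not reflect the interface of porter2.py).

```python
import re
from collections import defaultdict


def build_index(pages, stem=lambda word: word):
    """Map each stemmed term to the sorted list of URLs whose text contains it.

    pages: dict of url -> plain text.
    stem:  placeholder for a real stemmer; identity by default.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        # Assumed tokenisation: lowercase alphanumeric runs.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[stem(word)].add(url)
    # dict(generator) rather than a dict comprehension keeps this
    # compatible with Python 2.6.
    return dict((term, sorted(urls)) for term, urls in index.items())
```

For example, `build_index({"u1": "Crawling the web", "u2": "The web index"})` lists both URLs under "web" and only "u1" under "crawling".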
The parsing is done using Beautiful Soup.
Compatible with Python 2.6 and 2.7.
crawler.py seed query [max-pages] [index-file]
'max-pages' and 'index-file' are optional and default to 50 and "out.html" respectively.
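The argument handling described above could be sketched like this. `parse_args` is a hypothetical helper, not necessarily how crawler.py reads its arguments; the query is kept as a single string because the usage line does not say how it is split into terms.

```python
def parse_args(argv):
    """Parse [seed, query, max-pages?, index-file?] (argv without the script name)."""
    if len(argv) < 2 or len(argv) > 4:
        raise ValueError("usage: crawler.py seed query [max-pages] [index-file]")
    seed, query = argv[0], argv[1]
    # Optional arguments fall back to the documented defaults.
    max_pages = int(argv[2]) if len(argv) > 2 else 50
    index_file = argv[3] if len(argv) > 3 else "out.html"
    return seed, query, max_pages, index_file
```

In a script this would typically be called as `parse_args(sys.argv[1:])`.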