python crawl_trial.py will launch the crawl accroding to parameters declared in crawl_parameters.yml file or any yaml file declared as an argument of crawl_trial.py. The crawler uses the seed sites found in the list of files of a given repertory (path) as well as a query that will be used to validate new webpages (query) found during the crawling process. inlinks_min define the minimum number of citation a page should accumulate before being considered as a candidate to enter the corpus. depth parameter defines the number of steps of corpus extension performed from the initial corpus made of seeds webpages.
required modules: urllib2,BeautifulSoup, urlparse, sqlite3, pyparsing,urllib, random, multiprocessing,lxml,socket, decruft, feedparser, pattern,warnings, chardet, yaml
* better scrapping of webpages, for the moment decruft is doing great but can be enhanced (http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/)
* automatically extract dates from extracted texts (good python solutions in english, still lacking in french)
* feed the db with cleaner and richer information (like domain name, number of views, etc)
* take into account the charset when crawling webpages
* crawl updating process
* automatic grab google links to initiate a crawl
* TBD grid-compliant code...
* clean the code (debug mode, documentation, etc.)
* write a comprehensive post_processing.py script to keep compatibility with other developments.
* monitoring and reporting (page retrieval problems, success, distributions, etc)
* modular architecture : include a better information extraction process.
* avoid downloading n times the same content. (md5 comparison)
* retry to download pages that could not be opened.
* targeted and careful crawl of each domain (only follow hypertext links with the ~query in the url or in the linkText)