TechTrends

TechTrends is the product of a student project. For a live preview go to techtrends.

Requirements

All software was tested under Ubuntu 13.04 but should also work on other Linux distributions as long as they fulfill the following requirements:

  • Python 2.7

This installation guide uses pip. pip is not required, since you can also install all packages by hand, but it is recommended for an easier installation.

Installation

  1. Clone the repository: git clone https://bitbucket.org/RaBrand/techtrends.git
  2. Install all libraries: pip install -r requirements.txt
  3. Create the database: sqlite3 links.db < sql/schema.sql
  4. Install the stopwords and WordNet corpora for NLTK. For more information read here.
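Step 4 can also be done from Python. The following is a minimal sketch, assuming the corpora are published under the NLTK identifiers stopwords and wordnet. The helper name download_corpora and its download parameter are hypothetical and not part of this repository.

```python
def download_corpora(names=("stopwords", "wordnet"), download=None):
    """Download each named NLTK corpus and report success per corpus.

    `download` is a hypothetical hook for substituting another downloader;
    by default nltk.download is used.
    """
    if download is None:
        import nltk  # imported lazily, so NLTK is only needed when downloading
        download = nltk.download
    results = {}
    for name in names:
        # nltk.download returns a truthy value when the corpus is available
        results[name] = bool(download(name))
    return results
```

Calling download_corpora() once after pip install -r requirements.txt should be enough, since NLTK keeps downloaded corpora on disk.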

Optional

For better content extraction, we recommend installing python-boilerpipe. Follow the installation instructions in its GitHub repository. If you don't install boilerpipe, our scripts fall back to a modified version of arc90's Readability algorithm written in pure Python.
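The fallback described above follows the usual optional-import pattern. The sketch below only illustrates that pattern and is not the code used in this repository: fallback_extract is a hypothetical stand-in that merely strips tags, not the modified Readability algorithm, and the choice of ArticleExtractor is an assumption.

```python
import re

try:
    # Provided by python-boilerpipe (optional dependency)
    from boilerpipe.extract import Extractor
    HAVE_BOILERPIPE = True
except ImportError:
    HAVE_BOILERPIPE = False


def fallback_extract(html):
    """Hypothetical stand-in for the pure-Python fallback: strips markup only."""
    # Drop script and style elements together with their contents
    text = re.sub(r"(?s)<(script|style).*?</\1>", " ", html)
    # Replace all remaining tags with a space
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse runs of whitespace
    return " ".join(text.split())


def extract_text(html):
    """Use boilerpipe when it is installed, otherwise the fallback."""
    if HAVE_BOILERPIPE:
        return Extractor(extractor="ArticleExtractor", html=html).getText()
    return fallback_extract(html)
```

Checking for the import once at module load keeps the rest of the code free of conditional logic: callers only ever use extract_text.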

Configuration

You need to set several configuration parameters before you can start running the scripts:

  • DB_FILE: the SQLite database file that you created from the supplied schema in sql/schema.sql
  • SIMILARITY_SERVER: the path where the similarity server should be stored
  • CACHE: the location where preprocessed documents are cached. Preprocessing takes a very long time, so documents are not preprocessed on every run.
  • DEBUG: determines whether the web server runs in debug mode. Make sure to deactivate debug mode for production.
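As an illustration, the four parameters might be set as follows. This is only a sketch: where these settings live in the repository is not stated here, and the directory names similarity_server and cache are assumptions. Only links.db matches the file created during installation.

```python
import os

# Assumed base directory: wherever the scripts are started from
BASE_DIR = os.getcwd()

# SQLite file created with: sqlite3 links.db < sql/schema.sql
DB_FILE = os.path.join(BASE_DIR, "links.db")

# Where the trained similarity server is stored (directory name is an assumption)
SIMILARITY_SERVER = os.path.join(BASE_DIR, "similarity_server")

# Where preprocessed documents are cached (directory name is an assumption)
CACHE = os.path.join(BASE_DIR, "cache")

# Keep debug mode off in production
DEBUG = False
```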

Workflow

  1. Fill your database by running python start_scraper.py. You need at least 1000 database entries before you can go to the next step. Do not run the script too often, or your IP will get banned. We collected about 500 links per week, so expect to run the scraper for at least 2 weeks.
  2. Run python train_server.py. This will train the similarity server.
  3. Run python start_daemon.py. This will start the similarity server, which you can then query.
  4. Start the webserver with python start_webserver.py.
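To check whether step 1 has collected enough entries before you start training, you can count the rows in the database. The sketch below makes no assumption about the table names in sql/schema.sql and simply reports every table it finds. The helper count_rows is hypothetical and not part of this repository.

```python
import sqlite3


def count_rows(db_path):
    """Return a dict mapping each user table in the database to its row count."""
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master lists all tables; skip SQLite's internal ones
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )]
        counts = {}
        for table in tables:
            # Quote the identifier, doubling any embedded double quotes
            quoted = '"%s"' % table.replace('"', '""')
            counts[table] = conn.execute(
                "SELECT COUNT(*) FROM %s" % quoted
            ).fetchone()[0]
        return counts
    finally:
        conn.close()
```

For example, count_rows("links.db") returns one count per table. Once the table holding the links reaches about 1000 rows, you can move on to step 2.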

For production you probably want to use nginx or Apache in combination with an application server such as uWSGI. Point your application server to start_webserver.py.