Overview

.. Repository located at: https://bitbucket.org/alxbl/mycrawl
   Authors: 
            Alexandre Beaulieu <alxbl03@gmail.com>
            Pier-Luc Gagnon <gagnon.pierluc@gmail.com>
            Darren Malek <darren.malek@mail.mcgill.ca>
            
    Unless mentioned otherwise, the code is Copyright (C) 2011 by the above
    authors. See LICENSE for more details.

myCrawl
=======

``myCrawl`` is a simple distributed crawling framework. It aims to be modular
enough to be used with many different sites, although it is still in its very
early stages. The full specification document is included in the ``docs/``
folder and should be consulted for more detail about how myCrawl works. This
README only aims to give the basic understanding necessary to set up and run
the system.

``myCrawl`` requires Python 3.0 or higher and will not launch if your version
of the Python interpreter is older. To launch the crawler, simply type::

    ./myCrawl
    
in the root package folder. ``myCrawl`` will use the configuration files to
automatically determine which node type it should act as and how it should
operate.
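
The launcher presumably enforces the interpreter requirement with a version
guard along these lines (a sketch, not necessarily the actual startup code)::

    import sys

    # Refuse to run on any interpreter older than Python 3.0.
    if sys.version_info < (3, 0):
        sys.exit("myCrawl requires Python 3.0 or newer.")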

1. Configuring a Node
---------------------

The system needs at least two nodes to operate. Each node in myCrawl is its own
installation of the crawler and must be configured independently of the others.

You must have a single master node and any number of slave nodes. Slave nodes
do not have to be rooted directly at the master and can have child nodes of
their own, which allows the work to be distributed very efficiently.

There are 3 very important options to configure independently for each node::

    CONFIG = {
        # ...
        
        # The local interface to bind the node to.
        #
        # e.g. ('0.0.0.0', 45453)
        'BIND_INTERFACE': ('0.0.0.0', 40500),
        
        # A list of 2-tuples ('ip', port) for all the child nodes below this one.
        # If the node is a slave node (leaf), this should be an empty list or None.
        #
        # e.g. [('233.215.23.62', 45453),('233.210.33.3', 45453)]
        # e.g. None
        'CHILDREN_NODES': [('localhost', 40502),('localhost', 40503)],
        
        # The address of this node's parent. If the node is the master, then it should
        # have this field set to None.
        #
        # e.g. None
        # e.g. ('215.132.126.6', 45453)
        'PARENT_NODE': None, # None for master,
        
        # ...
    }
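
For illustration, a hybrid node (one that has both a parent and children of
its own) would simply fill in both fields. The addresses and ports below are
arbitrary placeholders, not values from a real deployment::

    CONFIG = {
        # ...
        'BIND_INTERFACE': ('0.0.0.0', 40501),
        'CHILDREN_NODES': [('localhost', 40502), ('localhost', 40503)],
        'PARENT_NODE':    ('localhost', 40500), # this node reports to the master
        # ...
    }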

When bringing the system up, it is important to start the slave nodes first,
followed by the hybrid nodes, and lastly the master node.

2. Data Storage
---------------

Collected data goes into the SQLite3 database file specified in ``settings.py``.
The database structure is as follows::

    CREATE TABLE IF NOT EXISTS crawl (
        uid           INTEGER, 
        crawl_date    TEXT, 
        friend_count  INTEGER, 
        gender        TEXT, 
        age           INTEGER, 
        last_activity TEXT, 
        type          TEXT, 
        PRIMARY KEY   (uid, crawl_date)  
    )
    
    CREATE TABLE IF NOT EXISTS meta (
        last_user        INTEGER, 
        last_crawl       INTEGER, 
        last_active_user INTEGER, 
        FOREIGN KEY      (last_user)        REFERENCES crawl(uid), 
        FOREIGN KEY      (last_active_user) REFERENCES crawl(uid)
    )
    
The database can be used by any client to compute aggregates or otherwise
process the information as desired.
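
For example, a minimal client that computes per-gender aggregates could look
like the sketch below (the database file name is a placeholder; use whatever
path is configured in ``settings.py``)::

    import sqlite3

    # Open the crawl database (the path here is hypothetical).
    conn = sqlite3.connect("mycrawl.db")

    # Number of distinct users and average friend count, grouped by gender.
    query = ("SELECT gender, COUNT(DISTINCT uid), AVG(friend_count) "
             "FROM crawl GROUP BY gender")
    for gender, users, avg_friends in conn.execute(query):
        print(gender, users, avg_friends)

    conn.close()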

3. Extra Documentation
----------------------

You can consult architectural diagrams in the ``docs/`` folder. Additionally, 
you should take a look at the functional specification document located at
``docs/spec.pdf`` in order to see what the project aims to implement and what is
currently implemented. The specification document also includes a section on
future work that could expand on this crawling framework.

4. Forking and Expanding
------------------------

By all means, feel free to fork the project and experiment with it on your own.
It is currently at a very early stage and very specific to one particular
target.

The biggest limitation at the moment is the inflexibility of the database
store, which could be extended to handle dynamically changing information from
a wide variety of targets.

For any additional questions, please do not hesitate to contact one of the
developers.

5. Compatibility
----------------

``myCrawl`` was tested and works under Windows 7 (64-bit) and Arch Linux
(64-bit), and it does not use any platform-dependent Python 3 features. As
such, it should in theory run on any platform for which a Python 3 interpreter
exists.