JRuby Spider

About

Spider is a fast, powerful, and flexible site crawler for JRuby, built on the anemone Ruby gem.

Building

Prerequisites:

Install JRuby: http://jruby.org/
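
The build steps below call hg, jgem, and java, so it can help to confirm they are on your PATH first. This is a minimal sketch. The function name missing_tools is hypothetical and not part of Spider.

```shell
#!/bin/sh
# Hypothetical helper: print the name of each given command that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check the commands used by the build steps in this README.
missing=$(missing_tools hg jgem java)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All build tools found"
fi
```

The warble command is not checked here because it only becomes available after the warbler gem is installed.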

Grab the source

Download the source:

hg clone https://bitbucket.org/jgnagy/spider
cd spider/vendor/cache

Gem action

Install the required Ruby gems:

jgem install jruby-openssl
jgem install warbler
jgem install w3c_validators
jgem install `ls -1 anemone-*.gem | tail -n 1`
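
The last command installs whichever anemone gem file sorts last by name in vendor/cache. Plain name ordering can misplace versions once a component reaches two digits (0.10.0 sorts before 0.9.0). If the intent is to pick the highest version, something like the following sketch may be safer. It assumes your sort supports the -V (version sort) option, as GNU sort does. The function name latest_anemone_gem is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the path of the highest-versioned anemone gem
# found in the given directory (defaults to the current directory).
latest_anemone_gem() {
    dir=${1:-.}
    # sort -V orders version numbers numerically, so 0.10.0 comes after 0.9.0.
    ls -1 "$dir"/anemone-*.gem | sort -V | tail -n 1
}

# Usage, from inside spider/vendor/cache:
#   jgem install "$(latest_anemone_gem .)"
```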

Build it

Build the JAR file:

cd ../..
warble compiled jar

Running

Run using Java:

java -jar spider.jar -u http://www.mysite.com

Getting Help

Straight from the app:

java -jar spider.jar -h

From the Bitbucket project:

https://bitbucket.org/jgnagy/spider/wiki

From right here:

Usage: spider [options]
    -r, --[no-]robots                Obey robots.txt
    -c, --[no-]cookies               Accept Cookies
        --cookie-merge cookie        Merge a cookie (beta), implies -c
    -v, --[no-]verbose               Verbose Output
    -Q, --[no-]query-strings         Include Query Strings
    -o, --output filename            Output filename
    -t, --threads count              Thread pool size per Site Base URL
    -H, --host hostheader            Host Header
    -u, --url rooturl                Site Base URL
    -U, --user-agent useragent       User Agent
        --exclude pattern            Exclude Pattern for URLs
        --[no-]validate              Validate Document against DTD (beta)
    -A, --analytics-search pattern   Require pattern on every page (beta)
    -h, --help                       Display this screen
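
The options above can be combined in one invocation. The sketch below prints a command line that obeys robots.txt (-r), uses a thread pool of 4 (-t), and writes to an output file (-o). The thread count, the output filename, and the helper name spider_cmd are illustrative assumptions. The format of the output file is not described here.

```shell
#!/bin/sh
# Hypothetical helper: print a spider command line for one site.
# Uses only flags that appear in the usage text: -u, -r, -t, -o.
spider_cmd() {
    url=$1
    out=$2
    printf 'java -jar spider.jar -u %s -r -t 4 -o %s\n' "$url" "$out"
}

# Dry run: print the command so it can be inspected before running it.
spider_cmd http://www.mysite.com mysite-report.txt
```

Printing the command first makes it easy to review the flags. Copy the printed line into your shell to start the crawl.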

Source code

The official source, issue tracker, and wiki can be found at https://bitbucket.org/jgnagy/spider

License

Spider is released under the GPLv3 license.