bliubliu_crawler

Overview

The Bliu Bliu API is accessible via the HTTP POST method.

To get an API key, go to bliubliu.com, register, and open your dashboard (http://bliubliu.com/en/dashboard/). At the bottom of the page you will see a “Developer API key” box containing a code like:

holQOyQcggxwr1wyUVUxBIjhK3k9ZFAQQgDUyz8PGyA

Now you will be able to access the API methods. Currently only one is available: content uploading.

===

Content uploading

URL

http://bliubliu.com/api/import-text/YOUR-API-KEY/

REQUIRED PARAMS

  • body - the text itself
  • locale - language code of the text (for example en)
  • title - title of the text
  • original_source - URL of the page the text was taken from
  • collection - name of the collection the text belongs to (for example the source website, bbc.co.uk)

OPTIONAL PARAMS

  • youtube_url - URL to a youtube.com video, if available
  • image - URL to an image, if available
  • author - author of the text
  • tags - a list of tags that apply to the text (sports, economy, quotes, etc.)
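
For illustration, a payload that also uses the optional parameters might look like this (the values reuse the brainyquote.com example from the scraping guide below; the youtube_url is only a placeholder):

{
   "body":"Film spectators are quiet vampires.",
   "locale":"en",
   "title":"Quote",
   "original_source":"http://www.brainyquote.com/quotes/quotes/j/jimmorriso109336.html",
   "collection":"brainyquote.com",
   "author":"Jim Morrison",
   "tags":[
      "quote"
   ],
   "youtube_url":"http://www.youtube.com/watch?v=VIDEO-ID"
}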

REQUEST STRUCTURE

You need to send an HTTP POST request with a data parameter, which contains a JSON associative array of the text data.

JSON DATA EXAMPLE

{
   "body":"Ashes rivals England and Australia will meet on the opening day of the 2015 World Cup after being drawn in the same group.",  
   "locale":"en",
   "tags":[
      "sports"
   ],
   "title":"England face Australia at World Cup",
   "original_source":"http://www.bbc.co.uk/sport/0/cricket/23493953",
   "image":"http://news.bbcimg.co.uk/media/images/69014000/jpg/_69014373_cricketworldcupafp.jpg",
   "collection":"bbc.co.uk"
}

CURL REQUEST EXAMPLE

curl -X POST \
 --data 'data={"body": "Ashes rivals England and Australia will meet on the opening day of the 2015 World Cup after being drawn in the same group.", "locale": "en", "tags": ["sports"], "title": "England face Australia at World Cup", "original_source": "http://www.bbc.co.uk/sport/0/cricket/23493953", "image": "http://news.bbcimg.co.uk/media/images/69014000/jpg/_69014373_cricketworldcupafp.jpg", "collection": "bbc.co.uk"}' \
 http://bliubliu.com/api/import-text/YOUR-API-KEY/

PYTHON REQUEST EXAMPLE (uses python-requests)

import requests
import json

data = json.dumps({
    'title': 'England face Australia at World Cup',
    'body': 'Ashes rivals England and Australia will meet on the opening day of the 2015 World Cup after being drawn in the same group.',
    'original_source': 'http://www.bbc.co.uk/sport/0/cricket/23493953',
    'locale': 'en',
    'collection': 'bbc.co.uk',
    'image': 'http://news.bbcimg.co.uk/media/images/69014000/jpg/_69014373_cricketworldcupafp.jpg',
    'tags': ['sports'],
})

response = requests.post(
    'http://bliubliu.com/api/import-text/YOUR-API-KEY/',
    data={'data': data}
)

print(response.status_code)
print(response.content)

RESPONSE STRUCTURE

The response is always a JSON associative array containing one element with the key msg.

If the request was processed successfully, the API returns an HTTP 200 (OK) status code with the JSON: {"msg": "OK"}

If request processing failed, the API returns an HTTP 400 (BAD REQUEST) status code with JSON where the value of the msg key explains why the request failed.
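
For example, with python-requests a client can branch on the status code and read the msg value. This is a minimal sketch; the payload reuses the brainyquote.com quote from the scraping guide below:

import json

import requests

data = json.dumps({
    'title': 'Quote',
    'body': 'Film spectators are quiet vampires.',
    'original_source': 'http://www.brainyquote.com/quotes/quotes/j/jimmorriso109336.html',
    'locale': 'en',
    'collection': 'brainyquote.com',
    'tags': ['quote'],
})

response = requests.post(
    'http://bliubliu.com/api/import-text/YOUR-API-KEY/',
    data={'data': data}
)

result = json.loads(response.content)
if response.status_code == 200:
    print(result['msg'])  # prints: OK
else:
    # On HTTP 400 the msg value explains why the request failed.
    print('Import failed: ' + result['msg'])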

===

Help Bliu Bliu to add more content

How to start scraping with Scrapy

To scrape websites you will need:

  • A terminal app (Mac OS X and Linux already have one; on Windows you should download the PuTTY client).
  • A Linux server (scraping from your own computer is not recommended, because it is a long-running task and requires a good internet connection).

Preparation

Connect to your server with your terminal app (if you do not have a server and want to help us, contact us at info@bliubliu.com and we will give you one):

ssh your-username@your-server.com

Install OS (Ubuntu) dependencies:

sudo apt-get install build-essential python-virtualenv \
    python-dev libxml2-dev libxslt-dev g++ libyaml-dev mercurial

Clone repository:

hg clone https://bitbucket.org/bliubliu/bliubliu_crawler

Navigate to cloned directory:

cd bliubliu_crawler

Create virtualenv:

virtualenv .

Activate virtualenv:

source bin/activate

Install Python dependencies:

pip install -r requirements.txt

Scraping a website

First, open ./bliubliu/bliubliu/settings.py and in the line

API_KEY = 'YOUR-API-KEY'

replace YOUR-API-KEY with your personal API key.

In the directory ./bliubliu/bliubliu/spiders/ you will find a demo spider written for you, brainyquote_com_spider.py, which scrapes the brainyquote.com website.

Navigate to the ./bliubliu/bliubliu/ directory and run the command:

scrapy list

You will see:

(bliubliu_crawler)user@server:~/bliubliu_crawler/bliubliu/bliubliu$ scrapy list
www.brainyquote.com

You can run the crawler by entering this command (where www.brainyquote.com is your spider's name):

scrapy crawl www.brainyquote.com

Now the spider will start to crawl the website and you should see output with the response from the API:

$ scrapy crawl www.brainyquote.com
API response: [200] OK
API response: [200] OK
API response: [200] OK

You can stop the spider by pressing Ctrl+C a few times.

Writing your own spider

Go to the spiders/ directory, duplicate brainyquote_com_spider.py, and rename it after the website you are scraping.

Open the file you just created and rename the class to the domain of the website you are scraping. Also change name, allowed_domains, start_urls and rules to match the website's structure.

Before starting the spider you should also change the XPath selectors to extract the correct data from the website.
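
To show how these pieces fit together, here is a rough sketch of such a spider, written against the same Scrapy API used in this guide. The TextItem class is a stand-in (the repository ships its own item class), and the link pattern in rules and the author selector are guesses; only the body selector is taken from the shell example below:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector


class TextItem(Item):
    # Stand-in item class; use the one defined in this repository.
    title = Field()
    body = Field()
    author = Field()
    locale = Field()
    original_source = Field()
    tags = Field()


class BrainyquoteComSpider(CrawlSpider):
    name = 'www.brainyquote.com'
    allowed_domains = ['www.brainyquote.com']
    start_urls = ['http://www.brainyquote.com/']
    # Follow links to single-quote pages and hand them to parse_item.
    rules = (
        Rule(SgmlLinkExtractor(allow=r'/quotes/quotes/'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = TextItem()
        # The body selector comes from the scrapy shell example below.
        item['body'] = hxs.select(
            '//div[@class="bq_fq bq_fq_lrg"]/p/text()').extract()[0]
        # The author selector is a guess; verify it in scrapy shell first.
        item['author'] = hxs.select(
            '//div[@class="bq_fq bq_fq_lrg"]//a/text()').extract()[0]
        item['title'] = 'Quote'
        item['locale'] = 'en'
        item['original_source'] = response.url
        item['tags'] = ['quote']
        return item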

To test your selectors you should use the scrapy shell command.

If you want to scrape brainyquote.com, find a page where only one quote appears, for example: http://www.brainyquote.com/quotes/quotes/j/jimmorriso109336.html

Now in shell type:

scrapy shell \
    http://www.brainyquote.com/quotes/quotes/j/jimmorriso109336.html

Now you are in a scrapy shell (to quit, press Ctrl+D).

You can try out your own XPath selectors by using the hxs object.

For example, to extract the quote body, try this selector:

In [1]: hxs.select('//div[@class="bq_fq bq_fq_lrg"]/p/text()').extract()[0]
Out[1]: u'Film spectators are quiet vampires.'

After finding the correct selector, write it down in your spider's file.

Find the correct selectors for all required fields and test them with this command (change it to match the website you are scraping and your spider's name):

scrapy parse http://www.brainyquote.com/quotes/quotes/j/jimmorriso109336.html \
 --spider=www.brainyquote.com --nolinks -c parse_item

You should see output like:

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'author': u'Jim Morrison',
  'body': u'Film spectators are quiet vampires.',
  'locale': 'en',
  'original_source': 'http://www.brainyquote.com/quotes/quotes/j/jimmorriso109336.html',
  'tags': ['quote'],
  'title': 'Quote'}]

This means that the selectors are correct and you can start your spider to scrape the website.

When you are done writing your spider, start the crawl:

scrapy crawl www.****.com