
Welcome

Welcome to the pwc wiki. "pwc" is an acronym for Py Web Client. It aims to be a Python implementation of the excellent hurl.it tool.

Python Web Client project proposal

  • inspired by Hurl (which is written in Ruby)
  • as the Hurl home page says: "Enter a URL, set some headers, view the response, then share it with others. Perfect for demoing and debugging APIs."

Some requirements

  • Web application should be installable locally (version 2 could be a port to Google App Engine)
  • application should provide a simple way to specify HTTP request elements:
    • the HTTP method (GET, POST, PUT, DELETE, HEAD)
    • the HTTP query string + parameters
    • extra HTTP headers
    • the HTTP request payload (for example for JSON or XML payloads)
  • application should provide a way to save HTTP request operations (including parameters, extra headers, payload etc.) as their own URL, which can then be invoked at a later time to replay that specific operation
  • it would be nice to use AJAX effects for the GUI elements on the main page (drop downs, text fields), just as Hurl does

Some technology ideas

  • Python Web application framework choice: Django, Tornado, CherryPy, fapws
  • Data store choice (a key/value data store would be sufficient): Redis, Tokyo Cabinet/Tyrant
  • AJAX/JavaScript: Pyjamas
  • code highlighting: Pygments
  • testing: nose, twill

Object Model

Before any coding begins we need to break the application down into its parts:

  • How are we going to break down an input URL? (see the urlparse sketch below)
  • How are we going to store the request method and the query parameters?
  • Will each piece be stored as its own model with granular fields or as fields/attributes of a single model?
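
On the first question, the standard library's urlparse module can probably do most of the work for us; a minimal sketch (the example URL is made up):

import urlparse

raw_url = 'http://example.com/path/to/whatever?foo=a&bar=b'
parts = urlparse.urlparse(raw_url)

print parts.scheme                      # 'http' -> the URL "type"
print parts.netloc                      # 'example.com' -> hostname
print parts.path                        # '/path/to/whatever'
print urlparse.parse_qsl(parts.query)   # [('foo', 'a'), ('bar', 'b')] -> query args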

Current version notes

Hurl source code

Hurl was recently open-sourced.

Data Storage

  • Like Hurl, I think we should use Redis.
  • The Hurl author wrote an article on sorting data with Redis.
  • There is a Python library for Redis called redis-py. You can also install it from MacPorts or Apt (a connection sketch follows below). Install the server too, because the library is useless without it:
    • MacPorts: sudo port install redis py26-redis
    • Apt: sudo apt-get install redis-server python-redis
  • A case study on using Redis as the database for a Twitter-like webapp.

(The latest push of redis-py just added a .decode hook for subclasses to add deserialization logic.)
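
All of the r.* calls in the sample code on this page assume a redis-py client connected to a local server, something like:

import redis

r = redis.Redis(host='localhost', port=6379, db=0)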

UI Sugar

Redis Data Layout

USERS

Schema:

INCR global:nextUserId => 1000
SET  uid:1000:username 'jathanism'
SET  uid:1000:password 'password'
SET  username:jathanism:id 1000
SADD global:users 1000

Sample code:

username = 'jathanism'
password = 'password'

# allocate a new user id, then store the user's fields
new_user_id = r.incr("global:nextUserId")
r.set("uid:%s:username" % new_user_id, username)
r.set("uid:%s:password" % new_user_id, password)
r.set("username:%s:id" % username, new_user_id)
r.sadd("global:users", new_user_id)

# look the id back up by username
user_uid = r.get("username:%s:id" % username)

REGISTRATION

NYI (not yet implemented)

AUTHENTICATION

User id is stored as a secure cookie named 'pwc_auth' upon a successful login.
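
If we end up using tornado from the technology list above, its built-in secure cookie support maps onto this directly. A rough sketch (the handler, its URL, and the login flow are hypothetical; r is the Redis client from the Data Storage section, and tornado's set_secure_cookie also requires a cookie_secret in the application settings):

import tornado.web

class LoginHandler(tornado.web.RequestHandler):
    def post(self):
        username = self.get_argument('username')
        password = self.get_argument('password')
        # look up the user id, then compare the stored password
        uid = r.get('username:%s:id' % username)
        if uid is not None and r.get('uid:%s:password' % uid) == password:
            # tornado signs the cookie value, so it can't be forged
            self.set_secure_cookie('pwc_auth', uid)
            self.redirect('/')
        else:
            self.redirect('/login')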

URLS

NYI - but brainstorming ahead!

URL information is stored as a JSON blob of relevant information (auth info, request method, url, and unique id). The unique id is a SHA224 hash of the blob, which is added to the JSON blob *AFTER* hashing but *BEFORE* storage. We will store each URL keyed by its hash.

Each user will have a set of the URLs they have added so they are easily referenced, with the unix timestamp of the creation of that URL as the value. URL hashes need not be unique to each user, but should be unique in the datastore.

Schema:

SETNX "urlhash" "json_blob"
SADD "user:1:urls" "urlhash"
SET "user:1:urls:urlhash" unixtime

Sample code:

import hashlib
import time

import simplejson as json
urljson = """ {
    "auth": "none",
    "method": "GET",
    "url": "http://github.com/api/v2/json/user/show/defunkt"
}"""

urlhash = hashlib.sha224(urljson).hexdigest()
# urlhash => 'da0838427e5fef02921ca2346f59480da2e9e5d9'
urlinfo = json.loads(urljson)
urlinfo['id'] = urlhash
r.set(urlhash, json.dumps(urlinfo), preserve=True)  # preserve=True gives SETNX semantics

# add the url to user set
r.sadd("user:1:urls", urlhash)

# store timestamp for urlhash for user
unixtime = int(time.time())
r.set("user:1:urls:%s" % urlhash, unixtime)

Get URL info:

>>> print r.get('da0838427e5fef02921ca2346f59480da2e9e5d9')
{
   "auth": "none",
   "id": "da0838427e5fef02921ca2346f59480da2e9e5d9",
   "method": "GET",
   "url": "http://github.com/api/v2/json/user/show/defunkt"
}

Get the timestamp keys for the URLs added by a user:

>>> r.keys("user:1:urls:*")
[u'user:1:urls:2e78ea11c344c605c48bb5b607671ca49f9a9c35']

Get the timestamps for those URLs:

>>> for u in r.smembers('user:1:urls'):
...     print 'urlhash:', u, 'timestamp:', r.get('user:1:urls:%s' % u)
... 
urlhash: 2e78ea11c344c605c48bb5b607671ca49f9a9c35 timestamp: 1266277164
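
If the one-GET-per-URL loop above ever matters for performance, the same timestamps can be fetched in a single round trip with MGET; a sketch against the same keys:

hashes = list(r.smembers('user:1:urls'))
timestamps = r.mget(['user:1:urls:%s' % h for h in hashes])
for urlhash, stamp in zip(hashes, timestamps):
    print 'urlhash:', urlhash, 'timestamp:', stamp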

Using Py-cURL for URL fetching

Hurl uses curl. PyCurl is more complex than its standard-library counterparts (httplib, urllib, urllib2), but much more powerful. Specifically, PyCurl allows capturing the full outbound and incoming headers for display, as Hurl does. I was unable to figure out how to get the outbound headers easily using the other solutions.

Using it would go something like this:

import pycurl

# this is a personal site on which the index has a 302 redirect... great for testing!
url = 'http://00bliss.com/'

c = pycurl.Curl()

# follow redirects?
follow_redirects = 1
c.setopt(pycurl.FOLLOWLOCATION, follow_redirects)

c.setopt(pycurl.URL, url)

# collect headers
sent_headers = list()
received_headers = list()
def collect_headers(debug_type, debug_msg):
    if debug_type == pycurl.INFOTYPE_HEADER_OUT:
        sent_headers.append(debug_msg)
    if debug_type == pycurl.INFOTYPE_HEADER_IN:
        received_headers.append(debug_msg)

# VERBOSE must be enabled or libcurl never invokes the DEBUGFUNCTION callback
c.setopt(pycurl.VERBOSE, 1)
c.setopt(pycurl.DEBUGFUNCTION, collect_headers)

c.perform()

Which results in:

>>> print ''.join(sent_headers)
GET / HTTP/1.1
User-Agent: PycURL/7.19.5
Host: 00bliss.com
Accept: */*

GET /podcast/oobliss.html HTTP/1.1
User-Agent: PycURL/7.19.5
Host: 00bliss.com
Accept: */*

>>> print ''.join(received_headers)
HTTP/1.1 302 Found
Date: Tue, 16 Feb 2010 16:16:59 GMT
Server: Apache/2.0.54
Location: http://00bliss.com/podcast/oobliss.html
Vary: Accept-Encoding
Content-Length: 291
Content-Type: text/html; charset=iso-8859-1

HTTP/1.1 200 OK
Date: Tue, 16 Feb 2010 16:16:59 GMT
Server: Apache/2.0.54
Last-Modified: Mon, 25 Feb 2008 20:57:31 GMT
ETag: "7d0dec5-2dfa-d20624c0"
Accept-Ranges: bytes
Content-Length: 11770
Vary: Accept-Encoding,User-Agent
Content-Type: text/html

Capturing the body, per the PycURL callbacks documentation (http://pycurl.sourceforge.net/doc/callbacks.html):

    import pycurl

    ## Callback function invoked when body data is ready
    def body(buf):
        # Print body data to stdout
        import sys
        sys.stdout.write(buf)
        # Returning None implies that all bytes were written

    ## Callback function invoked when header data is ready
    def header(buf):
        # Print header data to stderr
        import sys
        sys.stderr.write(buf)
        # Returning None implies that all bytes were written

    c = pycurl.Curl()
    c.setopt(pycurl.URL, "http://www.python.org/")
    c.setopt(pycurl.WRITEFUNCTION, body)
    c.setopt(pycurl.HEADERFUNCTION, header)
    c.perform()
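
For pwc we will want the response body in a variable for rendering rather than dumped to stdout. Since the write callback just receives string chunks, accumulating them in a list works; a minimal sketch:

import pycurl

body_chunks = []

c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://www.python.org/')
c.setopt(pycurl.WRITEFUNCTION, body_chunks.append)  # each chunk lands in the list
c.perform()
c.close()

body = ''.join(body_chunks)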

Model brainstorming

After digging around in the source code for Hurl.it, the detailed model below seems like overkill. They use a simple Redis k/v store and don't put so much emphasis and detail into the storage. A URL is simply a SHA-encoded string. We don't care if they are unique. (A collapsed sketch follows the field lists below.)

URL

  • token (unique identifier)
  • raw_url (self-explanatory)
  • type (http, etc)
  • hostname (fqdn)
  • raw_path (/path/to/whatever?foo=a&bar=b)
  • path (/path/to/whatever)
  • args (m2m of Argument)
  • request_method (enum of GET, POST, PUT, DELETE, HEAD)
  • headers (m2m of Header)
  • auth (pk Authentication)
  • payload (one of Payload)

Header

  • url_token (associated URL)
  • name
  • value

Authentication

  • url_token
  • username
  • password
  • type (HTTP Basic, Digest, None, etc)

Payload

  • url_token
  • content_type (how to process data)
  • data (blob)

User

  • name
  • email
  • password
  • username
  • urls (m2m of URL)
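
Putting the fields above together, a stored URL record could collapse into a single blob, in keeping with the "simple k/v store" observation. A sketch with made-up values (r and json as in the earlier samples):

url_record = {
    'token': 'abc123',                     # hypothetical SHA token
    'raw_url': 'http://example.com/path/to/whatever?foo=a&bar=b',
    'type': 'http',
    'hostname': 'example.com',
    'raw_path': '/path/to/whatever?foo=a&bar=b',
    'path': '/path/to/whatever',
    'args': [('foo', 'a'), ('bar', 'b')],
    'request_method': 'GET',
    'headers': [{'name': 'Accept', 'value': '*/*'}],
    'auth': {'type': 'none'},
    'payload': None,
}
r.set(url_record['token'], json.dumps(url_record))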
