Source

talks / pandas / Data.ipynb

{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<style>\n",
      "img[alt=Miki], img[alt=Eric] {\n",
      "    width: 50px;\n",
      "}\n",
      "</style>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Pandas\n",
      "------\n",
      "\n",
      "[pandas](http://pandas.pydata.org/) is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#This Workshop\n",
      "We'll show you the basics on how to work with data. First part will (by ![Miki](http://intranet/badge/tebekamiki.jpg \"Miki\")) will focus on cleaning data and second part (by ![Eric](http://intranet/badge/klusmaneric.jpg \"Eric\")) will focus on analytics and visualization and on using pandas for to do QA of Tableau reports.\n",
      "\n",
      "We highly recommend [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) by Wes McKinney (pandas main developer); it is an excellent book.\n",
      "\n",
      "# IPython Notebook\n",
      "This is an [IPython notebook](http://ipython.org/notebook.html): you can do many cool things with it - including running code!\n",
      "\n",
      "Note that this notebook server was started with `--pylab=inline` for some import shortcuts and inline graphs."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Getting Data\n",
      "We'll use an [sqlite3](http://docs.python.org/2/library/sqlite3.html) database. Pandas has a [read_frame](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.sql.read_frame.html) function that get data from an SQL connection to a [DataFrame](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html).\n",
      "\n",
      "Download [the database](http://dm-miki.s3.amazonaws.com/talks/edw.sqlite.bz2) and uncompress it. Then use `sqlite3` and `read_frame` to load the table `FACT_OMNITURE_PAGE_VIEWS` to a DataFrame called `df`. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# You code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# First Look at the Data\n",
      "\n",
      "## `df` value"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# You code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Columns"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# You code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Some Values from DAY_KEY, SITE_KEY and IP_ADDRESS"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# You code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Data Types"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# You code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Cleaning Data\n",
      "\n",
      "You should know you data. However, everything that is an `object` is probably a string and need to be checked.\n",
      "\n",
      "For example `EVENT_DTM` (event date time) should be a [datetime](http://docs.python.org/2/library/datetime.html) object.  Check whether that's currently the case or not."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# You code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Fix `EVENT_DTM` column to be of `pandas.Timestamp` object. You'll find [pandas.to_datetime](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.tseries.tools.to_datetime.html#pandas.tseries.tools.to_datetime) helpful (you can ignore the milliseconds for simplicity)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# You code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Cleanup `GOOGLE_RANK` to be a number or `nan`."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# You code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Joining Tables\n",
      "\n",
      "In the current DataFrame, we have only `SITE_KEY`, however we'd like to have the site name as well. Load another DataFrame from the `DIM_SITE` table and join it with the previous data frame on `SITE_KEY`.\n",
      "\n",
      "See [pandas.merge](http://pandas.pydata.org/pandas-docs/dev/merging.html#database-style-dataframe-joining-merging) function."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# You code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Initial Insights\n",
      "\n",
      "## Most Popular Site\n",
      "\n",
      "Find the most popular site by using [pandas.value_counts](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html#pandas.Series.value_counts) on `PROPERTY_NAME`."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Your code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Busiest Hour\n",
      "Find the busiest hour."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Your code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Saving\n",
      "\n",
      "Save `joined` to disk to a file called `joined.p`"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Your code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 12
    }
   ],
   "metadata": {}
  }
 ]
}