Source

talks / pandas / Data2.ipynb

{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Second Data Set\n",
      "We'll use a [csv](http://en.wikipedia.org/wiki/Comma-separated_values) file in this second part of the workshop. Pandas has a [read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function that get data from an SQL connection to a [DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).\n",
      "\n",
      "This data is generated by the following SQL:\n",
      "\n",
      "`\n",
      "SELECT\n",
      "    Date(Convert_tz(From_unixtime(o.order_time), 'UTC', 'US/Pacific')) AS date\n",
      "    ,Ifnull(p.product_name, 'GiftCard') AS product\n",
      "    ,Ifnull(c.country_name, '-') AS country\n",
      "\t,oi.order_item_amount AS price\n",
      "    ,oi.order_item_cost AS cost\n",
      "    ,SUM(oi.order_item_quantity) AS qty\n",
      "    ,SUM(oi.order_item_quantity * oi.order_item_amount) AS rev\n",
      "    ,SUM((oi.order_item_amount - oi.order_item_cost) * oi.order_item_quantity) AS gm\n",
      "FROM\n",
      "    orders AS o\n",
      "    JOIN orders_items AS oi USING(order_id)\n",
      "            -- ON o.order_id = oi.order_id\n",
      "    LEFT JOIN products AS p  USING(product_id)\n",
      "           -- ON oi.product_id = p.product_id\n",
      "    LEFT JOIN countries c\n",
      "            ON o.order_shipping_country = c.country_code\n",
      "WHERE\n",
      "    o.order_time >= Unix_timestamp(Convert_tz('2013-01-01','US/Pacific', 'UTC'))\n",
      "    AND o.order_time < Unix_timestamp(\n",
      "                                Convert_tz(\n",
      "                                Date(Convert_tz(now(), 'UTC', 'US/Pacific')),\n",
      "                                'US/Pacific', 'UTC')\n",
      "                                )\n",
      "    AND o.order_type IN (0, 12) -- 0    TYPE_STANDARD, 12    TYPE_SPLIT_PARENT\n",
      "    AND o.order_status NOT IN (0, 11, 12) -- 0 PENDING,11 CANCELLED,12 RETURNED\n",
      "GROUP BY 1,2,3,4,5`\n",
      "\n",
      "\n",
      "\n",
      "The data file is `product_country.sql.csv.bz2`, in the directory you cloned `miki/talks/pandas` into, or download it from [Amazon S3](http://dm-miki.s3.amazonaws.com/talks/product_country.sql.csv.bz2).  \n",
      "\n",
      "Use [read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) to load the data file to a DataFrame called `df2`.  Set the index columns to `date`, `product`, `country`, `price`, and `cost`, and specify that `date` is to be parsed into proper datetime date values.  Don't forget to specify that the data file is bzip2-compressed."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# You code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# First Look at the Data\n",
      "\n",
      "## `df2` as a Value\n",
      "Since we're in IPython, we can just evaluate `df2` to see what it is."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# You code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## `df2` as Data\n",
      "Use the [head](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.head.html) method of the `df2` DataFrame to see some actual data."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# You code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Plot the Revenue Series by Day\n",
      "\n",
      "## First, summarize the data\n",
      "\n",
      "Take the `sum()` of revenue per day by using the [groupby](http://pandas.pydata.org/pandas-docs/dev/groupby.html) method on the `df2` Data Frame, to make a new Data Frame named `df3`."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Your code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Now, plot the data\n",
      "\n",
      "Find the total revenue per day by calling the [plot](http://pandas.pydata.org/pandas-docs/stable/visualization.html) method on the `df3.rev` Series."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Your code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#Plot Gross Margin (gm) by Product by Date\n",
      "##First summarize the data\n",
      "`sum()` the data to the level of `date` and `product` using the [groupby](http://pandas.pydata.org/pandas-docs/dev/groupby.html) method on the `df2` Data Frame."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Your code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Put the `product` names into the column headings\n",
      "Use the [unstack](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.unstack.html) method of the `product_by_date` Data Frame to move the product names from the row labels into the column headings."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Your code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Keep just the Gross Margin\n",
      "Set the `product_by_date` DataFrame to just the subset of columns that hold gross margin (`gm`)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Your code goes here                          "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Find the top 3 products\n",
      "There are a lot of products.  Use the `sum()` method of the `product_by_date` DataFrame to find total gross margin by product."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Your code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Keep just the top 3 products\n",
      "Update the `product_by_date` DataFrame by setting it to just a subset of the columns, specified as a list of column names."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Your code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Plot the gross margin for the top 3 products\n",
      "Use the [plot](http://pandas.pydata.org/pandas-docs/dev/visualization.html) method of the `product_by_date` DataFrame"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Your code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#If there's time\n",
      "##Pricing policies\n",
      "What is Society6's apparent pricing policy?  Is gross margin invariably a fixed share of revenue, or do, say, different products have different relative gross margins?  \n",
      "\n",
      "To take a first look, let's recreate the full `product_by_date` DataFrame from above"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Your code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now, using [matplotlib.pyplot](http://matplotlib.org/api/pyplot_api.html)'s [scatter](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.scatter) method, let's create a scatterplot of revenue vs. gross margin for each combination of day and product."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Your code goes here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 13
    }
   ],
   "metadata": {}
  }
 ]
}