1. Kevin Burton
  2. peregrine

peregrine / PRESENTATION

- Influenced by:
    - Cassandra
    - GFS
    - MapReduceMerge
    - Twiter
    - Pig 



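- The block replication scheme in Hadoop spreads the replicas of each block
  across many different machine combinations, so there are MORE distinct
  3-machine groups that together hold every copy of some block. That means
  Hadoop has a HIGHER probability that a small amount of data will die, vs a
  LOW probability that a large amount of data will vanish in this system.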
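
A rough way to see the trade-off is to count replica groups. The sketch below is
an illustration only, under simplifying assumptions that are not from these
notes: exactly `replicas` machines fail at once, chosen uniformly at random, and
random block placement is modelled as "every possible 3-machine group holds some
block". The function name `loss_probability` is hypothetical.

```python
from math import comb

def loss_probability(num_machines, num_replica_groups, replicas=3):
    """Chance that `replicas` simultaneous random failures hit one full replica group.

    Assumes every replica group is a distinct set of `replicas` machines and
    that the failed machines are chosen uniformly at random.
    """
    possible_groups = comb(num_machines, replicas)
    return min(num_replica_groups, possible_groups) / possible_groups

machines = 30

# Random block placement with many blocks: assume every 3-machine group is in use.
random_placement = loss_probability(machines, comb(machines, 3))

# Fixed groups of 3 machines: only machines / 3 groups exist.
fixed_groups = loss_probability(machines, machines // 3)

print(f"random placement: {random_placement:.4f}")
print(f"fixed groups:     {fixed_groups:.4f}")
```

Under these assumptions random placement almost always loses something (but
only the few blocks on that group), while fixed groups rarely lose anything
(but lose a whole group's worth of data when they do).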

- SSD vs HDD

    - SSD drives are becoming commodity hardware but are still too expensive
      per GB

    - Seagate 2TB Barracuda XT 7200 RPM $151 ... $0.0755 per GB
    - Seagate 3TB Barracuda XT 7200 RPM $262 ... $0.0873 per GB

    - Intel 320 600GB $1,138 ... $1.90 per GB
    - Intel 320 300GB $549   ... $1.83 per GB
    - Intel 320 160GB $300   ... $1.88 per GB

    - However, SSD drives allow us to *in theory* do some interesting things. 
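
The per-GB figures above can be reproduced with a few lines of arithmetic. This
is only a sketch: the prices and capacities are copied from the list, capacities
are assumed to be decimal (1TB = 1000GB), and the names `DRIVES` and
`price_per_gb` are made up for the example.

```python
# (model, capacity in GB, price in USD), copied from the list above.
DRIVES = [
    ("Seagate 2TB Barracuda XT", 2000, 151),
    ("Seagate 3TB Barracuda XT", 3000, 262),
    ("Intel 320 600GB", 600, 1138),
    ("Intel 320 300GB", 300, 549),
    ("Intel 320 160GB", 160, 300),
]

def price_per_gb(capacity_gb, price_usd):
    """Price in USD for one GB of capacity."""
    return price_usd / capacity_gb

for model, capacity_gb, price_usd in DRIVES:
    print(f"{model}: ${price_per_gb(capacity_gb, price_usd):.4f} per GB")

# Compare the cheapest SSD per GB with the cheapest HDD per GB.
cheapest_hdd = min(price_per_gb(c, p) for m, c, p in DRIVES if m.startswith("Seagate"))
cheapest_ssd = min(price_per_gb(c, p) for m, c, p in DRIVES if m.startswith("Intel"))
ratio = cheapest_ssd / cheapest_hdd
print(f"SSD costs about {ratio:.0f}x more per GB")
```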

    

- If key routing is based on mod() we will have problems merging the
  resulting partitions back into sorted order during the shuffle/sort stage,
  even if the values are already sorted.

  Let's say we have the values

    0
    1
    2
    3
    4
    5
    6
    7
    8
    9

  If we use mod 2 as our router (partition = value mod 2) we're going to end
  up with two partitions:

    p0:

        0
        2
        4
        6
        8

    p1:

        1
        3
        5
        7
        9
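
As a rough illustration of mod-based routing (a sketch only; `mod_route` is a
hypothetical helper, not part of peregrine), the partitions come out
interleaved, so rebuilding the global order needs a real merge rather than a
plain copy:

```python
import heapq

def mod_route(values, nr_partitions):
    """Route each value to partition (value % nr_partitions)."""
    partitions = [[] for _ in range(nr_partitions)]
    for value in values:
        partitions[value % nr_partitions].append(value)
    return partitions

partitions = mod_route(range(10), 2)

# Each partition is sorted on its own, but their key ranges interleave,
# so simply concatenating them does not give a sorted result.
concatenated = [v for p in partitions for v in p]

# A k-way merge is needed to restore the global order.
merged = list(heapq.merge(*partitions))

print(partitions)
print(concatenated)
print(merged)
```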


    Now that's FINE ... but what if we used lexicographic sharding of the
    ranges instead, so that each partition holds one contiguous range?  For
    example:

    p0:

        0
        1
        2
        3
        4

    p1:
        5
        6
        7
        8
        9

    Now... to 'sort' them, if the mapper() emits values that are already
    sorted, then I can simply create a new array where the size is:

        array_size = |p0| + |p1|

    then I can just copy p0 into this new array and then p1 and then I'm done.
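
The copy step described above can be sketched like this. It assumes each
partition is already sorted and that the partitions cover non-overlapping,
ascending ranges; `concat_partitions` is a hypothetical name used only for
illustration.

```python
def concat_partitions(partitions):
    """Copy sorted, non-overlapping partitions into one preallocated array."""
    # array_size = |p0| + |p1| + ...
    array_size = sum(len(p) for p in partitions)
    result = [None] * array_size

    offset = 0
    for partition in partitions:
        # Copy the whole partition into its slot; no comparisons are needed.
        result[offset:offset + len(partition)] = partition
        offset += len(partition)
    return result

p0 = [0, 1, 2, 3, 4]
p1 = [5, 6, 7, 8, 9]
combined = concat_partitions([p0, p1])
print(combined)
```

The work is a straight copy, linear in the total number of values, instead of a
comparison-based merge.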

    pagerank would NOT benefit from this design right now, though computing
    node_outdegree would...

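    For range-based routing, where nr_values is the size of the key range
    (10 in the example above):

    partition_size = ceil( nr_values / nr_partitions );
    partition      = floor( value / partition_size );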
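
A range-based router for the example values could be sketched as follows. This
assumes integer keys in a known range [0, nr_values); `range_route` and
`nr_values` are illustrative names, not peregrine APIs. Note that the divisor is
the size of each partition, not the number of partitions.

```python
def range_route(value, nr_values, nr_partitions):
    """Map a key in [0, nr_values) to a partition holding one contiguous range."""
    # Ceiling division, so the last partition absorbs any remainder.
    partition_size = -(-nr_values // nr_partitions)
    return value // partition_size

routes = [range_route(v, 10, 2) for v in range(10)]
print(routes)
```

Because the partition index never decreases as the key grows, partitions that
are sorted internally are also sorted relative to each other.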