Explanation

Issue #1 new
Sandeep Srinivasa created an issue

There are rows in the postgres database that point to CSV files in S3. These S3 files are generated every hour. Each row in a file represents a connection between two people, and each person is identified by a phone number.

Now, the connection between two people may be repeated in subsequent CSV files, but we create a new edge every time it appears. This in some way "reinforces"/"strengthens" the connection between the two people. We annotate each edge with its date. This way we can ask questions like: what are all the connections between DATE1 and DATE2?
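For illustration, a minimal PySpark sketch of this edge model (the column names src, dst, seen_on are assumptions, not the final schema): every occurrence of a pair becomes its own row, and the date annotation makes the DATE1/DATE2 question a simple filter.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("edge-window").getOrCreate()

    # assumed columns: src/dst phone numbers, seen_on = date of the source CSV
    edges = spark.createDataFrame(
        [("9812345678", "9876543210", "2017-01-01"),
         ("9812345678", "9876543210", "2017-01-08")],  # repeated pair = new edge
        ["src", "dst", "seen_on"])

    # "what are all the connections between DATE1 and DATE2?"
    edges.filter(F.col("seen_on").between("2017-01-01", "2017-01-07")).show()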

Reading the CSV files needs some ETL cleanup, because phone numbers in India can come prefixed with +91, 0091, 0, etc., and the files can arrive in different encodings (UTF-8, etc.).
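A minimal sketch of how that cleanup could look in PySpark; the prefix rules below are an assumption for illustration, the authoritative rules are in the existing R code.

    from pyspark.sql import functions as F

    # assumes a running SparkSession `spark`, as in the sketch above
    def normalize_phone(col):
        # keep digits only (this drops the "+"), then strip a 0091/91/0
        # prefix when the remainder is a plain 10-digit Indian number
        digits = F.regexp_replace(col, "[^0-9]", "")
        return F.regexp_replace(digits, "^(0091|91|0)(?=[0-9]{10}$)", "")

    # encoding differences can be handled at read time, e.g.
    # spark.read.option("encoding", "UTF-8").csv(path)
    raw = spark.createDataFrame([("+919812345678", "09876543210")], ["src", "dst"])
    clean = (raw.withColumn("src", normalize_phone(F.col("src")))
                .withColumn("dst", normalize_phone(F.col("dst"))))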

This graph then needs to be loaded into Neo4j for some OLTP queries.
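One possible way to push edges into Neo4j, sketched with the official neo4j Python driver; the node label, property names and GrapheneDB URL are placeholders.

    from neo4j import GraphDatabase

    # placeholder GrapheneDB bolt URL and credentials
    driver = GraphDatabase.driver("bolt://hobby-xxxx.dbs.graphenedb.com:24786",
                                  auth=("user", "password"))

    def push_edges(rows):
        # MERGE keeps one node per phone number; CREATE keeps one relationship
        # per CSV occurrence, so repeats "reinforce" the connection
        with driver.session() as session:
            for r in rows:
                session.run(
                    "MERGE (a:Person {phone: $src}) "
                    "MERGE (b:Person {phone: $dst}) "
                    "CREATE (a)-[:CONNECTED {seen_on: $seen_on}]->(b)",
                    src=r["src"], dst=r["dst"], seen_on=r["seen_on"])

At real volumes a batched UNWIND query would be preferable to one round trip per edge.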

Scope of project

  • We have implemented this in R already. You will be given that code to see how the ETL is supposed to be done and how the graph needs to be built. The scope will not change on that front.
  • Code should run on Amazon EMR or on a local single-node Spark setup.
  • PySpark with DataFrames/Datasets. We would prefer NO RDD code. We would prefer the latest Spark and Python 2.X.
  • For development you can use local Spark. You can get a free Neo4j server at https://elements.heroku.com/addons/graphenedb to test it out. Our production Neo4j will also use GrapheneDB.
  • The entry point of the Spark job should be able to work with the following (see the argparse sketch after this list):
    • a single CSV file URL
    • read from the database to get CSV file URLs - date from and date to
    • read from the database to get CSV file URLs - a limit on the number of files (from the beginning of the database)
    • read from the database to get CSV file URLs - the list of new files (which have not been processed yet). This is a little more involved, so please read below.
  • You cannot move, delete, edit or change the PostgreSQL database or the S3 files. You will get read-only access to both.
  • You will write a Flask API (we currently use rApache for this) that will expose a few functions as APIs. These functions will run some queries on Neo4j and return the results. You can see examples here. All the Cypher queries are already available; we know some of them are suboptimal and need to be fixed. A sketch of one such endpoint also follows below.
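A sketch of what a single entry point covering the four modes could look like; the flag names are assumptions, not a spec.

    import argparse

    parser = argparse.ArgumentParser(description="build graph from CSV edge files")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--csv-url", help="process a single CSV file URL")
    group.add_argument("--date-range", nargs=2, metavar=("FROM", "TO"),
                       help="look up CSV URLs in the database between two dates")
    group.add_argument("--limit", type=int,
                       help="first N files from the beginning of the database")
    group.add_argument("--new-only", action="store_true",
                       help="only files that have not been processed yet")
    args = parser.parse_args()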
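And a minimal sketch of one Flask endpoint in the style of the rApache APIs; the route and query are placeholders, since the real Cypher queries already exist.

    from flask import Flask, jsonify
    from neo4j import GraphDatabase

    app = Flask(__name__)
    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    @app.route("/connections/<phone>")
    def connections(phone):
        # placeholder query; the production Cypher is already written
        with driver.session() as session:
            result = session.run(
                "MATCH (a:Person {phone: $phone})-[:CONNECTED]-(b) "
                "RETURN DISTINCT b.phone AS phone", phone=phone)
            return jsonify(results=[r["phone"] for r in result])

    if __name__ == "__main__":
        app.run()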

Task not in the existing R code: each run of the Spark job will not only create new nodes and edges, but also run PageRank on the whole graph. This means you will have to load the whole graph into Spark. Since each Spark run is incremental, we do not want to re-process all the S3 files every single time. So after you finish processing a batch of S3 CSV files, you will save the whole graph as a Parquet file and write it to S3. The next time you get new CSV files, you will first create the new edges and nodes, load the LAST Parquet graph, add the new nodes and edges, and then save it again as a new Parquet file. The code should also accommodate the situation where the server dies: let's say you are processing 1000 new files and the job gets killed at 700; you should resume from 701.
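A sketch of one incremental cycle with GraphFrames; the Parquet paths, column names and the stand-in new_v/new_e batch are all assumptions.

    from graphframes import GraphFrame

    # assumes a running SparkSession `spark`
    V_PATH = "s3://bucket/graph/vertices"   # placeholder snapshot locations
    E_PATH = "s3://bucket/graph/edges"

    # stand-ins for the vertices/edges parsed from the current CSV batch
    new_v = spark.createDataFrame([("9812345678",)], ["id"])
    new_e = spark.createDataFrame(
        [("9812345678", "9876543210", "2017-01-01")], ["src", "dst", "seen_on"])

    # load the LAST saved snapshot and append the new batch
    all_v = spark.read.parquet(V_PATH).union(new_v).dropDuplicates(["id"])
    all_e = spark.read.parquet(E_PATH).union(new_e)   # repeated edges are kept

    # PageRank over the whole graph, not just the new batch
    g = GraphFrame(all_v, all_e)
    ranked = g.pageRank(resetProbability=0.15, maxIter=10)

    # write to a fresh path so a crash mid-run never corrupts the last snapshot
    ranked.vertices.write.parquet(V_PATH + "-next")
    all_e.write.parquet(E_PATH + "-next")

Marking each file as processed only after its batch reaches a snapshot (for example in a small manifest keyed by file URL) is what gives the resume-from-701 behaviour.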

NOTE: it is our assumption that the graph in Neo4j and in Parquet will be exactly the same.

Comments (4)

  1. Sandeep Srinivasa reporter

    sample milestones (you can propose your own)

    milestone 1:

    • setup of dev environment on local machine
    • (from R code) reading the PostgreSQL database to get the CSV file locations (which are in S3); see the JDBC sketch after this milestone
    • (from R code) first parsing of a CSV file in Spark (using DataFrames/GraphFrames) into graph format
    • saving the Spark graph format as Parquet
    • checking that the output can be loaded into Neo4j
    • running the same on Amazon EMR
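
    For the PostgreSQL read above, a hedged sketch using Spark's JDBC source; the table and column names are placeholders.

        # assumes a running SparkSession `spark` and the postgres JDBC
        # driver on the classpath
        urls = (spark.read.format("jdbc")
                .option("url", "jdbc:postgresql://host:5432/dbname")
                .option("dbtable", "csv_files")          # placeholder table
                .option("user", "readonly")
                .option("password", "secret")
                .load()
                .select("s3_url", "generated_at"))       # placeholder columns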

    milestone 2:

    • (from R code) porting the ETL cleanup
    • (from R code) making sure the graph creation matches the original R code (for example multiple edges, different types of nodes, etc.)

    milestone 3:

    • making 4 different types of Spark jobs (as given in "Explanation"):
      • a single CSV file URL
      • date from and date to
      • a limit on the number of files (from the beginning of the database)
      • the list of new files (which have not been processed yet)
    • creation of an AWS Lambda job to run the code periodically
    • making sure the full graph in Neo4j is being built

    milestone 4:
