Explanation

Issue #1 new
Sandeep Srinivasa created an issue

There are rows in the postgres database that point to CSV files in S3. These S3 files are generated every hour. Each row in a file represents a connection between two people, and each person is identified by a phone number.

Now, the connection between two people may be repeated in subsequent CSV files, but we create a new edge every time it appears. This in some way "reinforces"/"strengthens" the connection between the two people. We annotate each edge with its date. This way we can ask questions like: what are all the connections between DATE1 and DATE2?
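For illustration, a minimal PySpark sketch of this edge model (the column names src, dst, seen_on are assumptions, not the final schema): every occurrence of a pair becomes its own row, and the date annotation makes the DATE1/DATE2 question a simple filter.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("edge-window").getOrCreate()

    # assumed columns: src/dst phone numbers, seen_on = date of the source CSV
    edges = spark.createDataFrame(
        [("9812345678", "9876543210", "2017-01-01"),
         ("9812345678", "9876543210", "2017-01-08")],  # repeated pair = new edge
        ["src", "dst", "seen_on"])

    # "what are all the connections between DATE1 and DATE2?"
    edges.filter(F.col("seen_on").between("2017-01-01", "2017-01-07")).show()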

Reading the CSV files needs some ETL cleanup, because phone numbers in India can come prefixed with +91, 0091, 0, etc., and the files can arrive in different encodings (UTF-8, etc.).
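A minimal sketch of how that cleanup could look in PySpark; the prefix rules below are an assumption for illustration, the authoritative rules are in the existing R code.

    from pyspark.sql import functions as F

    # assumes a running SparkSession `spark`, as in the sketch above
    def normalize_phone(col):
        # keep digits only (this drops the "+"), then strip a 0091/91/0
        # prefix when the remainder is a plain 10-digit Indian number
        digits = F.regexp_replace(col, "[^0-9]", "")
        return F.regexp_replace(digits, "^(0091|91|0)(?=[0-9]{10}$)", "")

    # encoding differences can be handled at read time, e.g.
    # spark.read.option("encoding", "UTF-8").csv(path)
    raw = spark.createDataFrame([("+919812345678", "09876543210")], ["src", "dst"])
    clean = (raw.withColumn("src", normalize_phone(F.col("src")))
                .withColumn("dst", normalize_phone(F.col("dst"))))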

This graph then needs to be loaded into Neo4j for some OLTP queries.
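One possible way to push edges into Neo4j, sketched with the official neo4j Python driver; the node label, property names and GrapheneDB URL are placeholders.

    from neo4j import GraphDatabase

    # placeholder GrapheneDB bolt URL and credentials
    driver = GraphDatabase.driver("bolt://hobby-xxxx.dbs.graphenedb.com:24786",
                                  auth=("user", "password"))

    def push_edges(rows):
        # MERGE keeps one node per phone number; CREATE keeps one relationship
        # per CSV occurrence, so repeats "reinforce" the connection
        with driver.session() as session:
            for r in rows:
                session.run(
                    "MERGE (a:Person {phone: $src}) "
                    "MERGE (b:Person {phone: $dst}) "
                    "CREATE (a)-[:CONNECTED {seen_on: $seen_on}]->(b)",
                    src=r["src"], dst=r["dst"], seen_on=r["seen_on"])

At real volumes a batched UNWIND query would be preferable to one round trip per edge.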

Scope of project

  • We have implemented this in R already. You will be given that code to see how the ETL is supposed to be done and how the graph needs to be built. The scope will not change on that front.
  • Code should run on Amazon EMR or on a local single-node Spark setup.
  • PySpark with DataFrames/Datasets. We would prefer NO RDD code. We would prefer the latest Spark and Python 2.X.
  • For development you can use local Spark. You can get a free Neo4j server at https://elements.heroku.com/addons/graphenedb to test it out. Our production Neo4j will also use GrapheneDB.
  • The entry point of the Spark job should be able to work with the following (see the argparse sketch after this list):
    • a single CSV file URL
    • read from the database to get CSV file URLs - date from and date to
    • read from the database to get CSV file URLs - a limit on the number of files (from the beginning of the database)
    • read from the database to get CSV file URLs - the list of new files (which have not been processed yet). This is a little more involved, so please read below.
  • You cannot move, delete, edit or change the PostgreSQL database or the S3 files. You will get read-only access to both.
  • You will write a Flask API (we currently use rApache for this) that will expose a few functions as APIs. These functions will run some queries on Neo4j and return the results. You can see examples here. All the Cypher queries are already available; we know some of them are suboptimal and need to be fixed. A sketch of one such endpoint also follows below.
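A sketch of what a single entry point covering the four modes could look like; the flag names are assumptions, not a spec.

    import argparse

    parser = argparse.ArgumentParser(description="build graph from CSV edge files")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--csv-url", help="process a single CSV file URL")
    group.add_argument("--date-range", nargs=2, metavar=("FROM", "TO"),
                       help="look up CSV URLs in the database between two dates")
    group.add_argument("--limit", type=int,
                       help="first N files from the beginning of the database")
    group.add_argument("--new-only", action="store_true",
                       help="only files that have not been processed yet")
    args = parser.parse_args()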
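And a minimal sketch of one Flask endpoint in the style of the rApache APIs; the route and query are placeholders, since the real Cypher queries already exist.

    from flask import Flask, jsonify
    from neo4j import GraphDatabase

    app = Flask(__name__)
    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    @app.route("/connections/<phone>")
    def connections(phone):
        # placeholder query; the production Cypher is already written
        with driver.session() as session:
            result = session.run(
                "MATCH (a:Person {phone: $phone})-[:CONNECTED]-(b) "
                "RETURN DISTINCT b.phone AS phone", phone=phone)
            return jsonify(results=[r["phone"] for r in result])

    if __name__ == "__main__":
        app.run()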

Task not in the existing R code: each run of the Spark job will not only create new nodes and edges, but also run PageRank on the whole graph. This means you will have to load the whole graph into Spark. Since each Spark run is incremental, we do not want to re-process all the S3 files every single time. So after you finish processing a batch of S3 CSV files, you will save the whole graph as a Parquet file and write it to S3. The next time you get new CSV files, you will first create the new edges and nodes, load the LAST Parquet graph, add the new nodes and edges, and then save it again as a new Parquet file. The code should also accommodate the situation where the server dies: let's say you are processing 1000 new files and the job gets killed at 700; you should resume from 701.
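A sketch of one incremental cycle with GraphFrames; the Parquet paths, column names and the stand-in new_v/new_e batch are all assumptions.

    from graphframes import GraphFrame

    # assumes a running SparkSession `spark`
    V_PATH = "s3://bucket/graph/vertices"   # placeholder snapshot locations
    E_PATH = "s3://bucket/graph/edges"

    # stand-ins for the vertices/edges parsed from the current CSV batch
    new_v = spark.createDataFrame([("9812345678",)], ["id"])
    new_e = spark.createDataFrame(
        [("9812345678", "9876543210", "2017-01-01")], ["src", "dst", "seen_on"])

    # load the LAST saved snapshot and append the new batch
    all_v = spark.read.parquet(V_PATH).union(new_v).dropDuplicates(["id"])
    all_e = spark.read.parquet(E_PATH).union(new_e)   # repeated edges are kept

    # PageRank over the whole graph, not just the new batch
    g = GraphFrame(all_v, all_e)
    ranked = g.pageRank(resetProbability=0.15, maxIter=10)

    # write to a fresh path so a crash mid-run never corrupts the last snapshot
    ranked.vertices.write.parquet(V_PATH + "-next")
    all_e.write.parquet(E_PATH + "-next")

Marking each file as processed only after its batch reaches a snapshot (for example in a small manifest keyed by file URL) is what gives the resume-from-701 behaviour.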

NOTE: it is our assumption that the graph in Neo4j and in Parquet will be exactly the same.

Comments (4)

  1. Sandeep Srinivasa reporter

    sample milestones (you can propose your own)

    milestone 1:

    • setup of dev environment on local machine
    • (from R code) reading the PostgreSQL database to get the CSV file locations (which are in S3); see the JDBC sketch after this milestone
    • (from R code) first parsing of a CSV file in Spark (using DataFrames/GraphFrames) into graph format
    • saving the Spark graph format as Parquet
    • checking that the output can be loaded into Neo4j
    • running the same on Amazon EMR
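
    For the PostgreSQL read above, a hedged sketch using Spark's JDBC source; the table and column names are placeholders.

        # assumes a running SparkSession `spark` and the postgres JDBC
        # driver on the classpath
        urls = (spark.read.format("jdbc")
                .option("url", "jdbc:postgresql://host:5432/dbname")
                .option("dbtable", "csv_files")          # placeholder table
                .option("user", "readonly")
                .option("password", "secret")
                .load()
                .select("s3_url", "generated_at"))       # placeholder columns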

    milestone 2:

    • (from R code) porting the ETL cleanup
    • (from R code) making sure the graph creation matches the original R code (for example multiple edges, different types of nodes, etc.)

    milestone 3:

    • making 4 different types of Spark jobs (as given in "Explanation"):
      • a single CSV file URL
      • date from and date to
      • a limit on the number of files (from the beginning of the database)
      • the list of new files (which have not been processed yet)
    • creation of an AWS Lambda job to run the code periodically
    • making sure the full graph in Neo4j is being built

    milestone 4:
