Consolidating a Rich Instructional System: Combining Educational Computation with Killer Analysis (CRIS: CECKA)
- Jacob Pritt email@example.com
- R.J. Aquino firstname.lastname@example.org
- Tommy MacWilliam email@example.com
Data was obtained from the MySQL databases powering CS50's in-house discussion forums. Because MySQL is not running on Resonance nodes, MySQL data was converted to portable SQLite database files that could be accessed on compute nodes. These files were generated using the included mysql_to_sqlite* scripts and have been included in the submission, so these scripts need not be run again. The various Python files in the project output SQLite databases or JSON, which have also been included in the submission.
The first component of our project is graph construction. generate_graph_serial.py contains a serial implementation of graph construction, and can be run with python generate_graph_serial.py [users|words] db. An argument of "users" will generate a social network graph, in which nodes represent users and edges represent interactions between users, while an argument of "words" will generate a word graph, in which nodes represent words and edges denote that two words were used within the same post or reply. The second argument is the SQLite file to use. The output of graph generation scripts is JSON.
cd into the graph folder:
To generate the social network graph for all CS50 data:
python generate_graph_serial.py users ../output/discuss50
To generate the word graph for all CS50 data:
python generate_graph_serial.py words ../output/discuss50
The script also optionally takes threshold arguments, whereby an edge will be added to the graph if and only if its weight is within the given threshold. For example, to generate a graph with edges with weights only in [10, 100], the following can be run:
python generate_graph_serial.py users ../output/discuss50 10 100
generate_graph_mpi.py contains a parallel implementation using MPI. Its usage is similar to that of the serial implementation. In all cases, the number of processes must be a power of 2.
To generate the social network graph for all CS50 data:
mpirun -n 8 python generate_graph_mpi.py users ../output/discuss50
To generate the word graph for all CS50 data:
mpirun -n 8 python generate_graph_mpi.py words ../output/discuss50
To generate a graph with edges with weights only in [10, 100], the following can be run:
mpirun -n 8 python generate_graph_mpi.py users ../output/discuss50 10 100
Finally, we also implemented graph generation using MapReduce.
To generate the social network graph, we must first write the database to a text file to be processed by MapReduce. To do this, execute:
python print_threads_mr.py users ../output/discuss50 ../output/discuss50_raw.txt
using the argument 'users' to generate the social network graph and 'words' to generate the word graph. Next, run MapReduce as follows:
python generate_graph_mr.py < ../output/discuss50_raw.txt > ../output/discuss50_connections.txt
The resulting graph (either social network or word connections, depending on the argument to print_threads_mr.py) will be saved to discuss50_connections.txt
All-Pairs Shortest Paths
Next, we used Dijkstra's algorithm to compute the shortest paths between all nodes in the graph. Before these scripts can be run, a graph construction script must be run and saved as a JSON file. For example, after running:
mpirun -n 8 python generate_graph_mpi.py users ../output/discuss50 > ../output/network.json
Code for this section is in the paths folder in the root of the project:
The following can be run to compute all shortest paths and save them in a SQLite database called paths.db:
python path_serial.py ../output/network.json sql ../output/paths.db
The paths can also be saved as JSON via:
python path_serial.py ../output/network.json json
path_mpi.py contains a parallel implementation using MPI. The following can be run to compute all shortest paths and save them in a SQLite database called paths.db:
mpirun -n 8 python path_mpi.py ../output/network.json paths sql ../output/paths.db
Rather than outputting all nodes in the path, the following can be run to output only the lengths of those paths (which significantly reduces the size of the output):
mpirun -n 8 python path_mpi.py ../output/network.json weights json
We also generate a second word graph with which we can construct "sentences" based on what words most commonly follow other words. Unlike the previous graph, which contains a weighted edge for every pair of words that appear in the same thread, this graph contains directional edges between words that appear in succession. For example, the sentence "hello world" would yield a graph with a directional edge from node 'hello' to 'world'.
The code for this section is in the sentences folder:
To generate this graph using MapReduce, we must first write the database to a file to be read by MapReduce, as follows (from the graph directory):
python print_threads_mr.py words ../output/discuss50 ../output/discuss50_raw.txt
Next, run MapReduce:
python word_succession_parallel.py < ../output/discuss50_raw.txt > ../output/discuss50_successors.txt
The resulting graph is written by MapReduce to discuss50_connections.txt. Finally, to parse the results and write the graph to a database, run (from the scripts directory):
python mr_output_to_db.py ../output/discuss50_successors.txt ../output/sentence_prediction.db
The saved database contains a table of every word, each with an ordered list of the 10 words that most commonly follow it.
We can also generate this sentence database using MPI, as follows:
mpirun -n 8 python sentence_predictor_mpi.py ../output/discuss50 ../output/discuss50_connections.db
Companies like Google and Facebook utilize social network graphs to quantify relationships between users. For example, Facebook's News Feed takes into account the closeness of two friends, and GMail's suggested contacts takes into account how users email each other. Using ideas from both Google's and Facebook's (public) algorithms, we have created our own EdgeRank algorithm for our data set.
We fundamentally have two algorithms - one to calculate the "affinity" between two users, and another to calculate the importance of an action for a given user. We calculate the affinity between two users as the sum of all action-scores for actions that occur between them. The action score is computed as the weight of the action times a time decay factor. Put more simply, some actions carry more weight (replying to a post carries more weight than viewing a post) and recent actions carry more weight than past actions. Summing this score over all actions allows us to calculate a "friendship score" between two users.
To compute a user's newsfeed, we need to ascribe scores to the many actions that occur within Discuss. We have defined the the importance score for a newsfeed item as the product of the affinity scores for both users with the current user, along with the weight of the action, and a time decay factor. For the newsfeed, we bolster the importance of the time decay factor, as a newsfeed should have primarily recent items.
Code for this section is in the edgerank folder:
In order to calculate the friendship scores across all pairs of friends, we recognize that each action's contribution to a friendship score can be computed independently. Thus, we instruct each MPI process to pull from the database its own sliver of data, over which it can run the computation described above. At the conclusion of this process, each MPI process has a partial frienship score for (potentially) each pair of users. In order to figure out the combined score for each pair, we need to consolidate the scores back into one process. We do that using a tree-based reduction kernel, so that we do as many of the additions as possible in paralell. We implemented this functionality in edgescore_serial.py and edgescore_mpi.py, which can be run as follows:
python edgescore_serial.py ../output/discuss50 > ../output/edgescores.json mpirun -n 4 python edgescore_mpi.py ../output/discuss50 > ../output/edgescores_p.json
We also wrote an equality checker to confirm that our serial and parallel code output the same result. Due to the time decay factor, we ensure that all the keys are the same, and the values are within a threshold of .000001. To run the checker, the following commands should be run:
cd ../scripts python test_edgescore_json.py ../output/edgescores.json ../output/edgescores_p.json
To calculate the newsfeed, we use these friendship scores in addition to the action weights and time decay factors, as described above. This time, there is no need to consolidate the scores, as each action receives only one score. Thus, we again use MPI to poll different sections of the database, but we gather back to the root process, instead of the logarithmic reduction kernel. We implemented this functionality in newsfeed_serial.py and newsfeed_mpi.py, which can be run with the following commands:
cd ../edgerank python newsfeed_serial.py ../output/edgescores.json ../output/discuss50 600 mpirun -n 4 python newsfeed_mpi.py ../output/edgescores.json ../output/discuss50 600
Where edgescores.json is a friendship score system described above, discuss50.db is a SQLite database, and 600 is the user id of the user for whom we want to create a newsfeed.
To explore the results of our computations, we have created several interactive web pages.
newsfeed.html is a simple, personalized "news feed" for users. Using the results of the EdgeRank algorithm, the most relevant "stories" (e.g., new posts, new replies, new watches) are computed for each user. A user's news feed can be viewed by selecting their name from the dropdown. The web page currently computes the news feed in serial, in order to make the project as easy to set up as possible (i.e., without the need to install MPI). If users are not particularly active on the discussion board, then their news feeds may be short or blank.
suggestions.html allows users to browse their "suggested friends," a feature found on social networks like Facebook and Twitter. A user's suggested friends are defined by the neighboring nodes with the highest affinity scores computed by our EdgeRank algorithm. When a user is chosen from the page's dropdown menu, the top suggested friends will be displayed.