HTTPS SSH

Consolidating a Rich Instructional System: Combining Educational Computation with Killer Analysis (CRIS: CECKA)

Team

Data

Data was obtained from the MySQL databases powering CS50's in-house discussion forums. Because MySQL is not running on Resonance nodes, MySQL data was converted to portable SQLite database files that could be accessed on compute nodes. These files were generated using the included mysql_to_sqlite* scripts and have been included in the submission, so these scripts need not be run again. The various Python files in the project output SQLite databases or JSON, which have also been included in the submission.

Graph Generation

The first component of our project is graph construction. generate_graph_serial.py contains a serial implementation of graph construction, and can be run with python generate_graph_serial.py [users|words] db. An argument of "users" will generate a social network graph, in which nodes represent users and edges represent interactions between users, while an argument of "words" will generate a word graph, in which nodes represent words and edges denote that two words were used within the same post or reply. The second argument is the SQLite file to use. The output of graph generation scripts is JSON.

cd into the graph folder:

cd graph

To generate the social network graph for all CS50 data:

python generate_graph_serial.py users ../output/discuss50

To generate the word graph for all CS50 data:

python generate_graph_serial.py words ../output/discuss50

The script also optionally takes threshold arguments, whereby an edge will be added to the graph if and only if its weight is within the given threshold. For example, to generate a graph with edges with weights only in [10, 100], the following can be run:

python generate_graph_serial.py users ../output/discuss50 10 100

MPI

generate_graph_mpi.py contains a parallel implementation using MPI. Its usage is similar to that of the serial implementation. In all cases, the number of processes must be a power of 2.

To generate the social network graph for all CS50 data:

mpirun -n 8 python generate_graph_mpi.py users ../output/discuss50

To generate the word graph for all CS50 data:

mpirun -n 8 python generate_graph_mpi.py words ../output/discuss50

To generate a graph with edges with weights only in [10, 100], the following can be run:

mpirun -n 8 python generate_graph_mpi.py users ../output/discuss50 10 100

MapReduce

Finally, we also implemented graph generation using MapReduce.

To generate the social network graph, we must first write the database to a text file to be processed by MapReduce. To do this, execute:

python print_threads_mr.py users ../output/discuss50 ../output/discuss50_raw.txt

using the argument 'users' to generate the social network graph and 'words' to generate the word graph. Next, run MapReduce as follows:

python generate_graph_mr.py < ../output/discuss50_raw.txt > ../output/discuss50_connections.txt

The resulting graph (either social network or word connections, depending on the argument to print_threads_mr.py) will be saved to discuss50_connections.txt

All-Pairs Shortest Paths

Next, we used Dijkstra's algorithm to compute the shortest paths between all nodes in the graph. Before these scripts can be run, a graph construction script must be run and saved as a JSON file. For example, after running:

mpirun -n 8 python generate_graph_mpi.py users ../output/discuss50 > ../output/network.json

Code for this section is in the paths folder in the root of the project:

cd ../paths

The following can be run to compute all shortest paths and save them in a SQLite database called paths.db:

python path_serial.py ../output/network.json sql ../output/paths.db

The paths can also be saved as JSON via:

python path_serial.py ../output/network.json json

MPI

path_mpi.py contains a parallel implementation using MPI. The following can be run to compute all shortest paths and save them in a SQLite database called paths.db:

mpirun -n 8 python path_mpi.py ../output/network.json paths sql ../output/paths.db

Rather than outputting all nodes in the path, the following can be run to output only the lengths of those paths (which significantly reduces the size of the output):

mpirun -n 8 python path_mpi.py ../output/network.json weights json

Sentence Predictions

We also generate a second word graph with which we can construct "sentences" based on what words most commonly follow other words. Unlike the previous graph, which contains a weighted edge for every pair of words that appear in the same thread, this graph contains directional edges between words that appear in succession. For example, the sentence "hello world" would yield a graph with a directional edge from node 'hello' to 'world'.

The code for this section is in the sentences folder:

cd ../sentences

MapReduce

To generate this graph using MapReduce, we must first write the database to a file to be read by MapReduce, as follows (from the graph directory):

python print_threads_mr.py words ../output/discuss50 ../output/discuss50_raw.txt

Next, run MapReduce:

python word_succession_parallel.py < ../output/discuss50_raw.txt > ../output/discuss50_successors.txt

The resulting graph is written by MapReduce to discuss50_connections.txt. Finally, to parse the results and write the graph to a database, run (from the scripts directory):

python mr_output_to_db.py ../output/discuss50_successors.txt ../output/sentence_prediction.db

The saved database contains a table of every word, each with an ordered list of the 10 words that most commonly follow it.

MPI

We can also generate this sentence database using MPI, as follows:

mpirun -n 8 python sentence_predictor_mpi.py ../output/discuss50 ../output/discuss50_connections.db

EdgeRank

Companies like Google and Facebook utilize social network graphs to quantify relationships between users. For example, Facebook's News Feed takes into account the closeness of two friends, and GMail's suggested contacts takes into account how users email each other. Using ideas from both Google's and Facebook's (public) algorithms, we have created our own EdgeRank algorithm for our data set.

We fundamentally have two algorithms - one to calculate the "affinity" between two users, and another to calculate the importance of an action for a given user. We calculate the affinity between two users as the sum of all action-scores for actions that occur between them. The action score is computed as the weight of the action times a time decay factor. Put more simply, some actions carry more weight (replying to a post carries more weight than viewing a post) and recent actions carry more weight than past actions. Summing this score over all actions allows us to calculate a "friendship score" between two users.

To compute a user's newsfeed, we need to ascribe scores to the many actions that occur within Discuss. We have defined the the importance score for a newsfeed item as the product of the affinity scores for both users with the current user, along with the weight of the action, and a time decay factor. For the newsfeed, we bolster the importance of the time decay factor, as a newsfeed should have primarily recent items.

Code for this section is in the edgerank folder:

cd ../edgerank

MPI

In order to calculate the friendship scores across all pairs of friends, we recognize that each action's contribution to a friendship score can be computed independently. Thus, we instruct each MPI process to pull from the database its own sliver of data, over which it can run the computation described above. At the conclusion of this process, each MPI process has a partial frienship score for (potentially) each pair of users. In order to figure out the combined score for each pair, we need to consolidate the scores back into one process. We do that using a tree-based reduction kernel, so that we do as many of the additions as possible in paralell. We implemented this functionality in edgescore_serial.py and edgescore_mpi.py, which can be run as follows:

python edgescore_serial.py ../output/discuss50 > ../output/edgescores.json
mpirun -n 4 python edgescore_mpi.py ../output/discuss50 > ../output/edgescores_p.json

We also wrote an equality checker to confirm that our serial and parallel code output the same result. Due to the time decay factor, we ensure that all the keys are the same, and the values are within a threshold of .000001. To run the checker, the following commands should be run:

cd ../scripts
python test_edgescore_json.py ../output/edgescores.json ../output/edgescores_p.json

To calculate the newsfeed, we use these friendship scores in addition to the action weights and time decay factors, as described above. This time, there is no need to consolidate the scores, as each action receives only one score. Thus, we again use MPI to poll different sections of the database, but we gather back to the root process, instead of the logarithmic reduction kernel. We implemented this functionality in newsfeed_serial.py and newsfeed_mpi.py, which can be run with the following commands:

cd ../edgerank
python newsfeed_serial.py ../output/edgescores.json ../output/discuss50 600
mpirun -n 4 python newsfeed_mpi.py ../output/edgescores.json ../output/discuss50 600

Where edgescores.json is a friendship score system described above, discuss50.db is a SQLite database, and 600 is the user id of the user for whom we want to create a newsfeed.

Visualizations

To explore the results of our computations, we have created several interactive web pages.

Social Network

network.html contains a visualization of the social network graph saved in network.json. The graph is initially laid out as a circle, but a randomized layout can also be selected. By hovering over a node in the graph, the user can view all of the nodes directly connected to that node, as well as the user represented by that node. The same effect can be accomplished via the filtering dropdown to the right of the graph. The slider at top-right filters the graph by hiding those nodes for which the sum of all edges falls below the given threshold. Finally, users can view the shortest path between any two nodes in the graph via the final two dropdown menus. The JavaScript for this page can be found in network.js, and Ajax requests are made to paths.php, which uses the SQLite database generated by Python to output paths.

Sentence Prediction

sentences.html is an interactive web page that allows users to explore auto-suggested sentences. After typing a word, selecting "Next word" will display a list of words that commonly follow the inputted word, and clicking a suggestion will repeat the process for that word. So, by repeatedly choosing words from the suggestion list, a sentence will be generated. Clicking "I'm feeling lucky" will automate this process for a fixed number of words but repeatedly choosing a random word from the suggestion list. The JavaScript for this page can be found in sentences.js, and Ajax requests are made to sentences.php, which uses the SQLite database generated by Python to output suggested words.

News Feed

newsfeed.html is a simple, personalized "news feed" for users. Using the results of the EdgeRank algorithm, the most relevant "stories" (e.g., new posts, new replies, new watches) are computed for each user. A user's news feed can be viewed by selecting their name from the dropdown. The web page currently computes the news feed in serial, in order to make the project as easy to set up as possible (i.e., without the need to install MPI). If users are not particularly active on the discussion board, then their news feeds may be short or blank.

Suggested Friends

suggestions.html allows users to browse their "suggested friends," a feature found on social networks like Facebook and Twitter. A user's suggested friends are defined by the neighboring nodes with the highest affinity scores computed by our EdgeRank algorithm. When a user is chosen from the page's dropdown menu, the top suggested friends will be displayed.