News for "Fundamentals of Digital Archeology"

!!Please note the change in final exam dates!!

[Dec 9]

  • Reports for the final project are due

[Dec 6 Tomatohead (downtown) class lunch]

  • 12-1:30 12 Market Square, Knoxville, TN 37902
  • Please email if you can make it

[Dec 5 MK405 2:45-4:45]

  • Alexa Top 500 Websites
  • Airline delays
  • ChatBot
  • Predicting Website Trends
  • Shodan
  • Exoplanets
  • S&P 500
  • Course selection data

[Nov 30]

  • Presentation by the Knuckleball team
  • Presentation by the JTV Product Review Mining team
  • Presentation by the Reddit Sentiment analysis team

[Nov 28]

  • Presentation by the Top 500 HPC team
  • Presentation by the Tractor sales team

[Nov 23]

  • Presentation by the CVE team

[Nov 21-23, Nov 28-30, and during the final exam on Dec 5, MK405 2:45-4:45]

  • Final project presentations

[Nov 11, 14]

  • Course evaluations are open; your input is greatly appreciated
  • I'll be in class in case you have any questions
  • Your GCloud credits have probably run out by now; if you need to use it, please let me know.
  • To access mongodb on da2 containers please specify the host:

[Nov 7, 9, 12, 16, 18]

  • No class (you are welcome to use the classroom for the final project meetings)

[Oct 28-31, Nov 2-18]

  • Work on final projects

[Oct 31]

  • Lecture (really the last one): How to interpret regression results in R

[Oct 27]

  • Due to a kernel vulnerability patch, da2 was rebooted and the docker containers were restarted; apologies for any interruptions this may have caused

[Oct 28]

  • No class (you are welcome to use the classroom for the final project meetings)

[Oct 21, 24, 26]

[Oct 19]

[Oct 17]

  • Discussion of discovery projects: should include for each API
    1. Why not used/retrieved
    2. If retrieved: a pointer to the script, the collection in mongodb, and the analysis in R
      See, for example,

[Oct 14]

  • Work on Data discovery project

[Oct 12]

  • Work on Data discovery project

[Oct 10]

  • Data analysis using R [3]
  • R requirements for the data discovery

    1. Export data from mongo into .csv, e.g.,

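      from pymongo import MongoClient
      # host string is left empty here, as in the original notes
      client = MongoClient('')
      db = client['d1discovery']
      # 'with' makes sure the file is flushed and closed at the end
      with open('rating.csv', 'w') as f:
          for record in db.cityrating.find():
              # lr[0] is normally '_id', so lr[1] is the first data field
              lr = list(record.keys())
              f.write(lr[1] + ',' + str(record[lr[1]]) + "\n")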
    2. Read data into R

    3. Obtain the summary of the data
    4. Pose a hypothesis (dependent and at least one independent variable)
    5. Do transformation if needed
    6. Fit a model, interpret, do diagnostics
  • Examples are in
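
Step 1 above can also be sketched with Python's csv module, which takes care of quoting and of records that lack a field. This is a minimal sketch, not the course's script: the function name records_to_csv, the field names, and the sample records (standing in for a cursor such as db.cityrating.find()) are assumptions made for illustration.

```python
import csv
import io

def records_to_csv(records, out, fields):
    """Write dict-like records (e.g. from a pymongo cursor) to 'out' as CSV."""
    # extrasaction='ignore' skips keys not listed in 'fields' (such as '_id');
    # restval='' writes an empty cell when a record lacks a field.
    writer = csv.DictWriter(out, fieldnames=fields,
                            extrasaction='ignore', restval='')
    writer.writeheader()
    for record in records:
        writer.writerow(record)

# Hypothetical sample records; the second one has no 'rating'
sample = [
    {'_id': 1, 'city': 'Knoxville', 'rating': 4.5},
    {'_id': 2, 'city': 'Nashville'},
]

buf = io.StringIO()          # an open file object would work the same way
records_to_csv(sample, buf, ['city', 'rating'])
csv_text = buf.getvalue()
```

A file written this way has a header row, so it can be read into R with read.csv for steps 2 to 6.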

[Oct 5]

  • Data analysis using R [2]

[Oct 3]

  • Data analysis using R [1]

[Sep 30]

  • Final projects selected
  • Clarifications on Discovery project
  • Data analysis using R [1]
  • Notes on MongoDB (X in DXDiscovery is your project number)

    1) if you use mongo shell

       #on da2
       > use DXDiscovery
       > ...
       #on gcloud
       > use DXDiscovery
       > ...

    2) if you use python

       import pymongo
       # on da2
       client = pymongo.MongoClient(host="")
       db = client['DXDiscovery']
       # on gcloud
       client = pymongo.MongoClient()
       db = client['DXDiscovery']
    • Notes on retrieving via REST API (see BitBucket REST API example in
      1. Key part is
           import json, requests
           # HTTP GET/POST
           r = requests.get('' + repo, auth=(login, passwd))
           # convert the resulting json into a python data structure
           jt = json.loads(r.text)
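
The JSON-to-Python conversion in the key part above can be tried without network access. This is a minimal sketch under stated assumptions: the response text is made up, and the field names 'values', 'name', and 'next' as well as the helper repo_names are hypothetical, not taken from the BitBucket example.

```python
import json

# Hypothetical response body, standing in for r.text after requests.get(...)
text = '{"values": [{"name": "news"}, {"name": "students"}], "next": null}'

def repo_names(response_text):
    """Parse a JSON response body and return the 'name' of each listed item."""
    jt = json.loads(response_text)   # JSON object -> dict, JSON array -> list
    return [item['name'] for item in jt.get('values', [])]

names = repo_names(text)
# JSON null becomes Python None, so a missing next page is easy to detect
has_more = json.loads(text)['next'] is not None
```

After json.loads the result is ordinary dicts and lists, so the usual indexing, loops, and comprehensions apply.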

[Sep 28]

  • Teasers for the final project (continued)

[Sep 26]

  • Teasers for the final project
  • Remaining questions on GCloud
  • Whoever creates the DXdiscovery fork, please grant the remaining two team members write permissions to fdac/DXdiscovery

[Sep 23]

[Sep 21]

  • Data discovery project (continued)
  • Ideas for the final project
  • The last presentation on A1
    1. John Qiu - A character level LSTM trained on tweets from the Twitter handle "therealdonaldtrump."

[Sep 19]

  • The last two presentations on A1
    1. Timothy Oesch - Comparison of Colossians, I Peter, and II Peter
    2. John Qiu - A character level LSTM trained on tweets from the Twitter handle "therealdonaldtrump."
  • Data discovery project
  • Work on teasers for the final project

[Sep 16]

  • Guest lecture: Graph Analytics in Healthcare Applications by Dr. Rangan Sukumar
  • Abstract: Finding actionable insights from data has always been difficult. As the scale and forms of data increase tremendously, the task of finding value becomes even more challenging. Data scientists at Oak Ridge National Laboratory are leveraging unique leadership infrastructure (e.g. Urika-XA and Urika-GD appliances) to develop scalable algorithms for semantic, logical and statistical reasoning with unstructured Big Data. We present the deployment of such a framework called ORiGAMI (Oak Ridge Graph Analytics for Medical Innovations) on the National Library of Medicine’s SEMANTIC Medline (archive of medical knowledge since 1994). Medline contains over 70 million knowledge nuggets published in 23.5 million papers in medical literature with thousands more added daily. ORiGAMI is available as an open-science medical hypothesis generation tool - both as a web-service and an application programming interface (API) at . Since becoming an online service, ORiGAMI has enabled clinical subject-matter experts to: (i) discover the relationship between beta-blocker treatment and diabetic retinopathy; (ii) hypothesize that xylene is an environmental cancer-causing carcinogen and (iii) aid doctors with diagnosis of challenging cases when rare diseases manifest with common symptoms. In 2015, ORiGAMI was featured in the Historical Clinical Pathological Conference in Baltimore as a demonstration of artificial intelligence to medicine, IEEE/ACM Supercomputing and recognized as a Centennial Showcase Exhibit at the Radiological Society of North America (RSNA) Conference in Chicago. This class will describe the fundamentals leading to the design and deployment of ORiGAMI along with a tutorial on how to use the tool.

[Sep 14]

  • Final reminders for A1

    1. Please remove the part of template that is irrelevant to your project!
    2. Please commit all your code and presentations and create a pull request
  • The presentations selected by groups on Sep 12 will be presented for the entire class. Aaaand the nominees areeee....

    1. Justin Nguyen - Analyzing key speeches from Martin Luther King Jr. and Malcolm X.
    2. Tanner Hobson - Analysis of reading ease level (based on "Flesch–Kincaid readability tests") and cosine similarity of tfidf vectors from 142 different books, each from a different author.
    3. Kriss Gabourel - ...
    4. Tyler Stuessi - Analysis and Comparison of Donald Trump and Hillary Clinton
    5. Tyler Marshall - Analysis of the most used words in song lyrics by billboard genre.
    6. Kelly Deuso - Presidential speeches and their word frequencies.
    7. Timothy Oesch - Comparison of Colossians, I Peter, and II Peter
    8. John Qiu - A character level LSTM trained on tweets from the Twitter handle "therealdonaldtrump."
  • Think about the final project
    • Involves data discovery/retrieval/storage/analysis/presentation
    • Try to find a topic that you like or data that you want to analyze
    • Prepare a brief presentation to recruit others to join your project
    • The final project will have a four page final report
    • Examples of final projects done for the last two years are at and
    • Potential data sources for the final project

[Sep 12]

  • The groups for in-class presentations are defined by the chart A1Groups based on this doc2vec analysis of issue descriptions
    1. Each group is defined by the dendrogram going from top in batches of six (too many groups in batches of five)
    2. E.g., the first group is
      1. Stephen M, Kevin S, Alex C, Tianxiang C, Justin N, and Caleb M.
      2. The second group starts from Tanner H
      3. The third from Seth R,
      4. The fourth from Phillip V
      5. The fifth from Justin W,
      6. The sixth from Hans Kodi,
      7. The seventh from Tyler S
      8. The eighth from Austin D. If your name is not in the dendrogram link, you are in the eighth group.
    3. Please find each other in the class to sit in a single group so you can present to each other.
    4. If you have an unoccupied office nearby, please take your group with you: it will be tight in the classroom.
    5. If you are not around but are available online, please try to join your group via skype/hangouts/zoom/etc. If you are not available, I'll ask someone from your group to present.
    6. Given the size of each group, keep your presentation down to 5 min (with questions) no matter how wonderful it is.
    7. The group will first select a moderator to remind presenters when their 5 min are up and a backup moderator (to remind the moderator when it's their turn to present)
    8. The presentation order is the one in the dendrogram unless the group chooses another order.
    9. The group will elect a representative to present their own or all of the group's work to the entire class. The selection criteria and the decision should be settled within the group.

[Sep 7]

  • Analysis of activities for Assignment1
  • How to solve an encoding error: replace f = open(fname) with f = open(fname, errors='ignore'). Also, please see a function to clean special characters submitted by John Qiu

  • Remaining timeline for Assignment1

    1. In case you encounter any problems retrieving, storing, or analyzing data please discuss them with your peer and, if needed, escalate to me.
    2. Prepare a few slides explaining the approach and findings and present to your peer no later than Sep 9. Presentations should note:

           * What is the question?
           * What was the approach?
           * What problems did I encounter?
           * What results did I get?
           * What new ideas did this generate?
    3. We will have the entire class on Sep 9 to resolve any remaining issues

    4. We will have final presentations in several groups during the class on Sep 12.
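
The encoding fix noted above (errors='ignore') can be checked with a small experiment. This is a minimal sketch, assuming the file is meant to be read as UTF-8 (the original note does not name an encoding); the sample bytes and the temporary file are made up for illustration.

```python
import os
import tempfile

# Bytes that are not valid UTF-8: 0xe9 is 'é' in Latin-1, not in UTF-8
raw = b'caf\xe9 au lait\n'
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    fname = tmp.name

# Strict decoding (the default error handling) raises on the bad byte
try:
    with open(fname, encoding='utf-8') as f:
        f.read()
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='ignore' drops the undecodable byte
with open(fname, encoding='utf-8', errors='ignore') as f:
    ignored = f.read()

# errors='replace' substitutes U+FFFD for the undecodable byte
with open(fname, encoding='utf-8', errors='replace') as f:
    replaced = f.read()

os.remove(fname)
```

errors='ignore' silently discards the bytes it cannot decode, while errors='replace' leaves a visible marker where they were, which can make the data loss easier to notice during analysis.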

[Sep 1]

[Aug 31]

  • Everyone successfully completed Homework0!
  • Today we finished going over Assignment1
    1. Please note the new way to fork: to the owner fdac and project Assignment1
  • FAQ
    1. Q: Do I need to keep stuff in Answer1.ipynb? A: Absolutely not, you can/should create a new notebook USERNAME_Assignment1.ipynb
    2. Q: How do I raise an issue if my peer has not created a fork yet? A: The forks are due Sep 1: that's why we have multiple steps in this assignment. If the fork is not there yet, raise an issue regarding the absence of the fork on fdac/news
    3. Q: How come I need to have X number of commits by a certain date? A: To avoid a situation where all the work is done at the end or the intermediate results are not saved. The purpose of this is to practice frequent commits: do something, then commit before it disappears.

[Aug 29]

  • Homework0 is due

  • Lecture on data discovery finished

  • Will introduce first data analysis assignment

  • Common question: "When I try to ssh into my container, I get a warning saying that the host key may have been changed. By default it's using strict host key checking, which I could turn off. What's the best action for this?"
    Answer: remove the offending key from .ssh/known_hosts (or remove .ssh/known_hosts)

[Aug 26]

[Aug 24]

[Aug 23] Update

  • The access to the servers is now available

    • ssh to using windows (or mac/linux)
      instructions: if you are able to do that, then on your laptop
      browser please enter http://localhost:8888 to access your
      personal ipython notebook server. You can create python2,
      python3, and R notebooks. We will primarily use python3 and,
      later in the course, R

    • Alternatively, if you'd prefer to run the ipython notebooks on your own
      laptop (not from the server) please install docker
      infrastructure per

      • The docker image used for the class is audris/ipython-pymongo16:latest
        • You will need to forward the port 8888 to port 8888 on that docker container on your laptop
        • You will also need to start 'ipython notebook --no-browser' in the container

[Aug 22] Homework Due Aug 26

  • The second task -- configuring ssh
  • Finish the first task (all done!!!)
    • Please make sure your file is not empty and contains meaningful info. In particular, check if your .md file is in fdac/students and, if not, please submit a pull request.
    • I got all .md files but am still waiting for pull requests with .pub files
      from several of you, please submit your .pub files if you
      have not done so yet. I need it to add you to the class
      organization and to enable passwordless login.

[Aug 19]

[Aug 17]

The first tasks

Syllabus for "Fundamentals of Digital Archeology"

  • Course: [COSCS-445/COSCS-545]
  • MK405 2:30-3:20 MWF
  • Instructor: Audris Mockus,
  • TA: Tapajit Dey, Office hours at MK620 2-4 pm on Tuesday
  • Simple rules for getting help:
    1. There are no stupid questions. However, it may be worth going over the following steps:
    2. Think of what the right answer may be.
    3. Search online: stack overflow, etc.
    4. Look through issues
    5. Post the question as an issue.
    6. Ask instructor: email for 1-on-1 help, or
      to set up a time to meet


The course will combine theoretical underpinning of big data with
intense practice. In particular, approaches to ethical concerns,
reproducibility of the results, absence of context, missing data,
and incorrect data will be both discussed and practiced by writing
programs to discover the data in the cloud, to retrieve it by
scraping the deep web, and by structuring, storing, and sampling it
in a way suitable for subsequent decision making. At the end of the
course students will be able to discover, collect, and
clean digital traces, to use such traces to construct meaningful
measures, and to create tools that help with decision making.

Expected Outcomes

Upon completion, students will be able to discover, gather, and analyze
digital traces, will learn how to avoid mistakes common in
the analysis of low-quality data, and will have produced a working
analytics application.

In particular, in addition to practicing critical thinking,
students will acquire the following skills:

  • Use Python and other tools to discover, retrieve, and process data.

  • Use data management techniques to store data locally and in the cloud.

  • Use data analysis methods to explore data and to make predictions.

Course Description

A great volume of complex data is generated as a result of human
activities, including both work and play. To exploit that data for
decision making it is necessary to create software that discovers,
collects, and integrates the data.

Digital archeology relies on traces that are left over in the course
of ordinary activities, for example the logs generated by sensors in
mobile phones, the commits in version control systems, or the email
sent and the documents edited by a knowledge worker. Understanding
such traces is complicated in contrast to data collected using
traditional measurement approaches.

Traditional approaches rely on a highly controlled and well-designed
measurement system. In meteorology, for example, the temperature is
taken in specially designed and carefully selected locations to
avoid direct sunlight and to be at a fixed distance from the ground.
Such measurement can then be trusted to represent these controlled
conditions and the analysis of such data is, consequently, fairly
straightforward.

The measurements from geolocation or other sensors in mobile phones
are affected by numerous (yet not recorded) factors: was the phone
kept in the pocket, was it indoors or outside? The devices are not
calibrated or may not work properly, so the corresponding
measurements would be inaccurate. Locations (without mobile phones)
may not have any measurement, yet may be of the greatest interest.
This lack of context and inaccurate or missing data necessitates
fundamentally new approaches that rely on patterns of behavior to
correct the data, to fill in missing observations, and to elucidate
unrecorded context factors. These steps are needed to obtain
meaningful results from a subsequent analysis.

The course will cover basic principles and effective practices to
increase the integrity of the results obtained from voluminous but
highly unreliable sources.

  • Ethics: legal aspects, privacy, confidentiality, governance

  • Reproducibility: version control, ipython notebook

  • Fundamentals of big data analysis:
    extreme distributions, transformations, quantiles,
    sampling strategies, and
    logistic regression

  • The nature of digital traces:
    lack of context,
    missing values, and
    incorrect data


Students are expected to have basic programming skills, in
particular, be able to use regular expressions, programming concepts
such as variables, functions, loops, and data structures like lists
and dictionaries (for example, COSC 365).

Being familiar with version control systems (e.g., COSC 340), Python
(e.g., COSC 370), and introductory level probability (e.g., ECE 313)
and statistics, such as random variables, distributions, and
regression would be beneficial but is not expected. Everyone is
expected, however, to be willing and highly motivated to catch up in
the areas where they have gaps in the relevant skills.

All the assignments and projects for this class will use git and
Python. Knowledge of Python is not a prerequisite for this course,
provided you are comfortable learning on your own as needed. While
we have strived to make the programming component of this course
straightforward, we will not devote much time to teaching
programming, Python syntax, or any of the libraries and APIs. You
should feel comfortable with:

  1. How to look up Python syntax on Google and StackOverflow.
  2. Basic programming concepts like functions, loops, arrays, dictionaries, strings, and if statements.
  3. How to learn new libraries by reading documentation and reusing examples
  4. Asking questions on StackOverflow or as a GitHub/BitBucket issue.


These apply to real life, as well.

  • Must apply "good programming style" learned in class
    • Optimize for readability
  • Bonus points for:
    • Creativity (as long as requirements are fulfilled)

Teaming Tips

  • Agree on an editor and environment that you're comfortable with
  • The person who's less experienced/comfortable should have more keyboard time
  • Switch who's "driving" regularly
  • Make sure to save the code and send it to others on the team


  • Class Participation – 15%: students are expected to read all
    material covered in a week and come to class prepared to take
    part in the classroom discussions. Responding to other student
    questions (or issues in BitBucket) counts as classroom participation.

  • Assignments - 40%: Each assignment will involve writing (or modifying a template of)
    a small Python program.

  • Project - 45%: one original project done in a group of 2 or 3
    students. The project will explore one or more of the themes covered
    in the course that students find particularly compelling. The
    group needs to submit a project proposal (2 pages IEEE format)
    approximately 1.5 months before the end of term. The proposal
    should provide a brief motivation of the project, detailed
    discussion of the data that will be obtained or used in the project,
    along with a time-line of milestones, and expected outcome.

Other considerations

As a programmer you will never write anything from scratch, but will
reuse code, frameworks, or ideas. You are encouraged to
learn from the work of your peers. However, if you don't try to do
it yourself, you will not learn. Deliberate practice
(activities designed for the sole purpose of effectively improving
specific aspects of an individual's performance) is the only way to
reach perfection.

Please respect the terms of use and/or license of any code you find,
and if you re-implement or duplicate an algorithm or code from
elsewhere, credit the original source with an inline comment.



This class assumes you are confident with this material, but in case you need a brush-up...


  • A MongoDB Schema Analyzer. One JavaScript file that you run with the mongo shell command on a database collection and it attempts to come up with a generalized schema of the datastore. It was also written about on the official MongoDB blog.
R and data analysis
Tutorials written as ipython-notebooks


Similar for GitHub