Stewart Keystroke and Stylometry Dataset
Access to this repository is provided under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. In short, the dataset is intended for research and cannot be used for commercial applications.
Publications that use this dataset should cite the following:
- Stewart, John C. and Monaco, John V. and Cha, Sung-Hyuk and Tappert, Charles C., "An investigation of keystroke and stylometry traits for authenticating online test takers," International Joint Conference on Biometrics (IJCB), 2011.
- Monaco, John V. and Stewart, John C. and Cha, Sung-Hyuk and Tappert, Charles C., "Behavioral Biometric Verification of Student Identity in Online Course Assessment and Authentication of Authors in Literary Works," Biometrics: Theory, Applications and Systems (BTAS), 2013.
(The following description is taken from the first publication listed above.)
The data were collected from 40 students of a spreadsheet modeling course in the business school of a four-year liberal arts college. The classes met in a desktop computer laboratory where the exams were administered. Although this study investigated an online examination application, the data were captured in a classroom setting for greater experimental control. The 40 students took four online short-answer tests of 10 questions each, spaced at approximately two-week intervals. The students were unaware that their data were being captured for experimental analysis.
There were several problems with the keystroke data collection system. It ran on a weak server that slowed down when all 40 students were accessing the system. The software was designed so that students would click the NEXT button to move to the next question after completing the current one, with no option to return to previous questions. However, because the system responded slowly to the click, some students clicked the NEXT button several times when there was no immediate response, which resulted in skipped questions. Also, some students could not remember the usernames and passwords they had created on the first test and consequently could not log into the second; for the third and fourth tests this problem was resolved by the instructor providing the usernames and passwords when requested. As a result of these data collection problems, data from students who did not complete all four tests or did not answer a sufficient number of questions per test were removed, leaving complete data sets from 30 students, 17 male and 13 female.
The total length of a student's answers ranged from 433 to 1831 words per test, with a mean of 966 and a median of 915 words. Assuming an average word length of five characters (six counting the space between words), the mean of 966 words corresponds to roughly 6000 keystrokes per test as input to the keystroke system.
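The keystroke estimate above can be reproduced with a short calculation. The sketch below is illustrative only: the function name `estimate_keystrokes` is hypothetical, and the model (word count times six characters) ignores corrections, backspaces, and modifier keys, so actual keystroke counts in the dataset may differ.

```python
# Rough keystroke estimate from the word counts reported above.
# Assumption (from the text): five characters per word plus one space.
CHARS_PER_WORD = 6

# Word counts per test reported in the text.
MIN_WORDS = 433
MEAN_WORDS = 966
MAX_WORDS = 1831


def estimate_keystrokes(words: int, chars_per_word: int = CHARS_PER_WORD) -> int:
    """Approximate keystrokes as words times characters per word.

    Hypothetical helper: it does not count corrections or modifier keys.
    """
    return words * chars_per_word


low_estimate = estimate_keystrokes(MIN_WORDS)    # shortest set of answers
mean_estimate = estimate_keystrokes(MEAN_WORDS)  # the "roughly 6000" figure
high_estimate = estimate_keystrokes(MAX_WORDS)   # longest set of answers

print(low_estimate, mean_estimate, high_estimate)
```

Under these assumptions the mean works out to 5796 keystrokes, which rounds to the figure of roughly 6000 quoted above.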
All the tests were taken on classroom Dell desktop computers with associated Dell keyboards. Training and testing on the same type of keyboard is optimal because keystroke data are known to vary across keyboards, environmental conditions, and types of text.