========
Overview
========

This archive contains the training corpus for the "Plagiarism Detection: Detailed Comparison" task of the PAN 2012 Lab, held in conjunction with the CLEF 2012 conference.

Find out about all the details at http://pan.webis.de.



===========================
Training Corpus Description
===========================

The corpus comprises 

/susp: 1,804 suspicious documents as plain text.
/src : 4,210 source documents as plain text.

The suspicious documents contain passages 'plagiarized' from the source documents, obfuscated with one of five different obfuscation techniques. See [1] on page 75 for a detailed description of these techniques.

Furthermore, the corpus contains 6,000 XML files, each of which reports, for a pair of a suspicious and a source document, the exact locations of the plagiarized passages. The XML files are split into six datasets:

/01_no_plagiarism: XML files for 1,000 document pairs without any plagiarism.
/02_no_obfuscation: XML files for 1,000 document pairs where the suspicious document contains exact copies of passages in the source document.
/03_artificial_low: XML files for 1,000 document pairs where the plagiarized passages are obfuscated by means of moderate word shuffling.
/04_artificial_high: XML files for 1,000 document pairs where the plagiarized passages are obfuscated by means of heavier word shuffling.
/05_translation: XML files for 1,000 document pairs where the plagiarized passages are obfuscated by translation into a different language.
/06_simulated_paraphrase: XML files for 1,000 document pairs where the plagiarized passages are obfuscated by humans via Amazon Mechanical Turk.

In addition to the XML files, each folder contains a text file called 'pairs'. For the 1,000 document pairs (XML files) in the folder, this file lists, one pair per row, the filenames of the suspicious and the source document, separated by a single space:

  suspicious-document00086.txt source-document00171.txt
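The 'pairs' file can be read with a few lines of Python; the following is a minimal sketch (the function name read_pairs is ours, not part of the corpus tooling):

```python
def read_pairs(path):
    """Read a 'pairs' file into a list of (suspicious, source) filename tuples."""
    pairs = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            susp, src = line.split()  # two filenames separated by whitespace
            pairs.append((susp, src))
    return pairs
```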

If you would like to evaluate your detection program against one of the datasets, have it produce, for each document pair in the 'pairs' file, a detection XML file that reports the plagiarized passages it found. The detection XML file must have the following structure:

<document reference="...">    <!-- file name of the suspicious document        -->
<feature
  name="detected-plagiarism"  <!-- type of the plagiarism annotation           -->
  this_offset="5"             <!-- char offset within the suspicious document  -->
  this_length="1000"          <!-- number of chars beginning at the offset     -->

  source_reference="..."      <!-- file name of the source document            -->
  source_offset="100"         <!-- char offset within the source document      -->
  source_length="1000"        <!-- number of chars beginning at the offset     -->

/>
...                           <!-- more detections in this suspicious document -->
</document>

The XML structure is identical to that of the reference XML files contained in this training corpus. Only the value of the name attribute in the feature elements differs (there it is 'plagiarism').
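As a sketch of how such a detection file could be produced with Python's standard library (the function write_detections and its dictionary keys are illustrative helpers of ours, not part of the provided tooling; the offsets below are placeholder values):

```python
import xml.etree.ElementTree as ET

def write_detections(susp_file, detections, out_path):
    """Write one detection XML file in the required format.

    detections: list of dicts with keys this_offset, this_length,
    source_reference, source_offset, source_length.
    """
    root = ET.Element("document", reference=susp_file)
    for d in detections:
        # Each detection becomes one <feature> element; all values are strings.
        ET.SubElement(root, "feature",
                      name="detected-plagiarism",
                      this_offset=str(d["this_offset"]),
                      this_length=str(d["this_length"]),
                      source_reference=d["source_reference"],
                      source_offset=str(d["source_offset"]),
                      source_length=str(d["source_length"]))
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)
```

One such file would be written per suspicious document listed in the 'pairs' file, all into a common detection folder.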

To evaluate the quality of your detection, you can use either the Python script 'perfmeasures.py' or the TIRA evaluation service. In either case, place all the detection XML files for a dataset into a common folder <your-detection-dir>.
If you use the script, call it with:
 
  perfmeasures.py -p <reference-dir> -d <your-detection-dir> 
  
As <reference-dir>, specify the respective folder with the reference XML files (e.g. 01_no_plagiarism).
If you use TIRA, compress your detection folder into a ZIP file and upload it via the form on the web page.

The script and the web service are available via the PAN 2012 web page, http://pan.webis.de/.



==========
References
==========

[1] http://www.uni-weimar.de/medien/webis/publications/papers/potthast_2011b.pdf