NAIST-QACLEF
----------------------------------------------------------------------------
Author         : Philip Arthur
Release Year   : 2013

This software is the technical implementation of the following paper:
http://www.phontron.com/paper/arthur13clef.pdf

The training algorithm (T-MERT), which maximizes the C@1 objective function,
was developed in collaboration with Graham Neubig.
http://phontron.com/tmert 
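
For reference, C@1 rewards leaving a question unanswered over answering it
wrongly. The sketch below assumes the commonly used definition
c@1 = (n_R + n_U * n_R / n) / n, where n_R is the number of correct answers,
n_U the number of unanswered questions and n the total number of questions.
The function and argument names are illustrative and are not taken from this
code base.

```python
def c_at_1(n_correct, n_unanswered, n_total):
    """Compute C@1 = (n_R + n_U * n_R / n) / n.

    Assumption: this is the standard C@1 definition; it is not copied
    from tmert.py.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    n = float(n_total)
    # Unanswered questions earn partial credit equal to the overall accuracy.
    return (n_correct + n_unanswered * (n_correct / n)) / n
```

With 5 correct answers, 2 unanswered questions and 10 questions in total, this
gives (5 + 2 * 0.5) / 10 = 0.6.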

Note: the file 'tmert.py' is a slight modification of the original tmert. The modification
encapsulates the run_mert method so that the script can be imported from other Python scripts.

License:
----------------------------------------------------------------------------
All rights reserved. This program and the accompanying materials
are made available under the terms of the GNU Lesser General Public License
(LGPL) version 2.1 which accompanies this distribution, and is available at
http://www.gnu.org/licenses/lgpl-2.1.html

This library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.


Release LOG
---------------------------------------------------------------------------
v1.1.0
  - Add POS TAG
  - Add Word Splitter 
  - Add Matrix view to Report
  - Change from tuple-based to list-based
  - Change Porter Stemmer to WordNet Lemmatizer
  - Save Preprocessed Data separately
  - Readjust l from 1.0 to 0.5

v1.0.0
  - Initial Release

Installation:
----------------------------------------------------------------------------
[ FUNDAMENTAL ]
- Python 2.7
  The software is written mainly in Python, so you need to install Python first.
  Currently only Python 2.7.x is supported.
  http://www.python.org/download/releases/2.7/

- Java
  Java is used in the parts of the program that call the Stanford jars.
  Make sure the java executable is on your "PATH" environment variable. The Java
  code in the lib directory was compiled with javac 1.6.0_27.
  http://www.oracle.com/technetwork/java/javase/overview/index.html

[ SPECIALIZED ]
- NLTK
  NLTK stands for Natural Language Toolkit. It is used for preprocessing.
  http://nltk.org/install.html

- Stanford NER
  Stanford Named Entity Recognizer (NER) is used as part of preprocessing. Please download it from

  http://nlp.stanford.edu/software/CRF-NER.shtml

  then EXTRACT it and PUT the path to its jar in configuration.py.
  The default configuration is '/usr/share/stanford-ner/stanford-ner.jar'.
  Please change it to match your machine.

  The classifier (tagset) used is english.conll.4class.distsim.crf.ser.gz
  [it is tied to the coreference resolution method in preprocessing.py]

- Stanford Parser
  Stanford Parser is used as part of preprocessing to split the input sentence.

  http://nlp.stanford.edu/software/lex-parser.shtml

  As with Stanford NER, please EXTRACT it and PUT its path in configuration.py.
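
Before the first run, it can help to verify that the prerequisites above are in
place. The following is a minimal sketch and not part of the distribution: the
helper names are hypothetical, and only the default Stanford NER path is taken
from this README. Add the path of your Stanford Parser jar to the list yourself.

```python
import os


def find_on_path(program):
    """Return the full path of `program` if it is on PATH, else None."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def check_prerequisites(jar_paths):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if find_on_path("java") is None and find_on_path("java.exe") is None:
        problems.append("java was not found on PATH")
    for jar in jar_paths:
        if not os.path.isfile(jar):
            problems.append("missing jar: " + jar)
    try:
        import nltk  # only checks that NLTK is installed
    except ImportError:
        problems.append("NLTK is not installed")
    return problems


if __name__ == "__main__":
    # Default NER path from this README; adjust it to your machine.
    for problem in check_prerequisites(["/usr/share/stanford-ner/stanford-ner.jar"]):
        print(problem)
```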

Technical Usage Questions
------------------------------------------------------------------------------
1. What is the main script?
	- The main script is 'qaclef.py'

2. I want to work with the QA4MRE GS 2011 data only (train and test on the same data). How can I do that?
	- Run 'python qaclef.py --selftest --data 2011'

3. I want to work with multiple data sets and test on them. How can I do that?
	- Run 'python qaclef.py --selftest --data 2011 2012 2013'
	  (an example for working with 2011, 2012, and 2013)

4. I want to train the system on 2011 and test it on 2012. How can I do that?
	- Run 'python qaclef.py --data 2011 --test 2012 --train'

5. I want to change a parameter (e.g. the ngram or the threshold). How can I do that?
	- Add --ngram=[some_positive_integer] and --threshold=[some_float] to the command.
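
The flags above combine as in the following sketch. It is an illustration built
with argparse, not the parser actually used in qaclef.py. Only the flag names
come from this README; the value types, the None defaults, and the assumption
that --data and --test accept one or more editions are inferred from the
examples above.

```python
import argparse


def positive_int(text):
    """argparse type: accept only integers greater than zero."""
    value = int(text)
    if value <= 0:
        raise argparse.ArgumentTypeError("expected a positive integer")
    return value


def build_parser():
    """Build a hypothetical parser mirroring the flags shown in this README."""
    parser = argparse.ArgumentParser(prog="qaclef.py")
    parser.add_argument("--data", nargs="+", default=[],
                        help="editions to work with, e.g. 2011 2012 2013")
    parser.add_argument("--test", nargs="+", default=[],
                        help="editions to test on, e.g. 2012")
    parser.add_argument("--selftest", action="store_true",
                        help="train and test on the same data")
    parser.add_argument("--train", action="store_true",
                        help="force training")
    parser.add_argument("--ngram", type=positive_int, default=None)
    parser.add_argument("--threshold", type=float, default=None)
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Because --data takes one or more values, it stops consuming arguments at the
next flag, which is why '--data 2011 --test 2012 --train' parses as expected.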

Some Technical Details
-------------------------------------------------------------------------------
1. This software runs well with at least 4 GB of memory.

2. The software stores its output on the file system.
   You can find the details of the output in the 'cache' folder.

3. The software has a 'by-passing' feature that skips a process
   if it finds that process's saved output in the 'cache' folder.

	- Preprocess -> looks for '[edition]-preprocessed.txt'
		.) It loads the data of a particular edition from the corresponding file.
		.) To force preprocessing, add the '--preprocess' flag.
		.) Run the software with '--preprocessonly' to do preprocessing only.

	- TMERT Training -> looks for 'weight.txt'
		.) To force training, add the '--train' flag.
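
The by-passing behaviour described above amounts to a load-or-compute pattern.
A minimal sketch, assuming a hypothetical helper named load_or_run and
plain-text cache files; the actual file format and function names in the
software may differ.

```python
import os


def load_or_run(cache_path, compute, force=False):
    """Return the cached text if present (and not forced), else compute and save it.

    `compute` is any zero-argument function returning a string. All names
    here are hypothetical and not taken from the code base.
    """
    if not force and os.path.isfile(cache_path):
        with open(cache_path) as handle:
            return handle.read()
    result = compute()
    directory = os.path.dirname(cache_path)
    if directory and not os.path.isdir(directory):
        os.makedirs(directory)
    with open(cache_path, "w") as handle:
        handle.write(result)
    return result
```

Under these assumptions, a forced run such as '--preprocess' corresponds to
calling the helper with force=True, and a normal run to force=False.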

Cache
-------------------------------------------------------------------------------
As mentioned above, the software writes the results of its processes to the file system.
However, not all of these files are used to by-pass processes.

Some of them are intended for debugging and analysis. They are:
 - 1 - 5 [preprocessing-details] -> for preprocessing
 - final.txt -> final test model (written in JSON)
 - qa-mert_input.txt -> generated input for tmert.py.
   If you wish to run tmert separately, this is the training data generated by the latest run.

Report
-------------------------------------------------------------------------------
The system can generate a human-readable report when '--report' is added to the command.
The report is written to 'report/report.txt'

FAQ
-------------------------------------------------------------------------------
1. How can I add a feature to the system?
	- Implement the details of the algorithm in metric.py, then map it in scoring.py.
2. How can I refine the preprocessing of the system?
	- By modifying preprocessing.py
3. How can I modify the model of the system?
	- Please take a look at model_builder.py
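
To illustrate the first answer, here is a purely hypothetical sketch of the
pattern: a feature function (the kind of code that would live in metric.py) and
a table that maps a feature name to it (the kind of mapping that would live in
scoring.py). None of the names, signatures, or the weighted-sum scoring below
come from the actual modules; check the source before adapting it.

```python
def word_overlap(question_tokens, sentence_tokens):
    """Hypothetical feature: fraction of question tokens found in the sentence."""
    if not question_tokens:
        return 0.0
    sentence_set = set(sentence_tokens)
    hits = sum(1 for token in question_tokens if token in sentence_set)
    return hits / float(len(question_tokens))


# Hypothetical mapping from feature name to feature function.
FEATURES = {"word_overlap": word_overlap}


def score(question_tokens, sentence_tokens, weights):
    """Weighted sum of all registered features (missing weights count as 0.0)."""
    return sum(weights.get(name, 0.0) * feature(question_tokens, sentence_tokens)
               for name, feature in FEATURES.items())
```

In this sketch, adding a feature means writing one more function and adding one
entry to the FEATURES table.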

Contact
-----------------------------
Philip Arthur: philip.arthur.om0@is.naist.jp
Graham Neubig: neubig@is.naist.jp