hatparsing: Visualization, Search and Analysis of Hierarchical Translation Equivalence in Machine Translation Data

Copyright (C) <2013> <Gideon Maillette de Buy Wenniger>
contact: gemdbw AT gmail DOT com

Installation:

This software has been developed under Eclipse using the automatically generated ant build/make file. Eclipse is of course not necessary to work with this code, but it seems an effective way to navigate it, as well as to automatically compile the whole project. A cleaner, hand-tailored ant build file would be better, but still needs to be written.

The project has a weak dependency on itext 5.1.1 for some of the visualization functions (exporting to PDF in particular), but otherwise compiling the code should be very straightforward. Further dependencies are junit for testing (currently junit-4.9b2.jar is used in the libraries folder) and sqlitejdbc (currently sqlitejdbc-v056.jar is used in the libraries folder). The latter is used for building a database of Alignment Triples with associated properties based on the HATParsing, and then for selecting HATs with specific properties from that database.

The versions of junit and sqlitejdbc should not matter too much, provided version 4 or higher is used for the former and an up-to-date version is used for the latter. Both are in any case already included in the repository. Please make sure that these two libraries are added in the Libraries tab of the Java Build Path, or include them during compilation.

For making simple-to-use executable jar files, the author advises using the "FatJar" software, available at: http://fjep.sourceforge.net/. Of course, any other manual or tool-based way of producing jars can work as well, as can simply executing the main Java class directly from the command line or from Eclipse.

For convenience, the Downloads section of bitbucket already contains a pre-compiled and packaged executable hatViewer.jar. This jar contains everything (including itext) and can be used to work directly with the GUI using:

>>> java -jar hatViewer.jar

Of course, for anything beyond simply working with the viewer, installing the actual source code is recommended.

USAGE:

>>A very short description of the usage and structure<<

The main class for the main visualizer tool is HATViewer.java. This class contains the main method that builds the GUI. It is located in: hatparsing/src/extended_bitg . Running this class will open the viewer. As mentioned before, this main class is also what is run by the pre-compiled jar hatViewer.jar available from the Downloads section of this project.

>>Manual Input of Examples<<

The viewer contains three input fields: one for SourceString, one for TargetString and one for Alignment. Through these three fields it is possible to manually input alignment triples and parse them, or to edit triples that have been opened from files. In the "File" menu in the upper left of the GUI, the user can select "Open Source Sentence", "Open Target Sentences" and "Open Alignments", which allow the associated files to be specified and opened through the GUI, one at a time.

>>Configuration Files<<

Alternatively, a configuration file can be used to specify the locations of these files, which is faster and more convenient for repeated use.
For example, a minimal configuration file can look as follows:

    corpusLocation = /home/userX/SomeFolder/ExampleCorpus/
    sourceFileName = sourceFile.txt
    targetFileName = targetFile.txt
    alignmentsFileName = alignmentFile.txt

In other words, the corpusLocation is specified as an absolute path, in this example "/home/userX/SomeFolder/ExampleCorpus/", and the source, target and alignment files are specified only by their names (they are all assumed to reside in the corpus location folder). Please make sure not to change the casing of the configuration file keys (e.g. targetFileName), as these keys are case-sensitive.

>>HAT Selection<<

To the right of the "HAT Visualization" tab in the GUI there is a "HAT Selection" tab. This tab allows Alignment Triples with certain properties to be selected from a database of Alignment Triples and associated properties. In the main window there is a button "Make ConfigFile for Database", which automatically produces a configuration file with generated source, target and alignment file locations, given a pre-computed HATsDatabase file.

>>HAT Database Generation<<

Before selection can work, a database with HAT properties for Alignment Triples is required. Computing a new database is somewhat more involved and currently requires two steps.

First, using the main method in the class "hatparsing/src/alignmentStatistics/MultiThreadTreeStatisticsGenerator.java", global corpus statistics are computed for an aligned corpus, while at the same time a CSV file with example-specific properties for each triple is produced (basically a table with one row of properties per triple).

Second, another class, "hatparsing/src/hat_database/HATSelectionGenerator.java", is used to compute the actual database file, using the method "recomputeDatabase()".

Admittedly, it would be better to integrate these two steps of building the database in a single script, or even better, in a GUI.
One more detail is that it is in principle possible to build a database for several language pairs (we did this for English-French, English-German, and English-German). This is one of the main reasons that the second step of computing the database requires many configuration files, namely those giving the locations of all relevant files for all desired language pairs.

Making this part of the program more user-friendly is something we are currently working on. We are thinking of adding a separate GUI, or a window within the main GUI, for this purpose, where the locations and properties of the corpora for one or multiple language pairs can be specified, and the computation can be done simply and efficiently in one go. (A disadvantage is that HATParsing for large corpora is expensive and benefits a lot from multi-threading, so running a scripted version on a server with many cores may be more attractive.)

>>Basic Alignments Viewer<<

For really basic (non-hierarchical) alignment visualization, a class "hatparsing/src/viewer/BasicAlignmentVisualizer.java" exists in the project. This class does just that and nothing else. The HATVisualizer covers all of this functionality as well, including the option to export the rendered alignment to PDF ("File -> Export Alignment to PDF"), but sometimes simple is easier, and thus better for certain uses, which motivates this simple version. It is also included in the Downloads section as a separate jar, "basicAlignmentVisualizer.jar".
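
>>Example: Reading a Configuration File<<

The key/value layout described under "Configuration Files" happens to match the format understood by java.util.Properties, so the way the three file locations follow from corpusLocation can be illustrated with a short sketch. This is only an illustration under that assumption: the class CorpusConfigSketch and its methods are hypothetical and are not part of hatparsing, and the project's own configuration reader may work differently. Only the four key names are taken from the example above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

/** Hypothetical helper (not part of hatparsing): resolves corpus files from a config. */
public class CorpusConfigSketch {

    // Key names as used in the example configuration file; they are case-sensitive.
    static final String[] FILE_KEYS = {
        "sourceFileName", "targetFileName", "alignmentsFileName"
    };

    /**
     * Reads "key = value" lines and returns the source, target and alignment
     * file paths (in that order), each resolved against corpusLocation.
     */
    public static Path[] resolveCorpusFiles(Reader configReader) {
        Properties config = new Properties();
        try {
            config.load(configReader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Path corpusLocation = Paths.get(require(config, "corpusLocation"));
        Path[] files = new Path[FILE_KEYS.length];
        for (int i = 0; i < FILE_KEYS.length; i++) {
            files[i] = corpusLocation.resolve(require(config, FILE_KEYS[i]));
        }
        return files;
    }

    /** Returns the trimmed value for a key, or fails if the key is absent. */
    static String require(Properties config, String key) {
        String value = config.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("Missing configuration key: " + key);
        }
        return value.trim();
    }

    public static void main(String[] args) {
        String example =
              "corpusLocation = /home/userX/SomeFolder/ExampleCorpus/\n"
            + "sourceFileName = sourceFile.txt\n"
            + "targetFileName = targetFile.txt\n"
            + "alignmentsFileName = alignmentFile.txt\n";
        for (Path file : resolveCorpusFiles(new StringReader(example))) {
            System.out.println(file);
        }
    }
}
```

Because the lookup is by exact key, a key written with different casing (for example targetfilename) is simply not found, which is why the sketch fails loudly on a missing key instead of falling back to a default.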