

ParaVoz - a simple web interface for querying parallel corpora, version 2.0

The ParaVoz package provides a simple yet effective interface for a parallel corpus
using OpenCWB (http://cwb.sourceforge.net). It should work on any Linux machine
with only minimal changes to the settings files to reflect paths and language codes.
All settings are found in the settings directory.

ParaVoz 2.0 extends (but does not replace) ParaVoz 1.0 and is more intuitive, but probably
less suited to corpora with a large number of languages; it is best used with a corpus
of two or three languages. Unlike ParaVoz 1.0, ParaVoz 2.0 expects the parallel
corpus to be encoded as a single corpus file for each language, rather than one for each text
in the corpus. ParaVoz 2.0 supports both sentence and word alignment.

For ParaVoz 2.0, see the demo at http://parasolcorpus.org/ParaVoz. For ParaVoz 1.0,
see the movie on the ParaSol website (http://parasolcorpus.org).

1) Use CWB (http://cwb.sourceforge.net) to encode your parallel corpus in the following way:
a) Encode each file as fullcorpus_ln, with ln standing for a two-letter language code,
e.g. fullcorpus_en, fullcorpus_ru, fullcorpus_de for an English, Russian and German parallel corpus.
b) Align the corpora using tags named <alig_LN1_LN2>, with LN1 and LN2
standing for the language codes of the pairwise aligned versions (e.g., <alig_EN_RU> in the above case).
c) Annotation with a "lemma" and a "tag" attribute is supported out of the box.
d) For word alignment, include in each corpus a positional attribute holding the word-aligned forms in the other languages (see the demo versions for an example).
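
The naming scheme from steps 1a and 1b can be sketched as a small helper. This is only an illustration: the function names are hypothetical and not part of ParaVoz or CWB, and generating one tag per unordered language pair (in the order the codes are listed) is an assumption, since the text above does not say whether both directions are required.

```javascript
// Sketch only: helper names are hypothetical, not part of ParaVoz or CWB.

// Step 1a: one corpus per language, named fullcorpus_ln.
function corpusNames(codes) {
  return codes.map((code) => "fullcorpus_" + code.toLowerCase());
}

// Step 1b: one alignment tag per language pair, named alig_LN1_LN2.
// Assumption: each unordered pair is listed once, in the given order.
function alignmentTags(codes) {
  const tags = [];
  for (let i = 0; i < codes.length; i++) {
    for (let j = i + 1; j < codes.length; j++) {
      tags.push("alig_" + codes[i].toUpperCase() + "_" + codes[j].toUpperCase());
    }
  }
  return tags;
}

const codes = ["en", "ru", "de"];
console.log(corpusNames(codes).join(", "));
console.log(alignmentTags(codes).join(", "));
```

For the three-language example above this lists fullcorpus_en, fullcorpus_ru and fullcorpus_de, and the tags alig_EN_RU, alig_EN_DE and alig_RU_DE.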

2) Then unpack ParaVoz to a location that Apache serves (e.g., /var/www/htdocs/ParaVoz) and
a) edit settings/init.php to reflect the correct corpus path (variable $PARCORPUSDIR) and registry path ($REGISTRY), as well as any positional attributes other than tag and lemma
b) optionally, edit settings/languages.json to change the languages available in the corpus,
as follows:
	- the file contains a list of language objects (each in {} braces)
	- each language object consists of four fields:
	i) name: conventional language abbreviation (e.g. "en", "pl" etc.)
	ii) corpus: name of the CWB corpus for that language
	iii) use: if the value is set to true (without quotes), this language
	   becomes available on the query page
	iv) primary: set to true or false (without quotes) to indicate the primary language
c) the file settings/meta.json stores information about corpus metadata and can be
   edited in a similar way. Available fields:
   i) name: name of the metadata field
   ii) value: this should be left empty
   iii) type: type of the metadata field; at present only the "text" option is available
   iv) hint: as metadata field names can be any computer-readable strings, this field
       lets you provide human-readable text
   v) inResults: set to true or false (without quotes) to indicate whether the field should
	  be visible on the results page
d) interface translation is possible through the angular-gettext module.
If you want to add another language, edit js/translations.js
and add custom translations according to the following structure:
	gettextCatalog.setStrings('custom_language_code', {"eng_term1": "transl1", "eng_term2": "transl2", ...})
You can also prepare a .po file from the template file (po/template.pot), either manually or using
an application like PoEdit (see poedit.net), and then use the Grunt file (Gruntfile.js in the main
directory) to generate a new translations.js file. For more information about using Grunt,
see gruntjs.com.
If you see a [MISSING] tag (e.g. [MISSING]: Search), it means that the translation string was
not found in translations.js.
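
To illustrate how the setStrings structure from step 2d relates to the [MISSING] tag, here is a minimal stand-in for a translation catalog. It is not the angular-gettext implementation; createCatalog and getString as written here are hypothetical, and the language code "de" and the German strings are made-up sample data.

```javascript
// Sketch only: a tiny stand-in catalog, NOT the angular-gettext implementation.
function createCatalog() {
  const strings = {};
  return {
    // Same call shape as in js/translations.js: a language code plus
    // an object mapping English terms to their translations.
    setStrings(languageCode, translations) {
      strings[languageCode] = Object.assign({}, strings[languageCode], translations);
    },
    // Return the translation, or mark the term as missing.
    getString(languageCode, term) {
      const table = strings[languageCode] || {};
      return Object.prototype.hasOwnProperty.call(table, term)
        ? table[term]
        : "[MISSING]: " + term;
    },
  };
}

const catalog = createCatalog();
// Hypothetical sample translations for a custom language code "de".
catalog.setStrings("de", { Search: "Suche", Clear: "Leeren" });
console.log(catalog.getString("de", "Search"));
console.log(catalog.getString("de", "Show"));
```

In this sketch "Search" has an entry and is translated, while "Show" has none and comes back as "[MISSING]: Show", which is the kind of output that points to a string still to be added to translations.js.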

3) Enjoy your parallel corpus with ParaVoz
(http://www.youtube.com/watch?v=7Z9xHzufmOk).


Description of files:
1. Folders:
folder css: holds CSS style sheets for the CWB HTML
folder js: JavaScript functions for the query page (including translations.js, which
stores data for the multilingual interface translation)
folder jsapp: AngularJS scripts for the query page
folder po: files used to generate the translations.js file
folder settings: all settings, mainly CWB and corpus paths and language selection

2. Query page 
autocomplete.php: provides autocomplete functionality for form fields
index.html: main entrance into the corpus / query page
query_form.php: query page html (called by index.php)
query_form_objects.php: query page functions (called by index.php)
query_table_of_texts.php: table with texts on query page (called by index.php)
results.php: main results page, branches into XML-based and CWB HTML-based concordance
styl.less: additional stylesheet file

2.1 XML-based concordance files 
results_xml.php: concordance of results
results_context_xml.php: concordance of results
parallel-csv.xsl: XSLT sheet for csv result export
parallel-export.xsl: XSLT sheet for XML export
parallel-kwic.xsl: XSLT sheet for concordance

2.3 Important files in settings directory (need to be changed to reflect corpus path etc.)
init.php: CWB and corpus paths, context specs for CWB query 
languages.json: stores information about languages used for queries
meta.json: information about metadata
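
As a rough illustration of the two JSON settings files, the sketch below builds sample entries with the fields listed in steps 2b and 2c and runs a simple sanity check. All values are hypothetical, the exact top-level layout of the real files may differ, and the rule that exactly one enabled language is primary is an assumption, as is the checkLanguages helper itself.

```javascript
// Sketch only: sample values are hypothetical; field names follow steps 2b and 2c.
const languages = [
  { name: "en", corpus: "fullcorpus_en", use: true, primary: true },
  { name: "ru", corpus: "fullcorpus_ru", use: true, primary: false },
];

const meta = [
  // "author" is a made-up metadata field name.
  { name: "author", value: "", type: "text", hint: "Author of the text", inResults: true },
];

// Hypothetical helper: report obvious problems in a list of language objects.
function checkLanguages(entries) {
  const problems = [];
  entries.forEach((entry, index) => {
    for (const field of ["name", "corpus", "use", "primary"]) {
      if (!(field in entry)) {
        problems.push("language " + index + ": missing field " + field);
      }
    }
    for (const flag of ["use", "primary"]) {
      if (flag in entry && typeof entry[flag] !== "boolean") {
        problems.push("language " + index + ": " + flag + " must be true or false without quotes");
      }
    }
  });
  // Assumption: exactly one enabled language is marked as primary.
  const primaryCount = entries.filter((e) => e.use === true && e.primary === true).length;
  if (primaryCount !== 1) {
    problems.push("expected exactly one enabled primary language, found " + primaryCount);
  }
  return problems;
}

console.log(JSON.stringify(languages, null, 2));
console.log(JSON.stringify(meta, null, 2));
console.log(checkLanguages(languages));
```

The check mainly catches the quoting mistake the settings notes warn about: "true" written with quotes is a string, not a boolean, and would be reported.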

2.4 Important files for interface translation:
js/translations.js: stores translation strings
Gruntfile.js: can be used to generate new translations.js file
package.json: configuration file for Grunt
po/template.pot: template file which stores every string that should be translated

The ParaVoz interface was designed to be easy to understand and use, but a few explanations may be helpful.

A. Language windows
Depending on the settings in the settings/languages.json file, there are
two (most typically) or more language windows. The first window is for the primary language.
With these windows you can create a CQP query to search your parallel corpora.
Each window consists of two main areas: basic search and CQP search.
1) Basic search consists of at least one row of form fields, with each row corresponding to one token
in the CQP query. There are three fields: one for word form search, one for lemma search and
one for grammatical tag search. With the plus and minus signs you can add or remove rows.
Below the rows is the metadata field. If you want to specify metadata information,
click on the 'Show' button to expand this field. In the metadata area you can enter the relevant
data in the fields and decide (by clicking the checkbox next to a field) whether that field
should be shown in the results.
Most changes in the Basic search area are reflected in the CQP search field.
2) CQP search: if you use Basic search, you can see the exact CQP query here. If you feel
confident, you can also modify this query to use advanced functions (warning: if the CQP field
has been altered manually, any change in the Basic search area will erase these modifications).
You can use the 'Clear' button to quickly erase all fields in CQP and Basic search.
The '@' sign in the primary language CQP field means that results should be color-coded to
indicate word alignment.
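
The relation between Basic search rows and the CQP query can be sketched as follows. This is an illustration under assumptions, not the code ParaVoz actually uses: the function names are hypothetical, the positional attribute names word, lemma and tag are assumed, field values are passed through unescaped (CQP treats them as regular expressions), and the exact query text ParaVoz generates may differ.

```javascript
// Sketch only: not the ParaVoz implementation; attribute names are assumed.

// One Basic search row (word form, lemma, tag) becomes one CQP token expression.
function rowToToken(row) {
  const conditions = [];
  if (row.word) conditions.push('word="' + row.word + '"');
  if (row.lemma) conditions.push('lemma="' + row.lemma + '"');
  if (row.tag) conditions.push('tag="' + row.tag + '"');
  // A row with no filled fields yields [], which matches any token in CQP.
  return "[" + conditions.join(" & ") + "]";
}

// Several rows become a sequence of adjacent tokens.
function rowsToQuery(rows) {
  return rows.map(rowToToken).join(" ");
}

// Hypothetical input: two rows as a user might fill them in.
console.log(rowsToQuery([{ lemma: "go" }, { word: "home", tag: "NN" }]));
// prints: [lemma="go"] [word="home" & tag="NN"]
```

A query built this way is what would appear in the CQP search field, where it could then be edited by hand.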

B. Other elements
Below the language windows are the buttons for getting results. The first button opens a new page
with the results; the second generates an XML file with the results. You can also change the interface
language in the upper-right corner of the query page.


This web interface to CWB was initially written by Roland Meyer for use with the
ParaSol corpus (then Regensburg Parallel Corpus) in 2006 and has since been in
development by successive authors. The JavaScript-based functionality was mainly added by Andreas Zeman, and XSLT support in the new modular interface mainly by Ruprecht von Waldenfels, who has supervised the publication as open source. Part of the architecture is described in Waldenfels (2011). We thank the Center for the Study of Language and Society, University of Berne (http://www.csls.unibe.ch), for granting financial support enabling the publication of ParaVoz as open source at this stage.
ParaVoz 2.0 was then developed during work on a German-Polish parallel corpus supported by a grant from the Johannes Gutenberg University Mainz, mostly by Michal Wozniak, with valuable input from Jan Machalica and under the supervision of Ruprecht von Waldenfels.

If you use the interface, please cite it as
Roland Meyer, Ruprecht von Waldenfels, Michal Wozniak, Andreas Zeman (2006-2015): ParaVoz - 
a simple web interface for querying parallel corpora. Second Version. Bern, Regensburg, Berlin, Krakow.


Copyright (C) 2006-2015 Roland Meyer, Ruprecht von Waldenfels, Michal Wozniak, Andreas Zeman

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 59 Temple
Place, Suite 330, Boston, MA 02111-1307 USA