Project: Spoken Wikipedia audio alignment

This repository contains code to align spoken Wikipedia articles with their respective texts.


  • Wiki Downloader: Downloads audio recordings and corresponding article versions from Wikipedia (also supports downloading other lists of articles).
  • Aligner: Provides all the functionality to align a Wikipedia article. This includes transcript extraction from the wiki.html file, normalization, and the actual alignment. It also provides snippet extraction functionality to generate small snippets from aligned articles, which can be used for training acoustic models.
  • Bootstrapping: A collection of scripts to iteratively improve an acoustic model using Wikipedia articles.
  • Statistics: Some basic statistics for alignment success

How to run it

This repository contains a script that executes all the steps necessary to get alignment data. Here is the current help output, which gives a rough overview of the functionality:

-I, --install                      Install everything.
                                   relevant args: -i
-D, --download                     Download article data.
                                   relevant args: -i -a -l
-P, --gen-prep-jobs                Generate jobs for article preparation.
                                   Extracting transcripts and creating the audio.wav files.
                                   relevant args: -i -a -l -j
-A, --gen-align-jobs               Generate jobs for audio alignment.
                                   relevant args: -i -a -j -m -g -d -A -r
-E, --exec-jobs                    Execute jobs which were generated previously.
                                   relevant args: -i -j -p
--status                           display the amount of jobs that are pending, running, failed, finished
                                   relevant args: -j

Args explained:

-i, --install-dir <directory>      Select where to clone the code repository. Default: code
-a, --article-dir <directory>      Select where to download articles to. Default: articles
                                   Generated files will also be written in subdirectories of this dir.
-l, --language 'german'|'english'  Select which data to download and align. Default: german
-j, --jobset <dir>                 The directory containing the *_jobs dirs.
-m, --model <dir>                  directory of the acoustic model
-g, --g2p <.ser file>              path to an .fst.ser file for g2p conversion
-d, --dict <.dic file>             path to a g2p dictionary
-o, --align-filename <filename>    Filename for the file containing the generated alignments. Default: aligned.swc
-r, --ram <int>                    The amount of ram in GB available to each aligning process. Default: 1
-p, --processes <int>              The number of processes.  Usually the number of cores is good here. Default: 4
                                   pay attention that enough memory for that many processes is available!
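The memory warning for -p/--processes can be turned into a quick sanity check. The following is a minimal sketch, not part of the repository: the function names are made up for illustration, and it assumes that each aligning process really needs the full amount given with -r/--ram.

```python
import os


def max_parallel_processes(total_ram_gb: float, ram_per_process_gb: float, cores: int) -> int:
    """Largest process count that fits both the CPU cores and the memory (at least 1)."""
    if ram_per_process_gb <= 0:
        raise ValueError("ram_per_process_gb must be positive")
    by_memory = int(total_ram_gb // ram_per_process_gb)
    return max(1, min(cores, by_memory))


def physical_ram_gb() -> float:
    """Total physical memory in GB, read via sysconf (works on Linux, not on Windows)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024 ** 3
```

For example, max_parallel_processes(physical_ram_gb(), 1, os.cpu_count() or 1) gives a conservative value for --processes when --ram is left at its default of 1.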

To get alignments:

download the script: wget

execute each step and make sure it worked. If no error messages appear, it most likely worked.

bash --install will download all the software that is needed and install it.

bash --download --language <german|english> will download spoken articles for the given language into a directory. This may take a few hours and needs lots of free space.

bash --gen-prep-jobs --language <german|english> --jobset prep_jobs will generate jobs to convert the audio files, extract the transcript from the wiki html, and tokenize and normalize it.

bash --exec-jobs --jobset prep_jobs will execute the generated jobs. You can also add --processes 8 instead of the default 4 if you have 8 cores available.

bash --gen-align-jobs --model <model dir> --g2p <.ser file> --dict <.dic file> --jobset align_jobs will generate jobs to align the prepared audio and transcript files.

bash --exec-jobs --jobset align_jobs will execute the generated jobs.

The generated jobs can be executed in parallel. Because a job is just a script in a directory, if multiple machines have access to that directory and the data required by the script, then jobs can be executed on different machines. This can speed up the whole process quite a bit.
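To illustrate why plain scripts in a shared directory are enough for distributing work, here is a minimal worker sketch. It is not the repository's job runner: the pending/, running/, finished/ and failed/ subdirectories, the *.sh naming, and the use of sh are assumptions made for this example. The key idea is that renaming a file is atomic on POSIX file systems, so only one worker can claim a given job (on network file systems this guarantee may be weaker).

```python
import os
import subprocess
from pathlib import Path
from typing import Optional


def claim_next_job(jobset: Path) -> Optional[Path]:
    """Move one pending job script to running/ and return its new path, or None."""
    running = jobset / "running"
    running.mkdir(exist_ok=True)
    for script in sorted((jobset / "pending").glob("*.sh")):
        target = running / script.name
        try:
            os.rename(script, target)  # atomic: only one worker succeeds
            return target
        except FileNotFoundError:
            continue  # another worker claimed this job first
    return None


def run_job(jobset: Path, script: Path) -> bool:
    """Run a claimed job and file it under finished/ or failed/ by exit code."""
    result = subprocess.run(["sh", str(script)])
    outcome = "finished" if result.returncode == 0 else "failed"
    (jobset / outcome).mkdir(exist_ok=True)
    os.rename(script, jobset / outcome / script.name)
    return result.returncode == 0


def work(jobset: Path) -> None:
    """Process jobs until no pending job is left."""
    while (script := claim_next_job(jobset)) is not None:
        run_job(jobset, script)
```

Starting work() on several machines that mount the same (hypothetical) jobset directory lets them share the queue without any further coordination.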

For more information on the script and additional fine-tuning switches, first take a look at bash -h. Then take a look at the source: some variables can be configured there but don't have a switch yet.

Building the C# projects

To build the C# projects on Linux, install the following packages:

apt-get install mono-devel libmono-system-core4.0-cil libmono-system-web-extensions4.0-cil


apt-get install mono-complete

Then go to the respective directory and run ./ and then ./<Project>.exe or mono <Project>.exe. Like jar files, the compiled .exe will run on both Windows and Linux, regardless of where it was compiled.

Wiki Downloader

Automatically downloads all (or selected) Spoken Wikipedia articles.

Usage: mkdir english && ./WikiDownloader.exe english.json [Target_BaseDir] [Article_Titles ...]

One subdirectory will be created for each spoken article. Subdirectory names are first UTF-8 encoded and then URL-encoded. (Just paste a name into the URL bar of your browser to decode it :) Spaces in article names are replaced with underscores, as is done in Wikipedia itself.
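The naming scheme can also be reproduced in a few lines of Python. This is only a sketch of the scheme as described above, using the standard urllib.parse module; the helper names are made up, and the exact set of characters the downloader leaves unescaped is an assumption (here everything except letters, digits and "_.-~" is escaped).

```python
from urllib.parse import quote, unquote


def title_to_dirname(title: str) -> str:
    """Replace spaces with underscores, then percent-encode the UTF-8 bytes."""
    return quote(title.replace(" ", "_"), safe="")


def dirname_to_title(dirname: str) -> str:
    """Decode a subdirectory name back to a readable article title."""
    return unquote(dirname).replace("_", " ")
```

Note that decoding is not always the exact inverse: a title that contains a literal underscore cannot be told apart from one that contains a space.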

You may add article titles to limit the download to just these articles instead of all articles that are contained in the category (useful for debugging).

How it works: The program uses the Wikimedia API: . If you want to play around, there is a UI located at .

Known Bug: If downloading does not work because of a TLS error, you are probably missing certificates for Mono. To fix this, you can import all certificates used by Mozilla with mozroots --import --sync.

How it works:

First, the category noted in the config.json is queried for a list of articles. Then meta-information is queried for every article. If it was updated since the last run, the content is downloaded and the page is parsed for templates containing the audio file name. The language-dependent templating code is contained in C#-classes (AudioDownloader_XY.cs).

Then metainformation for the audio file is queried, including the actual download url and location of the info page (wikipedia, or wikimedia commons). The info page of the audio file is parsed for templates containing dates, speaker information and a link to the oldid of the article. If the oldid was not found, it is guessed by the date taken from the templates.
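The date-based fallback can be pictured as follows. This is a sketch under the assumption that "guessing" means picking the newest article revision saved at or before the recording date; the actual rule used by the Wiki Downloader may differ, and the function name and input format are made up for illustration.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple


def guess_oldid(revisions: List[Tuple[datetime, int]], recorded: datetime) -> Optional[int]:
    """Return the oldid of the newest revision saved at or before `recorded`.

    `revisions` holds (timestamp, oldid) pairs in any order. Returns None if
    every revision is newer than the recording date.
    """
    ordered = sorted(revisions)
    timestamps = [timestamp for timestamp, _ in ordered]
    index = bisect_right(timestamps, recorded)
    return ordered[index - 1][1] if index > 0 else None
```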

Then the audio file and the corresponding article version are downloaded and saved on disk. Gathered metainformation is stored in info.json.

KNOWN BUG: Because of shortcomings in the HTTPClient implementation of the Mono framework, handling network errors properly is impossible without implementing your own HTTP client. This means that on Linux the program will fail at the slightest network error. This leaves you with two options: either use Windows or run on a server with a gigabit connection. If the program crashes, you can restart it and it will automatically continue where it left off. It is, however, recommended to delete the last created folder before restarting.

Text Extractor

This program will convert wikipedia articles to something closer to the actual spoken text.

Usage: java -jar TextExtractor.jar <directory> <language>

Where directory is the output of Wiki Downloader:

  • info.json is searched for article.title to retrieve the article title; the fallback is the directory name.
  • wiki.html will be parsed.
  • The resulting text will be written to audio.txt.

The language parameter must be either 'en' or 'de'.
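The title lookup with its fallback can be sketched like this. It is an illustration, not the TextExtractor code (which is Java): it assumes that article.title means a "title" key nested inside an "article" object in info.json, and it additionally URL-decodes the directory name in the fallback case, which the real tool may or may not do.

```python
import json
from pathlib import Path
from urllib.parse import unquote


def article_title(article_dir: Path) -> str:
    """Read article.title from info.json; fall back to the directory name."""
    try:
        info = json.loads((article_dir / "info.json").read_text(encoding="utf-8"))
        title = info["article"]["title"]
        if isinstance(title, str) and title:
            return title
    except (OSError, ValueError, KeyError, TypeError):
        pass  # missing file, broken JSON or unexpected structure
    # Assumption: the directory name is URL-encoded with underscores for spaces.
    return unquote(article_dir.name).replace("_", " ")
```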


Statistics

Want to know how well your alignment went? Just cd to the 'gen' directory of prosub and run ./Statistics.exe.

Output format is:

aligned% last_word_position% ./directory

If you want a nice graph:

cd gen
./Statistics.exe > ../summary.txt
cd ..
grep '^[0-9]' summary.txt | sort -h > summary-sorted.txt
echo "plot 'summary-sorted.txt' u 0:1 w lines" | gnuplot
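If you would rather process the summary in Python than in gnuplot, the output can be parsed along these lines. This is a sketch based only on the output format given above; whether the percent signs are printed literally is an assumption (the parser accepts both), and the names are made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class ArticleStats(NamedTuple):
    aligned: float
    last_word_position: float
    directory: str


def parse_summary(lines: Iterable[str]) -> List[ArticleStats]:
    """Parse 'aligned% last_word_position% ./directory' lines, sorted by aligned%."""
    stats = []
    for line in lines:
        if not line[:1].isdigit():
            continue  # same filter as grep '^[0-9]'
        parts = line.split(maxsplit=2)
        if len(parts) != 3:
            continue
        try:
            aligned = float(parts[0].rstrip("%"))
            last_word = float(parts[1].rstrip("%"))
        except ValueError:
            continue  # not a statistics line after all
        stats.append(ArticleStats(aligned, last_word, parts[2].strip()))
    return sorted(stats)
```

Calling parse_summary(open("summary.txt")) returns one entry per article, which can then be averaged or plotted with any tool you like.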

Some other statistics are available via (read the source to find out how it works).

MAUS Alignments

Phone-level alignments are done with the Münchener Annotations- Und Segmentierungstool (MAUS). Install (some defined version of) MAUS with which resides in Bootstrapping/ at the moment. You also need to use to install SequiturG2P. Run the following steps:

  1. create necessary text snippets and job scripts with java -jar code/Aligner/target/Aligner.jar mausmap articles/ --all --g2p_model YOURG2PMODEL (additional parameters can be set).
  2. run the jobs with -E -p X (where X is your number of processors)
  3. merge back the results: java -jar code/Aligner/target/Aligner.jar mausmerge articles/ --all.

The above should work for German. Wrangling MAUS to align English is a little harder. Try the following:

  1. build a Sequitur model from CMUdict. Don't worry for the moment that CMUdict's entries are all upper-case and that MAUS uses a different phoneset from CMUdict. This is converted by
  2. java -jar code/Aligner/target/Aligner.jar mausmap articles/ --all --maus "maus/maus LANGUAGE=eng-US" --g2p_wrapper sequitur/ --g2p_model YOURG2PMODEL
  3. running and merging as above for German

Generating Acoustic models based on SWC

We provide conversions to two data formats: One for Sphinx (which was used for the alignment) and one for Kaldi.


java -jar code/Aligner/target/Aligner.jar extractsnippets kaldi /dev/null /path/to/articles/ /output/directory/

This will generate four files for Kaldi: text, segments, utt2spk, and wav.scp. Attention: If the files already exist, the data will be appended!

The wav.scp file contains a pseudo-directory ("/path/to/articles/") which needs to be replaced with the real directory:

sed -i 's|/path/to/articles/|/correct/path/to/articles/|g' wav.scp

(This is because we want to distribute the generated data and don't know where it will reside on the users' computer)

Additionally, an id2spk file is generated which maps the speaker ids to the real speaker names. Not needed by kaldi but nice to have.
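Because the extraction appends to existing files, it can be worth checking the output directory before training. The sketch below is not part of the repository; it relies only on the standard Kaldi conventions that every line in text, segments and utt2spk starts with an utterance id, that the second field of segments is a recording id, and that every line in wav.scp starts with a recording id. The function names are made up for illustration.

```python
from pathlib import Path
from typing import List


def first_fields(path: Path) -> List[str]:
    """Return the first whitespace-separated field of every non-empty line."""
    with path.open(encoding="utf-8") as handle:
        return [line.split(maxsplit=1)[0] for line in handle if line.strip()]


def check_kaldi_dir(data_dir: Path) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    names = ("text", "segments", "utt2spk", "wav.scp")
    keys = {name: first_fields(data_dir / name) for name in names}
    for name, ids in keys.items():
        if len(ids) != len(set(ids)):
            problems.append(f"{name}: duplicate ids (was the data appended twice?)")
    utterances = set(keys["segments"])
    for name in ("text", "utt2spk"):
        if set(keys[name]) != utterances:
            problems.append(f"{name}: utterance ids differ from segments")
    with (data_dir / "segments").open(encoding="utf-8") as handle:
        fields_per_line = (line.split() for line in handle)
        used = {fields[1] for fields in fields_per_line if len(fields) >= 2}
    if not used <= set(keys["wav.scp"]):
        problems.append("segments: recording ids missing from wav.scp")
    return problems
```

This is only a quick plausibility check and does not replace Kaldi's own data directory validation.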

Note that the Kaldi snippet extraction processes all articles at once (you have to provide the path to the articles directory), whereas the snippet extraction for Sphinx works on a per-article basis.


Requirements

  • Java, Mono, Perl, Python
  • sox (for audio file conversion and cutting)
  • trang (for relax-ng checking)