Project: Spoken Wikipedia audio alignment
This repository contains code to align spoken Wikipedia articles with their respective texts.
- Wiki Downloader: Downloads audio recordings and corresponding article versions from Wikipedia (also supports downloading other lists of articles).
- Aligner: Provides all the functionality to align a Wikipedia article. This includes transcript extraction from the wiki.html file, normalization, and the actual alignment. It also provides snippet extraction functionality to generate small snippets from aligned articles, which can be used for training acoustic models.
- Bootstrapping: A collection of scripts to iteratively improve an acoustic model using Wikipedia articles.
- Statistics: Some basic statistics for alignment success
How to run it
This repository contains a script (master_script.sh) that will execute all the steps necessary to get alignment data.
Here is the current help output, which gives a rough overview of the functionality:
```
-I, --install         Install everything.
                      relevant args: -i
-D, --download        Download article data.
                      relevant args: -i -a -l
-P, --gen-prep-jobs   Generate jobs for article preparation: extracting
                      transcripts and creating the audio.wav files.
                      relevant args: -i -a -l -j
-A, --gen-align-jobs  Generate jobs for audio alignment.
                      relevant args: -i -a -j -m -g -d -A -r
-E, --exec-jobs       Execute jobs which were generated previously.
                      relevant args: -i -j -p
    --status          Display the number of jobs that are pending, running,
                      failed, or finished.
                      relevant args: -j

Args explained:
-i, --install-dir <directory>    Select where to clone the code repository. Default: code
-a, --article-dir <directory>    Select where to download articles to. Default: articles
                                 Generated files will also be written in subdirectories of this dir.
-l, --language 'german'|'english'  Select which data to download and align. Default: german
-j, --jobset <dir>               The directory containing the *_jobs dirs.
-m, --model <dir>                Directory of the acoustic model.
-g, --g2p <.ser file>            Path to an .fst.ser file for g2p conversion.
-d, --dict <.dic file>           Path to a g2p dictionary.
-o, --align-filename <filename>  Filename for the file containing the generated alignments. Default: aligned.swc
-r, --ram <int>                  The amount of RAM in GB available to each aligning process. Default: 1
-p, --processes <int>            The number of processes; usually the number of cores is good here. Default: 4
                                 Make sure enough memory is available for that many processes!
```
To get alignments, download the script, then execute each of the following steps in order and make sure it worked. If no error messages appear, it most likely succeeded.
bash master_script.sh --install will download all the software that is needed and install it.
bash master_script.sh --download --language <german|english> will download spoken articles for the given language into a directory. This may take a few hours and needs lots of free space.
bash master_script.sh --gen-prep-jobs --language <german|english> --jobset prep_jobs will generate jobs to convert the audio files and to extract the transcript from the wiki HTML and tokenize and normalize it.
bash master_script.sh --exec-jobs --jobset prep_jobs will execute the generated jobs. You can also add --processes 8 instead of the default 4 if you have 8 cores available.
bash master_script.sh --gen-align-jobs --model <model dir> --g2p <.ser file> --dict <.dic file> --jobset align_jobs will generate jobs to align the prepared audio and transcript files.
bash master_script.sh --exec-jobs --jobset align_jobs will execute the generated jobs.
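Put together, the steps above can be collected into a single script. This is a sketch only: the model directory, .fst.ser file, and .dic file are placeholders you must replace with your own paths.

```shell
# Sketch of the full pipeline from the steps above. model/, model.fst.ser and
# dict.dic are placeholder paths, not files shipped with the repository.
cat > run_alignment.sh <<'EOF'
#!/bin/bash
set -e
bash master_script.sh --install
bash master_script.sh --download --language german
bash master_script.sh --gen-prep-jobs --language german --jobset prep_jobs
bash master_script.sh --exec-jobs --jobset prep_jobs
bash master_script.sh --gen-align-jobs --model model/ --g2p model.fst.ser \
    --dict dict.dic --jobset align_jobs
bash master_script.sh --exec-jobs --jobset align_jobs
EOF
# Only check the syntax here; actually running it takes hours and a lot of disk space.
bash -n run_alignment.sh && echo "syntax OK"
```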
The generated jobs can be executed in parallel. Because a job is just a script in a directory, if multiple machines have access to that directory and the data required by the script, then jobs can be executed on different machines. This can speed up the whole process quite a bit.
For more information on the script and additional fine-tuning switches, first take a look at bash master_script.sh -h, and also at the source: some variables can be configured there that don't have a switch yet.
Building the C# projects
To build the C# projects on Linux, install the following packages:

```
apt-get install mono-devel libmono-system-core4.0-cil libmono-system-web-extensions4.0-cil
apt-get install mono-complete
```

Then go to the respective directory, run ./make.sh, and then ./<Project>.exe or mono <Project>.exe. Like jar files, the compiled .exe will run on Windows and Linux, regardless of where it was compiled.
Wiki Downloader
Automatically downloads all (or selected) Spoken Wikipedia articles.
Usage: mkdir english && ./WikiDownloader.exe english.json [Target_BaseDir] [Article_Titles ...]
One subdirectory will be created for each spoken article. Subdirectory names are first UTF-8 encoded and then URL-encoded (just paste one into the URL bar of your browser to decode it :). Spaces in article names are replaced with underscores, as in Wikipedia itself.
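For example, a directory name can be decoded back into an article title like this (a bash sketch; the article name is made up):

```shell
# Decode a spoken-article directory name: undo the URL-encoding,
# then turn the underscores back into spaces.
name='Albert_Einstein_%28physicist%29'     # made-up example directory name
decoded=$(printf '%b' "${name//%/\\x}")    # %XX -> \xXX, interpreted by printf
title=${decoded//_/ }
echo "$title"                              # -> Albert Einstein (physicist)
```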
You may add article titles to limit the download to just these articles instead of all articles that are contained in the category (useful for debugging).
Known bug: If downloading does not work because of a TLS error, you are probably missing certificates for mono. To fix this, you can import all certificates used by Mozilla with mozroots --import --sync.
How it works:
First, the category noted in config.json is queried for a list of articles. Then meta-information is queried for every article. If an article was updated since the last run, its content is downloaded and the page is parsed for templates containing the audio file name. The language-dependent templating code is contained in C# classes (AudioDownloader_XY.cs).
Then meta-information for the audio file is queried, including the actual download URL and the location of the info page (Wikipedia or Wikimedia Commons). The info page of the audio file is parsed for templates containing dates, speaker information, and a link to the oldid of the article. If the oldid is not found, it is guessed from the date taken from the templates.
Then the audio file and the corresponding article version are downloaded and saved to disk. Gathered meta-information is stored in info.json.
KNOWN BUG: Because of shortcomings in mono's HTTPClient implementation, handling network errors properly is impossible without implementing your own HTTP client. This means that if you run on Linux and the program encounters even a minor network error, it will fail. This leaves you with two options: either use Windows, or run on a server with a gigabit connection. If the program crashes, you can restart it and it will automatically continue where it left off. It is recommended, however, to delete the last created folder before restarting, as it may be incomplete.
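A small wrapper along these lines can automate the restart. This is a sketch: the downloader arguments and target directory follow the usage line above and may need adjusting, and the deletion of the newest folder matches the recommendation to remove the last created (possibly incomplete) one.

```shell
# Sketch: restart the downloader after network crashes, deleting the most
# recently created article folder first (it may be half-downloaded).
cat > retry_download.sh <<'EOF'
#!/bin/bash
until mono WikiDownloader.exe english.json english; do
  last=$(ls -t english | head -n 1)        # newest subdirectory = last created
  [ -n "$last" ] && rm -rf "english/$last"
  sleep 10
done
EOF
bash -n retry_download.sh && echo "syntax OK"
```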
Text Extractor
This program converts Wikipedia articles to something closer to the actual spoken text.
Usage: java -jar TextExtractor.jar <directory> <language>
Where <directory> is the output of Wiki Downloader:
- info.json is searched for article.title to retrieve the article title; the fallback is the directory name.
- wiki.html will be parsed; the resulting text will be written to audio.txt.
The <language> parameter must be either 'en' or 'de'.
Statistics
Want to know how well your alignment went? Just cd to the 'gen' directory of prosub and run ./Statistics.exe
Output format is:
aligned% last_word_position% ./directory
If you want a nice graph:
```
cd gen
./Statistics.exe > ../summary.txt
cd ..
grep '^[0-9]' summary.txt | sort -h > summary-sorted.txt
echo "plot 'summary-sorted.txt' u 0:1 w lines" | gnuplot
```
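To see what the sorted summary looks like, the grep/sort step can be tried on fabricated sample output (the percentages and directory names here are made up):

```shell
# Run the grep/sort step of the summary pipeline on made-up statistics output.
# Lines not starting with a digit (e.g. error messages) are filtered out.
cat > summary.txt <<'EOF'
87.5 99.1 ./articles/Foo
12.3 40.0 ./articles/Bar
could not align: ./articles/Baz
55.0 80.2 ./articles/Qux
EOF
grep '^[0-9]' summary.txt | sort -h > summary-sorted.txt
cat summary-sorted.txt
```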
Some other statistics are available via speakerAnalysis.pl (read the source to find out how it works).
Phone-level alignments are done with the Münchener Annotations- und Segmentierungstool (MAUS).
Install (some defined version of) MAUS with install_maus.sh, which currently resides in Bootstrapping/. You also need to run install_sequitur.sh to install Sequitur G2P.
Run the following steps:
- create the necessary text snippets and job scripts with java -jar code/Aligner/target/Aligner.jar mausmap articles/ --all --g2p_model YOURG2PMODEL (additional parameters can be set)
- run the jobs with master_script.sh -E -p X (where X is your number of processors)
- merge back the results: java -jar code/Aligner/target/Aligner.jar mausmerge articles/ --all
The above should work for German. Wrangling MAUS to align English is a little harder. Try the following:
- build a Sequitur model from CMUdict. Don't worry for the moment that CMUdict's entries are all upper-case and that MAUS uses a different phoneset from CMUdict; this is converted by java -jar code/Aligner/target/Aligner.jar mausmap articles/ --all --maus "maus/maus LANGUAGE=eng-US" --g2p_wrapper sequitur/caselessg2p.pl --g2p_model YOURG2PMODEL
- run and merge as above for German
Generating Acoustic models based on SWC
We provide conversions to two data formats: One for Sphinx (which was used for the alignment) and one for Kaldi.
For Kaldi, run:

```
java -jar code/Aligner/target/Aligner.jar extractsnippets kaldi /dev/null /path/to/articles/ /output/directory/
```
This will generate four files for Kaldi, among them wav.scp. Attention: if the files already exist, the data will be appended!
wav.scp contains pseudo-directories ("/path/to/articles/"), which need to be corrected to the real directory:

```
sed -i 's|/path/to/articles/|/correct/path/to/articles/|g' wav.scp
```

(This is because we want to distribute the generated data and don't know where it will reside on the user's computer.)
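For illustration, here is the fixup on a tiny fabricated wav.scp (the entry format and target path are simplified assumptions, not the real generated content):

```shell
# Rewrite the pseudo-directories in a made-up two-entry wav.scp.
cat > wav.scp <<'EOF'
utt001 /path/to/articles/Foo/audio.wav
utt002 /path/to/articles/Bar/audio.wav
EOF
# /data/swc/articles/ stands in for wherever the articles really reside.
sed 's|/path/to/articles/|/data/swc/articles/|g' wav.scp
```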
An id2spk file is also generated, which maps the speaker ids to the real speaker names. It is not needed by Kaldi, but nice to have.
Note that the Kaldi snippet extraction works for all articles at once (you have to provide the path to the articles dir), whereas the snippet extraction for Sphinx works on a per-article basis.
Dependencies:
- Java, Mono, Perl, Python
- sox (for audio file conversion and cutting)
- trang (for RELAX NG checking)