
Minimalistic Speech Aligner using Sphinx4 and VoxForge Models

This is an end-to-end example of taking an audio book and aligning it to its source material.

I did it to obtain a speech sample dictionary for an art project back in 2010. Over the years a number of people have expressed interest in this code, so I am making it available under the same terms as the Sphinx4 library (on which it is heavily based).

To try it

You will need Java (at least Java 6) and Ant. An example wave file and its tokenized text are included with the project.

Just do:

ant build

then

ant Aligner

The aligned output will be printed to the screen:

Aligner:
[java] <sil>(0.07,0.33) the(0.33,1.29) golden(1.29,2.86) snare(2.86,4.01) by(4.01,4.15) james(4.15,4.59) oliver(4.59,5.82)
[java] <sil>(5.82,6.74) chapter(6.74,7.81) one(7.81,8.3)
[java] <sil>(8.3,8.82) all(8.82,9.18) other(9.18,9.4) things(9.4,9.74) a(9.74,9.8) creature(9.8,10.48) of(10.48,10.59) environment(10.59,11.49)
[java] <sil>(11.49,12.31) heart(12.31,13.86) of(13.86,14.94) a(14.94,15.28) devil(15.28,16.71) in(16.71,17.36)
[java] <sil>(17.36,18.33) the(18.33,18.66) other(18.66,18.94) man(18.94,19.19) bram(19.19,21.11) himself(21.11,22.95) should(22.95,23.46) not(23.46,23.74) be(23.74,23.86) blamed(23.86,24.42)

You will most probably want to save these results to a separate file or process them further. Dig into the code (src/ folder): it is only one Java file (plus an extra, unimportant Java file). Most of the magic is in the provided config.xml file.
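If you want to post-process the alignment, the word(start,end) tokens shown above are easy to parse. The following is only a minimal sketch and not part of the project: the class name AlignedOutputParser and its methods are hypothetical, and it assumes each token has exactly the form printed above, with times in seconds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not in the project): parses "word(start,end)" tokens
// as they appear in the aligner output shown in this README.
class AlignedOutputParser {

    /** One aligned token: the word (or <sil>) with its start and end times. */
    static final class AlignedWord {
        final String word;
        final double start;
        final double end;

        AlignedWord(String word, double start, double end) {
            this.word = word;
            this.start = start;
            this.end = end;
        }
    }

    // Matches tokens such as "golden(1.29,2.86)" or "<sil>(0.07,0.33)".
    // Prefixes without parentheses, such as "[java]", are skipped.
    private static final Pattern TOKEN =
            Pattern.compile("(\\S+?)\\((\\d+(?:\\.\\d+)?),(\\d+(?:\\.\\d+)?)\\)");

    /** Extracts every aligned token found in one line of output. */
    static List<AlignedWord> parse(String line) {
        List<AlignedWord> words = new ArrayList<AlignedWord>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            words.add(new AlignedWord(
                    m.group(1),
                    Double.parseDouble(m.group(2)),
                    Double.parseDouble(m.group(3))));
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "[java] <sil>(5.82,6.74) chapter(6.74,7.81) one(7.81,8.3)";
        for (AlignedWord w : parse(line)) {
            // Skip silences and print one tab-separated row per spoken word.
            if (!"<sil>".equals(w.word)) {
                System.out.println(String.format(Locale.ROOT,
                        "%s\t%.2f\t%.2f", w.word, w.start, w.end));
            }
        }
    }
}
```

Each row printed this way holds a word and its start and end times, which is enough to cut the matching segment out of the wave file with the audio tool of your choice.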

Running it on your own text

Using this on your own text requires a preliminary step: transforming the text into "spoken text" (compare "at 5pm" with "at five p m"). For that I used the Festival text-to-speech system. The code is in the scripts/ folder, but you'll need a Linux machine with Festival installed.

To run it, change to the scripts folder and then do:

cat ../README.md | perl tokens_to_words.pl

You will get an output like this:

minimalistic speech aligner using sphinx four and voxforge models line of equals this is an end to end example of taking an audio book and aligning it to its source material . i did it to obtain a speech sample dictionary for an art project back in twenty ten . over the years a number of people expressed interest in this code , i am making it available under the same terms of the sphinx four library ( from which is it heavily based nevertheless ) . to try it line of hyphens you will need java ( at least java six ) and ant . an example wave file and tokenized text is included with the project . just do : ant build then ant aligner the aligned output will be print to the screen : aligner : [ java ] < sil > zero dot zero seven zero dot thirty three ) the zero dot thirty three one dot twenty nine ) golden one dot twenty nine two dot eighty six ) snare two dot eighty six four dot zero one ) by f
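To give a feel for what "spoken text" means, here is a toy sketch in Java. It is not what the scripts do: Festival handles numbers, abbreviations and punctuation properly, whereas this hypothetical SpokenTextSketch class only lowercases the text, drops punctuation and reads every digit on its own (so "2010" would become "two zero one zero" rather than "twenty ten").

```java
import java.util.Locale;

// Toy illustration only (hypothetical, not part of the project and not a
// replacement for Festival): lowercases text and spells out digits one by one.
class SpokenTextSketch {

    private static final String[] DIGITS = {
        "zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"
    };

    static String toSpoken(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c >= '0' && c <= '9') {
                // Each digit becomes its own word, separated from neighbours.
                out.append(' ').append(DIGITS[c - '0']).append(' ');
            } else if (Character.isLetter(c)) {
                out.append(c);
            } else {
                // Punctuation and whitespace both turn into a word break.
                out.append(' ');
            }
        }
        // Collapse the runs of spaces introduced above.
        return out.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(toSpoken("Sphinx4 and VoxForge Models"));
    }
}
```

For real material, stick with the Festival-based script: an aligner can only match words that are written the way the reader actually pronounces them.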

My class on building synthetic voices contains some pointers on how to build Festival outside of Linux:

https://github.com/DrDub/building_synthetic_voices_workshop