1. Jörg Tiedemann
  2. subalign


subalign /

Filename Size Date modified Message
268 B
634 B
1.1 KB
24.2 KB
5.3 KB
Some scripts for processing movie subtitles

srt2xml    .... convert subtitles in srt-format to simple OPUS-style XML 
                format (does sentence splitting and tokenization)
                (uses nonbreaking_prefix.* files for tokenization
                 which are just copies from the files distributed with 
                 the Europarl corpus version 3)

		Note that subtitle files are usually DOS files and 
		srt2xml expects UNIX-style text files! 
		--> use dos2unix before piping the text into srt2xml.pl

srtalign... ... align srt-files which have been converted to XML using 
		srt2xml (requires time-stamps!)
		For more information on using this script and its options:
		Look at the header of the script!

share/dic ..... This directory contains word alignment dictionaries
		obtained by aligning the OpenSubtitles corpus from OPUS
		These dictionaries can be used to improve sentence 
		alignment by synchronizing time stamps with the help of
		anchor points found by matching dictionary entries with
		word pairs in the subtitle pair