Pushed to
tiedemann/subalign
0429c6e
another fix for malformed XML (non-unicode characters)
Some scripts for processing movie subtitles

srt2xml ..... convert subtitles in srt-format to simple OPUS-style XML format
              (does sentence splitting and tokenization)
              (uses nonbreaking_prefix.* files for tokenization, which are
              just copies of the files distributed with the Europarl corpus
              version 3)

              Note that subtitle files are usually DOS files and srt2xml
              expects UNIX-style text files!
              --> use dos2unix before piping the text into srt2xml.pl

srtalign .... align srt-files which have been converted to XML using srt2xml
              (requires time-stamps!)
              For more information on using this script and its options,
              look at the header of the script.

share/dic ... This directory contains word alignment dictionaries obtained by
              aligning the OpenSubtitles corpus from OPUS.
              These dictionaries can be used to improve sentence alignment by
              synchronizing time stamps with the help of anchor points found
              by matching dictionary entries with word pairs in the subtitle
              pair.
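
The idea of synchronizing time stamps with anchor points can be illustrated
with a small sketch. This is not the code used by srtalign. It assumes that
the two subtitle files differ only by a constant speed ratio and a constant
offset, and that two anchor points (pairs of matching source/target times,
for example found through dictionary matches) are already known. All function
names below are hypothetical and chosen for illustration only.

```python
def srt_time_to_seconds(stamp: str) -> float:
    """Convert an SRT time stamp such as '00:01:02,500' to seconds."""
    hours, minutes, rest = stamp.strip().split(":")
    seconds, millis = rest.split(",")
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000.0


def fit_time_mapping(anchor_a, anchor_b):
    """Return (ratio, offset) such that target = ratio * source + offset.

    Each anchor is a (source_seconds, target_seconds) pair.
    Assumption: the relation between the two time lines is linear.
    """
    (src1, trg1), (src2, trg2) = anchor_a, anchor_b
    if src1 == src2:
        raise ValueError("anchor points must have different source times")
    ratio = (trg1 - trg2) / (src1 - src2)
    offset = trg2 - ratio * src2
    return ratio, offset


def synchronize(times, ratio, offset):
    """Map source time stamps (in seconds) onto the target time line."""
    return [ratio * t + offset for t in times]


if __name__ == "__main__":
    # Two made-up anchor points: (source time, target time) in seconds.
    first = (srt_time_to_seconds("00:00:10,000"), srt_time_to_seconds("00:00:12,000"))
    last = (srt_time_to_seconds("00:01:50,000"), srt_time_to_seconds("00:01:56,000"))
    ratio, offset = fit_time_mapping(first, last)
    print(synchronize([10.0, 60.0, 110.0], ratio, offset))
```

Anchor points that lie far apart (one near the start and one near the end of
the film) should make the estimated ratio less sensitive to small timing
errors in either anchor.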