Scripts for Corpora stuff
Some years ago a friend was doing a PhD in Linguistics/Corpora and needed help processing some data. I don't remember exactly what I did, but I still have the result: these 3 scripts. If they are useful to someone, I don't care about licenses/copyright, do whatever you want with them.
As far as I remember, my friend had interview transcriptions made with the software EXMARaLDA, which saves in .exb format. For a 2nd phase, the alignment of the texts, the plan was to use the software YouAlign, but it was not able to import .exb. One option was to import HTML instead: one of the scripts converts from .exb to HTML, if I remember correctly.
But YouAlign limited the number of HTML files it accepted at a time to a very small number, and there were hundreds of files. So I opted to join all the HTML files into one big file, divided with keywords, so that we were able to process everything in one batch: one script converts many HTML files into a single one (using a specific structure needed by the software).
The 3rd script manipulates the files that YouAlign produces (TMX format), but I don't remember what it does. I would have to peek inside the code to find out, and I have no time for that; this is in the past for me.
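For anyone who only wants the idea behind the joining step, here is a minimal sketch in Python. It is not the script from this repository: the function names `merge_html_files` and `extract_body` and the `@@FILE:name@@` keyword format are made up for illustration, and the real structure that YouAlign needed may well be different.

```python
import re
from pathlib import Path

# Hypothetical keyword format; the real scripts may use something else.
MARKER_TEMPLATE = "@@FILE:{name}@@"

BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)


def extract_body(html_text):
    """Return the content between <body> and </body>, or the whole text if there is no body tag."""
    match = BODY_RE.search(html_text)
    return match.group(1).strip() if match else html_text.strip()


def merge_html_files(paths, marker_template=MARKER_TEMPLATE):
    """Join several HTML files into one HTML document.

    Each source file is preceded by a paragraph holding a keyword built
    from its file name, so the pieces can still be told apart afterwards.
    """
    parts = ["<html><body>"]
    for path in paths:
        path = Path(path)
        text = path.read_text(encoding="utf-8")
        parts.append("<p>" + marker_template.format(name=path.stem) + "</p>")
        parts.append(extract_body(text))
    parts.append("</body></html>")
    return "\n".join(parts)
```

It could be used like `Path("all.html").write_text(merge_html_files(sorted(Path("html").glob("*.html"))), encoding="utf-8")`, where the `html` folder and the `all.html` output name are also just examples.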
I share this in case it is useful to someone working with Linguistics/Corpora.
My Homepage: www.paulojorgepm.net