
Pull requests

#1 Merged

Added Text-based version of tokenizer

creswick

This patch adds an NLP.TokenizeText module with a Data.Text version of the tokenization logic. I've also moved the String-based version into NLP.TokenizeStr for symmetry. The NLP.Tokenize module now just re-exports everything in NLP.TokenizeStr, so the API has only been expanded (meaning only a minor version-number bump should be needed according to the PVP; I did not make that change to the cabal file).
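
For illustration, here is a minimal usage sketch of the resulting module layout. The module names come from this patch's description; the tokenize name and the exact signatures are my assumptions based on the existing NLP.Tokenize API, not the patch's literal contents:

    -- Hedged usage sketch: the two tokenizers side by side.
    -- Assumes tokenize :: String -> [String] in NLP.TokenizeStr and a
    -- mirrored tokenize :: Text -> [Text] in NLP.TokenizeText.
    import qualified Data.Text as T
    import qualified NLP.TokenizeStr  as TS  -- original String-based logic, moved here
    import qualified NLP.TokenizeText as TT  -- new Data.Text-based logic

    main :: IO ()
    main = do
      print (TS.tokenize "Don't tokenize me, bro!")           -- [String]
      print (TT.tokenize (T.pack "Don't tokenize me, bro!"))  -- [T.Text]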

I also added a criterion-based benchmarking suite that compares the native Text and String tokenizers, Text wrappers around the String tokenizer, and String wrappers around the Text tokenizer. You can run it with: bench <file full of text>. I've included below the results of a run across ~8 years of Linux mailing-list messages, about 120 MB of text; that is roughly the smallest input I could use while avoiding performance fluctuations on my laptop, and the run took about 2-3 hours.
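
To make the comparison concrete, here is a hedged sketch of what a criterion driver along these lines could look like; the actual bench executable in the patch may be organized differently, and the tokenize functions in NLP.TokenizeStr/NLP.TokenizeText are assumed as above. The two wrapper benchmarks pack/unpack between String and Text to measure the conversion overhead directly:

    -- Hypothetical benchmark driver, not the patch's literal bench code.
    import Criterion.Main (bench, bgroup, defaultMain, nf)
    import qualified Data.Text as T
    import qualified Data.Text.IO as TIO
    import System.Environment (getArgs, withArgs)
    import qualified NLP.TokenizeStr  as TS
    import qualified NLP.TokenizeText as TT

    main :: IO ()
    main = do
      (path:rest) <- getArgs       -- invoked as: bench <file full of text>
      txt <- TIO.readFile path     -- the corpus, as strict Text
      let str = T.unpack txt       -- the same corpus, as String
      withArgs rest $ defaultMain  -- pass any remaining flags on to criterion
        [ bgroup "tokenizing"
            [ bench "Native String Tokenizer" $ nf TS.tokenize str
            , bench "Native Text Tokenizer"   $ nf TT.tokenize txt
            , bench "Text->Text based on String Tokenizer" $
                nf (map T.pack . TS.tokenize . T.unpack) txt
            , bench "String->String based on Text Tokenizer" $
                nf (map T.unpack . TT.tokenize . T.pack) str
            ]
        ]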

The results indicate that packing/unpacking Text is really not worth it, so having two parallel implementations does carry a notable performance benefit: Text performs roughly 5-10% faster than the String-based implementation. (In other runs I've done using the Chatter benchmark suite, Text reported improvements over String closer to 10% or more. I suspect the source text accounts for the slightly different results: this run used raw mailman logs, while the Chatter suite extracted the message text and ran on that, so the input has different characteristics. For example, message headers are tokenized more aggressively than message bodies, so the tokenization logic itself may have played a larger part in those runs.)

Here are the benchmark results:

benchmarking tokenizing/Native String Tokenizer
collecting 100 samples, 1 iterations each, in estimated 2212.281 s
mean: 21.67443 s, lb 21.63361 s, ub 21.72715 s, ci 0.950
std dev: 235.9879 ms, lb 193.7825 ms, ub 294.8072 ms, ci 0.950

benchmarking tokenizing/Native Text Tokenizer
collecting 100 samples, 1 iterations each, in estimated 2048.695 s
mean: 20.55730 s, lb 20.50936 s, ub 20.62033 s, ci 0.950
std dev: 279.5278 ms, lb 227.3801 ms, ub 368.8535 ms, ci 0.950
found 6 outliers among 100 samples (6.0%)
  5 (5.0%) high mild
  1 (1.0%) high severe
variance introduced by outliers: 6.586%
variance is slightly inflated by outliers

benchmarking tokenizing/Text->Text based on String Tokenizer
collecting 100 samples, 1 iterations each, in estimated 2429.706 s
mean: 24.35726 s, lb 24.32054 s, ub 24.43227 s, ci 0.950
std dev: 258.0486 ms, lb 156.4269 ms, ub 465.4122 ms, ci 0.950

benchmarking tokenizing/String->String based on Text Tokenizer
collecting 100 samples, 1 iterations each, in estimated 3136.921 s
mean: 28.91592 s, lb 28.87561 s, ub 28.98426 s, ci 0.950
std dev: 263.5932 ms, lb 176.4128 ms, ub 402.4597 ms, ci 0.950

