Cleanup of linguistic sensitive regexp/methods

Issue #305 open
Former user created an issue

Original issue 305 created by @ysavourel on 2013-01-03T18:26:35.000Z:

Because we process a lot of text rather than just moving it around (as an email server does), we use regular expressions and Character methods in many places.

Sometimes isSpace or \s is the right thing to use (for instance, when parsing some file formats), but sometimes we might use them for linguistically aware operations (segmentation, word count), where we really mean "word separators", not "spaces".
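
To make the "word separators" versus "spaces" distinction concrete, here is a minimal sketch in Java. It assumes the JDK's java.text.BreakIterator as a stand-in for whatever segmenter the project actually uses; the class and method names (WordCountSketch, countByWhitespace, countByWordBoundaries) are hypothetical and only for illustration.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical illustration class, not part of any existing library.
class WordCountSketch {

    // Naive count: splits on \s, which in Java matches only
    // ASCII whitespace unless UNICODE_CHARACTER_CLASS is set.
    static int countByWhitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    // Boundary-based count: asks BreakIterator for word boundaries and
    // counts only the segments that contain a letter or digit, so that
    // runs of spaces and punctuation are not counted as words.
    static int countByWordBoundaries(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        int count = 0;
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            if (containsLetterOrDigit(text, start, end)) {
                count++;
            }
        }
        return count;
    }

    // True if any code point in [start, end) is a letter or digit.
    private static boolean containsLetterOrDigit(String text, int start, int end) {
        for (int i = start; i < end; ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // U+00A0 (no-break space) separates "one" and "two" here.
        String sample = "one\u00A0two three";
        System.out.println(countByWhitespace(sample));
        System.out.println(countByWordBoundaries(sample, Locale.ROOT));
    }
}
```

With the sample above, the whitespace split sees two tokens because the default \s does not match U+00A0, while the boundary-based count sees three words. Which of the two is "right" depends on whether the call site is parsing a format or doing a linguistic operation.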

We might also want to be stricter at times when parsing.
For instance, when we parse a LocaleId, do we really want \p{Alnum}, or really [A-Za-z0-9]? When it is Unicode-aware (as under Java 7's UNICODE_CHARACTER_CLASS flag), \p{Alnum} includes things like (real) Arabic digits, Greek and Cyrillic letters, and whatnot.
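
The difference can be checked with a small Java sketch. In java.util.regex, \p{Alnum} is US-ASCII only by default and becomes Unicode-aware under Pattern.UNICODE_CHARACTER_CLASS (added in Java 7). The class name LocaleSubtagSketch and the 1-to-8 character length limit are assumptions made for illustration; this is not the actual LocaleId parser.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class, not the real LocaleId code.
class LocaleSubtagSketch {

    // Strict: ASCII letters and digits only. The {1,8} length is an
    // assumption for this sketch, not a rule taken from the project.
    static final Pattern ASCII_SUBTAG = Pattern.compile("[A-Za-z0-9]{1,8}");

    // \p{Alnum} without flags: US-ASCII only in java.util.regex.
    static final Pattern DEFAULT_SUBTAG = Pattern.compile("\\p{Alnum}{1,8}");

    // \p{Alnum} in Unicode mode: any alphabetic character or digit.
    static final Pattern UNICODE_SUBTAG =
            Pattern.compile("\\p{Alnum}{1,8}", Pattern.UNICODE_CHARACTER_CLASS);

    static void report(String label, String candidate) {
        System.out.println(label
                + " ascii=" + ASCII_SUBTAG.matcher(candidate).matches()
                + " default=" + DEFAULT_SUBTAG.matcher(candidate).matches()
                + " unicode=" + UNICODE_SUBTAG.matcher(candidate).matches());
    }

    public static void main(String[] args) {
        report("latin", "en");
        // Arabic-Indic digits one and two.
        report("arabic-indic digits", "\u0661\u0662");
        // Greek small letters epsilon and lambda.
        report("greek", "\u03b5\u03bb");
    }
}
```

Only the Unicode-mode pattern accepts the Arabic-Indic digits and the Greek letters, so an explicit [A-Za-z0-9] class is the safer choice wherever the input is supposed to be an ASCII identifier, regardless of which flags are in effect.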

The behavior of \p{...} in regular expressions also depends on the Unicode version, so its meaning will change over time. Hopefully it gets better, or adds support for new scripts :-) Are we ready to accept that?

Taking a quick look at RegExp.txt might help us detect some tricky areas before they are reported as bugs :-)

==

This was triggered by changes in JDK 7, but it is not really about that. It might be a "nice to do".

Comments (3)

  1. Jim Hargrave (OLD)
    • changed status to open

    This would be a nice one to implement. See the new lib-tokenizer as one example of how to be more internationalized.
