- changed title to Gender game: Highlighting regex not working as expected
Gender game: Highlighting regex not working as expected
Despite the code (é|foi) (um|uma|o|a)
in the regex to highlight Portuguese pronouns in the gender game, the phrases "é um" (e.g. in Q16498338, pt:Bino Farias) or "é o" aren't recognized. Instances of "foi um", and "foi o" are correctly recognized, however. Could this be an encoding issue?
Comments (24)
-
reporter -
Looks like there is some bad escaping based on this quick test (using JShint)
-
Strike that since changing é to e in text and regexp gives correct match
-
reporter Oh, I think I know what's going on here. Accented characters don't count (as they should) as word characters, so the word boundary \b followed by é doesn't match. See this SO answer. I guess replacing the first
\b
with a space should provide a sensible workaround. -
reporter - changed status to resolved
Fixed in 3187ae1.
-
Hum I'm thinking that the regexp should maybe be
[ \(](ele|ela|(é|foi)( (um|uma|o|a)))\b
in order to not mark single "foi" and allow preceding parenthesis. You can try the following sting with the different regexps:
Foi o rei do lugare com o nome Foi Grande (é uma mentira).uma mentira).
-
reporter The single "foi" is there on purpose, since I came across many articles that omit the pronoun in the phrase describing who the person is. So highlighting "foi" or "é" is useful to quickly locate the noun that describes the person (usually an occupation), which most of the time provides information on their gender.
As for parenthesis... is there really a need for it? I've looked at many hundreds of article intros in the past few days and haven't come across a single instance of "foi" or "é" inside a parenthesis...
-
Ah. In that case never mind. I was mainly suggesting thins which might bring the RegExp closer to the original behaviour. Does the Spanish RegExp (and others using characters outside of [a-zA-Z]) work correctly?
-
reporter I haven't tried any of the other languages, but I'm willing to bet "él" isn't being highlighted for the very same reason (
\b
followed byé
). As soon as a solution is decided for#101, the Spanish regex should be changed too. -
Just tested polish and found that it does not highlight był/była. I would guess that highlighting breaks for any string containing characters outside of [a-zA-Z] which today is 5/11 of the languages with gender markers. Other than replacing
\b
on all of these by spaces or[ \(]
(and fixing the related space eating bug in issue#101) is there a good workaround?The no_date games seems to handle accented characters in the month name but my JS/RegExp is not good enough to see why it works there but not here.
Pinging in @magnusmanske since this affects a largeish part of the gender game and the no_date implementation might hold the key to fixing it.
-
reporter "był" not working makes sense because the
ł
is next to a\b
, but "była" should work. Did you come across an instance of była that didn't get highlighted? -
What happens is that any character outside of [a-zA-Z] is treated like a word break. była is therefore treated like był-a. Similarly żoną becomes ż-on-ą (see below)
-
reporter ok, so there are a couple issues at play here:
- The main one is
\b
not working properly in non-ASCII sets. I suppose it would be safer to use spaces (and include these around the span in the substitution strings, so that the space chopping issue described in#101doesn't occur). This would also prevent żoną from having its inner "on" highlighted, since "żo" (non-word char + word char) and "ną" (word char + non-word char) are matched by\b
but not by "[space]o" or "a[space]". - In this particular case, a stand-alone "był" wouldn't get highlighted because the transition from
ł
to the space next to is is not treated as a word boundary (becauseł
is not considered a word character). But the "był" in "była" does work, because the transition fromł
toa
is considered a word boundary, but in the opposite sense as what's intended here: as a word beginning (but regex doesn't differentiate word beginnings from word endings) - Finally, as I mentioned above, "była" should be recognized, because the a is followed by a space, so there's a word boundary there as well, which matches the \b. The reason this isn't happening is that regex tries the matches in order; if we simply swap był and była in the regex, or better yet, use the more compact form
była?
, where the a is optional, then była is correctly matched. But that's just a single word fixed; in general, we should do something that takes care of the faulty\b
.
Here's my suggestion: http://www.regexr.com/38vos
- The main one is
-
reporter Just submitted 3676b1d as the first step towards fixing this: including "before" and "after" groups around the match itself, and put them around the span in the replacement string. I haven't been able to test it yet because apparently the code doesn't go live immediately (I tried clearing my cache, to no avail). As soon as this is confirmed as a fix to
#101, I'll implement the suggestion I linked in my comment above. -
reporter - changed status to open
Still not working properly, but a fix is in the works.
-
reporter Suggestion for suffixes: http://www.regexr.com/38vp8
-
That would be a possible fix if we can also resolve issue
#101. (Current solution did not work as expected) -
reporter It actually works after e7d7b17 but the code hasn't been updated in the live game yet. Not sure how to force a refresh :/ (Clearing the browser cache doesn't work, it seems to be cached server-level.) Any ideas?
-
reporter Update: it's now showing the updated code for me, and indeed it works correctly :) I'll test it for a bit and if everything looks fine, I'll move to the next step and implement the changes per the links above.
-
Indeed it now looks good =)
-
reporter Ok, seems to look fine with Portuguese. I now swapped all regexes to use
\s
instead of\b
(3f1773a). As soon as this goes live, highlighting should work with words containing non-ASCII cgaracters (for instance, at the moment highlighting doesn't work at all in Russian and other languages with non-Latin charsets). If you know any such languages, please confirm that the highligting is working correctly (making sure that the main.js loaded by the browser is actually the latest one as can be seen here in the repo), so we can close this issue :) -
Works fine when I check ru and zh. This one can be closed =)
-
reporter Oh yeah, I forgot to update this issue. Besides the fix of using
\s
rather than\b
, I've been checking how the highlighting fares in the languages I understand reasonably well (pt, gl, es, fr, it, ro) and submitting several code changes to fix / improve small details. It looks pretty good to me, too. -
reporter - changed status to resolved
Fixed in 3f1773a.
- Log in to comment