Gender game: Highlighting regex not working as expected

Issue #79 resolved

Waldir Pimenta created an issue 2014-06-04

Despite the code (é|foi) (um|uma|o|a) in the regex to highlight Portuguese pronouns in the gender game, the phrases "é um" (e.g. in Q16498338, pt:Bino Farias) or "é o" aren't recognized. Instances of "foi um", and "foi o" are correctly recognized, however. Could this be an encoding issue?

Comments (24)

Waldir Pimenta reporter
- changed title to Gender game: Highlighting regex not working as expected
- 2014-06-04T17:38:47+00:00
Lokal Profil
Looks like there is some bad escaping based on this quick test (using JShint)
- 2014-06-04T18:40:23+00:00
Lokal Profil
Strike that since changing é to e in text and regexp gives correct match
- 2014-06-04T18:51:20+00:00
Waldir Pimenta reporter
Oh, I think I know what's going on here. Accented characters don't count (as they should) as word characters, so the word boundary \b followed by é doesn't match. See this SO answer. I guess replacing the first \b with a space should provide a sensible workaround.
- 2014-06-04T19:32:10+00:00
Waldir Pimenta reporter
- changed status to resolved
Fixed in 3187ae1.
- 2014-06-05T15:53:20+00:00
Lokal Profil
Hum I'm thinking that the regexp should maybe be
```
[ \(](ele|ela|(é|foi)( (um|uma|o|a)))\b
```
in order to not mark single "foi" and allow preceding parenthesis. You can try the following sting with the different regexps:

Foi o rei do lugare com o nome Foi Grande (é uma mentira).uma mentira).
- 2014-06-05T16:09:48+00:00
Waldir Pimenta reporter
The single "foi" is there on purpose, since I came across many articles that omit the pronoun in the phrase describing who the person is. So highlighting "foi" or "é" is useful to quickly locate the noun that describes the person (usually an occupation), which most of the time provides information on their gender.

As for parenthesis... is there really a need for it? I've looked at many hundreds of article intros in the past few days and haven't come across a single instance of "foi" or "é" inside a parenthesis...
- 2014-06-06T03:12:35+00:00
Lokal Profil
Ah. In that case never mind. I was mainly suggesting thins which might bring the RegExp closer to the original behaviour. Does the Spanish RegExp (and others using characters outside of [a-zA-Z]) work correctly?
- 2014-06-09T07:42:54+00:00
Waldir Pimenta reporter
I haven't tried any of the other languages, but I'm willing to bet "él" isn't being highlighted for the very same reason (\b followed by é). As soon as a solution is decided for ~~#101~~, the Spanish regex should be changed too.
- 2014-06-09T14:22:57+00:00
Lokal Profil
Just tested polish and found that it does not highlight był/była. I would guess that highlighting breaks for any string containing characters outside of [a-zA-Z] which today is 5/11 of the languages with gender markers. Other than replacing \b on all of these by spaces or [ \(] (and fixing the related space eating bug in issue ~~#101~~) is there a good workaround?

The no_date games seems to handle accented characters in the month name but my JS/RegExp is not good enough to see why it works there but not here.

Pinging in @magnusmanske since this affects a largeish part of the gender game and the no_date implementation might hold the key to fixing it.
- 2014-06-09T16:14:02+00:00
Waldir Pimenta reporter
"był" not working makes sense because the ł is next to a \b, but "była" should work. Did you come across an instance of była that didn't get highlighted?
- 2014-06-09T19:36:33+00:00
Lokal Profil
What happens is that any character outside of [a-zA-Z] is treated like a word break. była is therefore treated like był-a. Similarly żoną becomes ż-on-ą (see below)
- 2014-06-10T11:30:47+00:00
Waldir Pimenta reporter
ok, so there are a couple issues at play here:
- The main one is \b not working properly in non-ASCII sets. I suppose it would be safer to use spaces (and include these around the span in the substitution strings, so that the space chopping issue described in ~~#101~~ doesn't occur). This would also prevent żoną from having its inner "on" highlighted, since "żo" (non-word char + word char) and "ną" (word char + non-word char) are matched by \b but not by "[space]o" or "a[space]".
- In this particular case, a stand-alone "był" wouldn't get highlighted because the transition from ł to the space next to is is not treated as a word boundary (because ł is not considered a word character). But the "był" in "była" does work, because the transition from ł to a is considered a word boundary, but in the opposite sense as what's intended here: as a word beginning (but regex doesn't differentiate word beginnings from word endings)
- Finally, as I mentioned above, "była" should be recognized, because the a is followed by a space, so there's a word boundary there as well, which matches the \b. The reason this isn't happening is that regex tries the matches in order; if we simply swap był and była in the regex, or better yet, use the more compact form była?, where the a is optional, then była is correctly matched. But that's just a single word fixed; in general, we should do something that takes care of the faulty \b.
Here's my suggestion: http://www.regexr.com/38vos
- 2014-06-10T13:27:17+00:00
Waldir Pimenta reporter
Just submitted 3676b1d as the first step towards fixing this: including "before" and "after" groups around the match itself, and put them around the span in the replacement string. I haven't been able to test it yet because apparently the code doesn't go live immediately (I tried clearing my cache, to no avail). As soon as this is confirmed as a fix to ~~#101~~, I'll implement the suggestion I linked in my comment above.
- 2014-06-10T13:40:44+00:00
Waldir Pimenta reporter
- changed status to open
Still not working properly, but a fix is in the works.
- 2014-06-10T13:41:24+00:00
Waldir Pimenta reporter
Suggestion for suffixes: http://www.regexr.com/38vp8
- 2014-06-10T13:45:38+00:00
Lokal Profil
That would be a possible fix if we can also resolve issue ~~#101~~. (Current solution did not work as expected)
- 2014-06-11T07:16:38+00:00
Waldir Pimenta reporter
It actually works after e7d7b17 but the code hasn't been updated in the live game yet. Not sure how to force a refresh :/ (Clearing the browser cache doesn't work, it seems to be cached server-level.) Any ideas?
- 2014-06-11T11:20:38+00:00
Waldir Pimenta reporter
Update: it's now showing the updated code for me, and indeed it works correctly :) I'll test it for a bit and if everything looks fine, I'll move to the next step and implement the changes per the links above.
- 2014-06-11T11:29:22+00:00
Lokal Profil
Indeed it now looks good =)
- 2014-06-11T13:12:43+00:00
Waldir Pimenta reporter
Ok, seems to look fine with Portuguese. I now swapped all regexes to use \s instead of \b (3f1773a). As soon as this goes live, highlighting should work with words containing non-ASCII cgaracters (for instance, at the moment highlighting doesn't work at all in Russian and other languages with non-Latin charsets). If you know any such languages, please confirm that the highligting is working correctly (making sure that the main.js loaded by the browser is actually the latest one as can be seen here in the repo), so we can close this issue :)
- 2014-06-11T13:12:54+00:00
Lokal Profil
Works fine when I check ru and zh. This one can be closed =)
- 2014-07-02T18:14:41+00:00
Waldir Pimenta reporter
Oh yeah, I forgot to update this issue. Besides the fix of using \s rather than \b, I've been checking how the highlighting fares in the languages I understand reasonably well (pt, gl, es, fr, it, ro) and submitting several code changes to fix / improve small details. It looks pretty good to me, too.
- 2014-07-02T19:43:06+00:00
Waldir Pimenta reporter
- changed status to resolved
Fixed in 3f1773a.
- 2014-07-02T19:44:08+00:00
Log in to comment

Assignee: –

Type: bug

Priority: minor

Status: resolved

Votes: 0

Watchers: 2