Gender game: Highlighting regex not working as expected

Issue #79 resolved
Waldir Pimenta created an issue

Despite the code (é|foi) (um|uma|o|a) in the regex to highlight Portuguese pronouns in the gender game, the phrases "é um" (e.g. in Q16498338, pt:Bino Farias) or "é o" aren't recognized. Instances of "foi um", and "foi o" are correctly recognized, however. Could this be an encoding issue?

Comments (24)

  1. Waldir Pimenta reporter

    Oh, I think I know what's going on here. Accented characters don't count (as they should) as word characters, so the word boundary \b followed by é doesn't match. See this SO answer. I guess replacing the first \b with a space should provide a sensible workaround.

  2. Lokal Profil

    Hum I'm thinking that the regexp should maybe be

    [ \(](ele|ela|(é|foi)( (um|uma|o|a)))\b
    

    in order to not mark single "foi" and allow preceding parenthesis. You can try the following sting with the different regexps:

    Foi o rei do lugare com o nome Foi Grande (é uma mentira).uma mentira).

  3. Waldir Pimenta reporter

    The single "foi" is there on purpose, since I came across many articles that omit the pronoun in the phrase describing who the person is. So highlighting "foi" or "é" is useful to quickly locate the noun that describes the person (usually an occupation), which most of the time provides information on their gender.

    As for parenthesis... is there really a need for it? I've looked at many hundreds of article intros in the past few days and haven't come across a single instance of "foi" or "é" inside a parenthesis...

  4. Lokal Profil

    Ah. In that case never mind. I was mainly suggesting thins which might bring the RegExp closer to the original behaviour. Does the Spanish RegExp (and others using characters outside of [a-zA-Z]) work correctly?

  5. Waldir Pimenta reporter

    I haven't tried any of the other languages, but I'm willing to bet "él" isn't being highlighted for the very same reason (\b followed by é). As soon as a solution is decided for #101, the Spanish regex should be changed too.

  6. Lokal Profil

    Just tested polish and found that it does not highlight był/była. I would guess that highlighting breaks for any string containing characters outside of [a-zA-Z] which today is 5/11 of the languages with gender markers. Other than replacing \b on all of these by spaces or [ \(] (and fixing the related space eating bug in issue #101) is there a good workaround?

    The no_date games seems to handle accented characters in the month name but my JS/RegExp is not good enough to see why it works there but not here.

    Pinging in @magnusmanske since this affects a largeish part of the gender game and the no_date implementation might hold the key to fixing it.

    Screenshot from 2014-06-09 18:00:25.png

  7. Waldir Pimenta reporter

    "był" not working makes sense because the ł is next to a \b, but "była" should work. Did you come across an instance of była that didn't get highlighted?

  8. Lokal Profil

    What happens is that any character outside of [a-zA-Z] is treated like a word break. była is therefore treated like był-a. Similarly żoną becomes ż-on-ą (see below)

    Screenshot from 2014-06-10 13:26:38.png

  9. Waldir Pimenta reporter

    ok, so there are a couple issues at play here:

    • The main one is \b not working properly in non-ASCII sets. I suppose it would be safer to use spaces (and include these around the span in the substitution strings, so that the space chopping issue described in #101 doesn't occur). This would also prevent żoną from having its inner "on" highlighted, since "żo" (non-word char + word char) and "ną" (word char + non-word char) are matched by \b but not by "[space]o" or "a[space]".
    • In this particular case, a stand-alone "był" wouldn't get highlighted because the transition from ł to the space next to is is not treated as a word boundary (because ł is not considered a word character). But the "był" in "była" does work, because the transition from ł to a is considered a word boundary, but in the opposite sense as what's intended here: as a word beginning (but regex doesn't differentiate word beginnings from word endings)
    • Finally, as I mentioned above, "była" should be recognized, because the a is followed by a space, so there's a word boundary there as well, which matches the \b. The reason this isn't happening is that regex tries the matches in order; if we simply swap był and była in the regex, or better yet, use the more compact form była?, where the a is optional, then była is correctly matched. But that's just a single word fixed; in general, we should do something that takes care of the faulty \b.

    Here's my suggestion: http://www.regexr.com/38vos

  10. Waldir Pimenta reporter

    Just submitted 3676b1d as the first step towards fixing this: including "before" and "after" groups around the match itself, and put them around the span in the replacement string. I haven't been able to test it yet because apparently the code doesn't go live immediately (I tried clearing my cache, to no avail). As soon as this is confirmed as a fix to #101, I'll implement the suggestion I linked in my comment above.

  11. Waldir Pimenta reporter

    It actually works after e7d7b17 but the code hasn't been updated in the live game yet. Not sure how to force a refresh :/ (Clearing the browser cache doesn't work, it seems to be cached server-level.) Any ideas?

  12. Waldir Pimenta reporter

    Update: it's now showing the updated code for me, and indeed it works correctly :) I'll test it for a bit and if everything looks fine, I'll move to the next step and implement the changes per the links above.

  13. Waldir Pimenta reporter

    Ok, seems to look fine with Portuguese. I now swapped all regexes to use \s instead of \b (3f1773a). As soon as this goes live, highlighting should work with words containing non-ASCII cgaracters (for instance, at the moment highlighting doesn't work at all in Russian and other languages with non-Latin charsets). If you know any such languages, please confirm that the highligting is working correctly (making sure that the main.js loaded by the browser is actually the latest one as can be seen here in the repo), so we can close this issue :)

  14. Waldir Pimenta reporter

    Oh yeah, I forgot to update this issue. Besides the fix of using \s rather than \b, I've been checking how the highlighting fares in the languages I understand reasonably well (pt, gl, es, fr, it, ro) and submitting several code changes to fix / improve small details. It looks pretty good to me, too.

  15. Log in to comment