Date game: add month names in more languages

Issue #63 new
Andre Engels created an issue

It would be nice if the code recognized month names (and thus full dates rather than just years) in more languages. This is especially useful for languages where I cannot recognize which name it is myself, like Finnish, Polish or Czech - it would save loading two pages in those cases.

Comments (30)

  1. Magnus Manske repo owner

    Good eye Waldir :-)

    Could be an issue with additional words like "de la" between number and month name.

  2. Diego Heras

    Please add this for Spanish language:

    // Full name
    var r = new RegExp('\\b(\\d{1,2}) de ('+name+') de (\\d{3,4})' ,'gi') ;
    h = h.replace ( r , "<span class='highlight' day='\$1' month='"+num+"' year='\$3'>\$1.&nbsp;\$2&nbsp;\$3</span>" ) ;
    
  3. Waldir Pimenta

    Instead of reinventing a date parser from scratch, I'd suggest using the well-known and feature-rich moment.js, which provides a stable parser for dates in various languages (moment-with-langs.min.js is just 35.4 kB).

    The localized formats LL and perhaps ll seem to be the most common for Wikipedia articles. I tried this in the JavaScript shell:

    > load("http://momentjs.com/downloads/moment-with-langs.min.js");
    > moment.lang("pt");
    > moment("16 de janeiro de 1987", "LL");
    Fri Jan 16 1987 00:00:00 GMT+0000
    
  4. Andre Engels reporter

    Ah, I see what is going on. The dates in Finnish and Czech have a dot after the number, as is also often seen in German. I guess that would be one variation to add. Another one I noticed that doesn't work is Chinese (and Japanese). Their style is 1551年9月30日 (for 30 September 1551). I'll add other unrecognized variations here when I find them.

  5. Magnus Manske repo owner

    @waldyrious moment.js looks powerful, but I don't think it does what is needed here, which is finding dates in free text. Once a date is found, we could use moment.js to parse it, but at that point we already have enough knowledge about the date that parsing is almost superfluous.

    @diegoheras754 I've added those lines.

  6. Andre Engels reporter

    OK, immediately another issue (and another possible reason for these languages to go wrong): grammatical case. The Czech date "1. prosince 1888" should be recognized as the month "prosinec", i.e. "1. prosinec 1888".

  7. Andre Engels reporter

    Ignore my remark about the dots; those are definitely recognized, so it must be the case issue.

  8. Andre Engels reporter

    The Czech issue seems to be resolved; you're mighty quick. Diego's proposal for Spanish doesn't seem to work yet, though. For Finnish, -ta is appended to the month name (so joulukuuta is used for joulukuu, December).
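
    A regular suffix such as the Finnish -ta can be absorbed into the pattern itself. The following is only a minimal sketch, not the tool's actual code: `finnishDatePattern` is a hypothetical helper, the "day. month year" layout is an assumption, and the sample sentence is made up for illustration. Languages where the stem itself changes would need an explicit list of forms instead.

```javascript
// Hypothetical helper: builds a pattern for one Finnish month name.
// Assumes the "day. month year" layout with a dot after the day number.
function finnishDatePattern(name) {
  // "(?:ta)?" accepts the base form (joulukuu) and the "-ta" form (joulukuuta),
  // while the capture group keeps only the base name.
  return new RegExp('\\b(\\d{1,2})\\. (' + name + ')(?:ta)? (\\d{3,4})', 'gi');
}

const finnishMatch = finnishDatePattern('joulukuu').exec('syntyi 6. joulukuuta 1917');
console.log(finnishMatch[1], finnishMatch[2], finnishMatch[3]); // 6 joulukuu 1917
```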

  9. Andre Engels reporter

    Polish seems more complicated: października for październik, but grudnia for grudzień

  10. calliopejen

    In French, the first of the month can be represented as "1er" (rather than 1, e.g. "1er janvier 1962"); this is not recognized.
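
    One possible way to accept the ordinal is an optional "er" after the day number. This is a sketch under assumptions: `frenchDatePattern` is a hypothetical helper modelled on the per-month patterns quoted elsewhere in this thread, not the tool's real code.

```javascript
// Hypothetical helper: day-month-year pattern for one French month name.
function frenchDatePattern(name) {
  // "(?:er)?" sits outside the capture group, so "1er" still yields day "1".
  // Caveat: it would also accept "2er", which is not valid French.
  return new RegExp('\\b(\\d{1,2})(?:er)? (' + name + ') (\\d{3,4})', 'gi');
}

const frenchMatch = frenchDatePattern('janvier').exec('1er janvier 1962');
console.log(frenchMatch[1], frenchMatch[2], frenchMatch[3]); // 1 janvier 1962
```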

  11. Lokal Profil

    @andreengels Could the issue with Norwegian (Bokmål) be the difference between the language code (nb) and the site code (nowiki)?

    I'm unsure how getAliasesForLanguage() works (or where it lives) but the language it is being sent is "no" not "nb" (line 68).
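
    If the mismatch really is between site code and language code, an explicit override table is one way to handle it. The sketch below is purely illustrative: `SITE_TO_LANGUAGE` and `languageCodeForSite` are hypothetical names, and how getAliasesForLanguage() actually consumes the code is not shown in this thread.

```javascript
// Hypothetical override table: site codes whose language code is not
// simply the site code with "wiki" removed.
const SITE_TO_LANGUAGE = { nowiki: 'nb' };

function languageCodeForSite(site) {
  if (Object.prototype.hasOwnProperty.call(SITE_TO_LANGUAGE, site)) {
    return SITE_TO_LANGUAGE[site];
  }
  // Default assumption: "fiwiki" -> "fi", "cswiki" -> "cs", and so on.
  return site.replace(/wiki$/, '');
}

console.log(languageCodeForSite('nowiki')); // nb
console.log(languageCodeForSite('fiwiki')); // fi
```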

  12. TMg_

    It seems some languages add additional characters to their month names when they are used as born/died dates, e.g. "8. elokuuta 1979" in fiwiki.

  13. Jan Dudík

    In Czech, dates are often written in the genitive form (nominative / genitive):

    01 - leden / ledna
    02 - únor / února
    03 - březen / března
    04 - duben / dubna
    05 - květen / května
    06 - červen / června
    07 - červenec / července
    08 - srpen / srpna
    09 - září (both nominative and genitive)
    10 - říjen / října
    11 - listopad / listopadu
    12 - prosinec / prosince

    In Slovak, the situation with the genitive is similar:

    01 - január / januára
    02 - február / februára
    03 - marec / marca
    04 - apríl / apríla
    05 - máj / mája
    06 - jún / júna
    07 - júl / júla
    08 - august / augusta
    09 - september / septembra
    10 - október / októbra
    11 - november / novembra
    12 - december / decembra
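
    A table like the Czech one above can be turned into one pattern per month, with both forms as alternatives mapped to the same month number. This is a minimal sketch, assuming the "day. month year" layout; `czechMonths` and `findCzechDates` are hypothetical names, not part of the tool.

```javascript
// Nominative / genitive pairs as listed above (index 0 = January).
const czechMonths = [
  ['leden', 'ledna'], ['únor', 'února'], ['březen', 'března'],
  ['duben', 'dubna'], ['květen', 'května'], ['červen', 'června'],
  ['červenec', 'července'], ['srpen', 'srpna'], ['září', 'září'],
  ['říjen', 'října'], ['listopad', 'listopadu'], ['prosinec', 'prosince'],
];

// Returns every date found; results are grouped by month, not by text position.
function findCzechDates(text) {
  const found = [];
  czechMonths.forEach((forms, index) => {
    // Either form is accepted; the space required after the name keeps
    // "červen" from matching inside "července".
    const r = new RegExp('\\b(\\d{1,2})\\. (' + forms.join('|') + ') (\\d{3,4})', 'g');
    for (const m of text.matchAll(r)) {
      found.push({ day: Number(m[1]), month: index + 1, year: Number(m[3]) });
    }
  });
  return found;
}

console.log(findCzechDates('1. prosince 1888'));
```

    The same structure should work for the Slovak list by swapping in the Slovak pairs.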

  14. Hsarrazin

    I had trouble with Polish, Finnish, Hungarian... but also Spanish (day number "31 de", "27 de"...), Ukrainian...

    And also (a different kind of problem) with the birth year on some items (like Q16651397), where dates are written as (1845—1910) within Russian text... :(
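
    A parenthesised year range like the one mentioned here could be picked up with a dedicated pattern. This is a sketch only: `findYearRanges` is a hypothetical helper, and treating the two years as birth and death is an assumption that will not hold for every item.

```javascript
// Two years in parentheses, separated by an em dash, en dash or hyphen.
const YEAR_RANGE = /\((\d{3,4})\s*[\u2014\u2013-]\s*(\d{3,4})\)/g;

// Assumption: the first year is the birth year and the second the death year.
function findYearRanges(text) {
  return [...text.matchAll(YEAR_RANGE)].map((m) => ({
    birthYear: Number(m[1]),
    deathYear: Number(m[2]),
  }));
}

console.log(findYearRanges('(1845\u20141910)'));
```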

  15. Hsarrazin

    Oh, and could you please filter out the ISBN numbers? ;) They always begin with ISBN and are 10 digits with dots, or 13 digits with dots.
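
    One way to keep ISBNs out of the date matcher is to remove them from a copy of the text before any date pattern runs. The sketch below assumes the description above (the literal prefix "ISBN", then 10 or 13 digits) and also allows hyphens as separators; `stripIsbns` is a hypothetical helper, and a real implementation would need to leave the displayed page untouched.

```javascript
// "ISBN", an optional colon or spaces, then digits separated by dots or hyphens.
const ISBN_CANDIDATE = /\bISBN[:\s]*(\d[\d.\-]*[\dXx])/g;

// Removes candidates that contain exactly 10 or 13 digits (a trailing X counts).
function stripIsbns(text) {
  return text.replace(ISBN_CANDIDATE, (whole, number) => {
    const digits = number.replace(/[^\dXx]/g, '');
    return digits.length === 10 || digits.length === 13 ? '' : whole;
  });
}

console.log(stripIsbns('ISBN 978-3-16-148410-0, 1999')); // ", 1999"
```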

  16. Paul Kaganer

    Please add month recognition for the Lithuanian (lt) language.

    Wikipedia articles use the following date format: YYYY m. MONTHNAME DD, with these month names:

    1 - sausio
    2 - vasario
    3 - kovo
    4 - balandžio
    5 - gegužės
    6 - birželio
    7 - liepos
    8 - rugpjūčio
    9 - rugsėjo
    10 - spalio
    11 - lapkričio
    12 - gruodžio

  17. Paul Kaganer

    For the Armenian (hy) language, Wikipedia articles use the following date format: YYYY, MONTHNAME DD, with these month names:

    1 - հունվարի
    2 - փետրվարի
    3 - մարտի
    4 - ապրիլի
    5 - մայիսի
    6 - հունիսի
    7 - հուլիսի
    8 - օգոստոսի
    9 - սեպտեմբերի
    10 - հոկտեմբերի
    11 - նոյեմբերի
    12 - դեկտեմբերի
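
    The Lithuanian and Armenian formats differ only in what separates the year from the month name, so both could share one year-first pattern. This is a minimal sketch; `yearFirstFormats` and `findYearFirstDates` are hypothetical names, and the month lists are copied from the two comments above.

```javascript
// Year-first formats: separator between year and month name, plus month names
// (index 0 = January). The separators are regex fragments.
const yearFirstFormats = {
  lt: {
    separator: ' m\\. ',
    months: ['sausio', 'vasario', 'kovo', 'balandžio', 'gegužės', 'birželio',
      'liepos', 'rugpjūčio', 'rugsėjo', 'spalio', 'lapkričio', 'gruodžio'],
  },
  hy: {
    separator: ', ',
    months: ['հունվարի', 'փետրվարի', 'մարտի', 'ապրիլի', 'մայիսի', 'հունիսի',
      'հուլիսի', 'օգոստոսի', 'սեպտեմբերի', 'հոկտեմբերի', 'նոյեմբերի', 'դեկտեմբերի'],
  },
};

function findYearFirstDates(text, lang) {
  const format = yearFirstFormats[lang];
  // Case-sensitive on purpose, so the matched name can be looked up directly.
  const r = new RegExp(
    '\\b(\\d{3,4})' + format.separator + '(' + format.months.join('|') + ') (\\d{1,2})\\b',
    'g'
  );
  return [...text.matchAll(r)].map((m) => ({
    year: Number(m[1]),
    month: format.months.indexOf(m[2]) + 1,
    day: Number(m[3]),
  }));
}

console.log(findYearFirstDates('1990 m. kovo 11', 'lt'));
```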

  18. 96187

    Cantonese, Gan and Wu use a different character for the day, 號 (yue and gan-hant) and 号 (gan-hans and wuu) instead of 日. I would suggest changing the first two lines of the Chinese section to:

    r = new RegExp('\\b(\\d{3,4})年(\\d{1,2})月(\\d{1,2})(日|號|号)','g') ;
    h = h.replace ( r , "<span class='highlight' day='\$3' month='\$2' year='\$1'>\$1年<span/>\$2月<span/>\$3\$4</span>" ) ;
    

    Korean uses a format very similar to Chinese/Japanese, except there are spaces between each part and Hangul is used instead of Chinese characters (년 instead of 年, 월 instead of 月 and 일 instead of 日). I would suggest adding the following under the Chinese section:

    r = new RegExp('\\b(\\d{3,4})년 (\\d{1,2})월 (\\d{1,2})일','g') ;
    h = h.replace ( r , "<span class='highlight' day='\$3' month='\$2' year='\$1'>\$1년 <span/>\$2월 <span/>\$3일</span>" ) ;
    r = new RegExp('\\b(\\d{3,4})년 (\\d{1,2})월','g') ;
    h = h.replace ( r , "<span class='highlight' day='' month='\$2' year='\$1'>\$1년 <span/>\$2월</span>" ) ;
    r = new RegExp('\\b(\\d{3,4})년([^<])','g') ;
    h = h.replace ( r , "<span class='highlight' day='' month='' year='\$1'>\$1년</span>\$2" ) ;
    

    I'm not sure what the <span/>s do, so I left them there. It would probably be possible to merge the two sections if someone really wanted to, but I imagine it's more readable if it's left separate.
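
    For comparison, a merged version of the full-date pattern might look like the sketch below. It is an illustration rather than a drop-in replacement: `CJK_DATE` and `findCjkDates` are hypothetical names, the highlighting markup is left out, and the pattern is more permissive than the separate sections because it would accept mixed Chinese and Hangul markers.

```javascript
// Year, month and day markers in Chinese/Japanese (年 月 日), Cantonese, Gan
// and Wu (號 / 号 for the day) and Korean (년 월 일, with optional spaces).
const CJK_DATE = /\b(\d{3,4})(?:年|년) ?(\d{1,2})(?:月|월) ?(\d{1,2})(日|號|号|일)/g;

function findCjkDates(text) {
  return [...text.matchAll(CJK_DATE)].map((m) => ({
    year: Number(m[1]),
    month: Number(m[2]),
    day: Number(m[3]),
  }));
}

console.log(findCjkDates('1551年9月30日'));
```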

  19. 96187

    Maltese has the word ta' (of) between the number and month name. Something like the following (copied from the Esperanto one) should work:

    r = new RegExp('\\b(\\d{1,2}) ta\' ('+name+') (\\d{3,4})' ,'gi') ;
    h = h.replace ( r , "<span class='highlight' day='\$1' month='"+num+"' year='\$3'>\$1 ta' \$2 \$3</span>" ) ;
    