Date game: add month names in more languages

Issue #63 new

Andre Engels created an issue 2014-06-02

It would be nice if the code recognized month names (and thus full dates rather than just years) in more languages. This is especially useful for languages where I cannot recognize which name it is myself, like Finnish, Polish or Czech - it would save loading two pages in those cases.

Comments (30)

Waldir Pimenta
Looks like it already should do this, see https://bitbucket.org/magnusmanske/wikidata-game/src/81fdd926221198949b0a5c9eb99ae9f227166e8e/public_html/main.js?at=master#cl-56. Am I missing something?
- 2014-06-02T20:41:43+00:00
Magnus Manske repo owner
Good eye Waldir :-)

Could be an issue with additional words like "de la" between number and month name.
- 2014-06-02T21:03:03+00:00

Diego Heras

Please add this for Spanish language:

                        // Full name
                        var r = new RegExp('\\b(\\d{1,2}) de ('+name+') de (\\d{3,4})' ,'gi') ;
                        h = h.replace ( r , "<span class='highlight' day='\$1' month='"+num+"' year='\$3'>\$1.&nbsp;\$2&nbsp;\$3</span>" ) ;

2014-06-02T21:16:32+00:00

Waldir Pimenta
Instead of reinventing a date parser from scratch, I'd suggest using the well-known and feature-rich moment.js, which provides a stable parser for dates in various languages (moment-with-langs.min.js is just 35.4 kB).

The Localized formats LL and perhaps ll seem to be the most common for Wikipedia articles. I tried this in the javascript shell:
```
> load("http://momentjs.com/downloads/moment-with-langs.min.js");
> moment.lang("pt");
> moment("16 de janeiro de 1987", "LL");
Fri Jan 16 1987 00:00:00 GMT+0000
```
- 2014-06-02T21:36:47+00:00
Andre Engels reporter
Ah, I see what is going on. The dates in Finnish and Czech have a dot after the number, as is also often seen in German. I guess that would be one variation to add. Another one I noticed that doesn't work is Chinese (and Japanese). Their style is 1551年9月30日 (for 30 September 1551). I'll add other unrecognized variations here when I find them.
- 2014-06-02T22:20:18+00:00
Magnus Manske repo owner
@waldyrious moment.js looks powerful, but I don't think it does what is needed here, which is finding dates in free text. Once a date is found, we could use moment.js to parse it, but at that point we already have enough knowledge about the date that parsing is almost superfluous.
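
To illustrate the point, here is a minimal sketch of finding dates in free text, loosely modelled on the per-month `RegExp` snippets posted in this thread. `findDates` and the English month list are hypothetical and not taken from main.js. Each match already yields day, month and year, so no separate parsing step is needed.

```javascript
// Hypothetical helper, not part of main.js: scan free text for
// "<day> <month name> <year>" and report what each match already tells us.
function findDates(text, monthNames) {
  const found = [];
  monthNames.forEach(function (name, index) {
    // One pattern per month name, so the month number is known from the loop.
    const r = new RegExp('\\b(\\d{1,2})\\.? (' + name + ') (\\d{3,4})', 'gi');
    let m;
    while ((m = r.exec(text)) !== null) {
      found.push({ day: Number(m[1]), month: index + 1, year: Number(m[3]), at: m.index });
    }
  });
  // Report matches in the order they appear in the text.
  return found.sort(function (a, b) { return a.at - b.at; });
}

const EN_MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
  'August', 'September', 'October', 'November', 'December'];
const dates = findDates('Born 3 May 1770, died 26 March 1827.', EN_MONTHS);
console.log(dates);
```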

@diegoheras754 I've added those lines.
- 2014-06-02T22:22:22+00:00
Andre Engels reporter
OK, immediately another issue (and another possible reason for these languages going wrong): grammatical cases. The Czech date "1. prosince 1888" should be recognized as "1. prosinec".
- 2014-06-02T22:22:46+00:00
Andre Engels reporter
Ignore my remark about the dots. Those are definitely recognized, so it must be the case issue.
- 2014-06-02T22:27:57+00:00
Andre Engels reporter
Czech issue seems to be resolved, you're mighty quick. Diego's proposal for Spanish doesn't seem to work yet though. For Finnish, -ta is appended at the end of the month name (so joulukuuta is used for joulukuu (December))
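
A possible way to handle the Finnish suffix, as a sketch only: allow an optional "ta" after the nominative month name. `matchFinnishDate` and `FI_MONTHS` are hypothetical names, and the pattern assumes the "day-dot month year" layout.

```javascript
// Finnish month names in the nominative; dates in article text append "ta".
const FI_MONTHS = ['tammikuu', 'helmikuu', 'maaliskuu', 'huhtikuu', 'toukokuu', 'kesäkuu',
  'heinäkuu', 'elokuu', 'syyskuu', 'lokakuu', 'marraskuu', 'joulukuu'];

// Hypothetical helper: returns {day, month, year} for a Finnish date in text, or null.
function matchFinnishDate(text) {
  for (let i = 0; i < FI_MONTHS.length; i++) {
    // "(?:ta)?" accepts both "joulukuu" and "joulukuuta".
    const r = new RegExp('\\b(\\d{1,2})\\. (' + FI_MONTHS[i] + ')(?:ta)? (\\d{3,4})', 'i');
    const m = r.exec(text);
    if (m) return { day: Number(m[1]), month: i + 1, year: Number(m[3]) };
  }
  return null;
}

console.log(matchFinnishDate('syntyi 24. joulukuuta 1950'));
```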
- 2014-06-02T22:43:57+00:00
Andre Engels reporter
Also cases for Greek: found Οκτωβρίου for Οκτώβριος (October)
- 2014-06-02T22:48:13+00:00
Andre Engels reporter
Hungarian date not recognized, I assume because of the order: "1770. május 3."
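
If the order is indeed the cause, a year-first pattern might look like the sketch below. `matchHungarianDate` and `HU_MONTHS` are hypothetical names, and the trailing dot after the day is treated as optional, which is an assumption.

```javascript
const HU_MONTHS = ['január', 'február', 'március', 'április', 'május', 'június', 'július',
  'augusztus', 'szeptember', 'október', 'november', 'december'];

// Hypothetical helper: returns {day, month, year} for a Hungarian date in text, or null.
function matchHungarianDate(text) {
  for (let i = 0; i < HU_MONTHS.length; i++) {
    // Year (with dot) comes first, then the month name, then the day.
    const r = new RegExp('\\b(\\d{3,4})\\. (' + HU_MONTHS[i] + ') (\\d{1,2})\\.?', 'i');
    const m = r.exec(text);
    if (m) return { day: Number(m[3]), month: i + 1, year: Number(m[1]) };
  }
  return null;
}

console.log(matchHungarianDate('1770. május 3.'));
```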
- 2014-06-02T22:50:33+00:00
Andre Engels reporter
The "de ... de" form mentioned by Diego for Spanish is also used in Portuguese.
- 2014-06-02T23:00:02+00:00
Andre Engels reporter
Polish seems more complicated: października for październik, but grudnia for grudzień
- 2014-06-03T00:03:50+00:00
Andre Engels reporter
Norwegian (Bokmål) seems to be missing.
- 2014-06-03T06:49:12+00:00
Diego Heras
@andreengels Spanish is working for me. It recognizes dates in Spanish but shows them in the English format.
- 2014-06-03T07:07:31+00:00
calliopejen
In French, the first of the month can be represented as "1er" (rather than 1, e.g. "1er janvier 1962"); this is not recognized.
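
A sketch of one way to accept the ordinal, written in the same style as the other snippets in this thread: an optional "er" group after the day. `highlightFrenchDates` and `FR_MONTHS` are hypothetical names, and the output markup simply mirrors the `highlight` spans used above.

```javascript
const FR_MONTHS = ['janvier', 'février', 'mars', 'avril', 'mai', 'juin', 'juillet',
  'août', 'septembre', 'octobre', 'novembre', 'décembre'];

// Hypothetical helper: wraps French dates in a highlight span.
function highlightFrenchDates(h) {
  FR_MONTHS.forEach(function (name, index) {
    const num = index + 1;
    // "(er)?" is optional; when it does not match, $2 expands to an empty string.
    const r = new RegExp('\\b(\\d{1,2})(er)? (' + name + ') (\\d{3,4})', 'gi');
    h = h.replace(r, "<span class='highlight' day='$1' month='" + num + "' year='$4'>$1$2 $3 $4</span>");
  });
  return h;
}

console.log(highlightFrenchDates('né le 1er janvier 1962'));
```

For simplicity the sketch accepts "er" after any day number, not only after 1.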
- 2014-06-04T05:01:34+00:00
Lokal Profil
@andreengels Could the issue with Norwegian (Bokmål) be the difference between the language code (nb) and the site code (nowiki)?

I'm unsure how getAliasesForLanguage() works (or where it lives) but the language it is being sent is "no" not "nb" (line 68).
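
If that is the cause, something like the following override table might be enough. This is only a sketch: `languageForSite` and `LANGUAGE_OVERRIDES` are hypothetical names, and I have not checked how the real lookup is wired up.

```javascript
// Hypothetical mapping from wiki prefix to the language code used for month-name lookups.
const LANGUAGE_OVERRIDES = new Map([
  ['no', 'nb'] // nowiki is written in Bokmål, whose language code is "nb"
]);

// Strip the "wiki" suffix from the site code and apply any override.
function languageForSite(siteCode) {
  const prefix = siteCode.replace(/wiki$/, '');
  return LANGUAGE_OVERRIDES.get(prefix) || prefix;
}

console.log(languageForSite('nowiki')); // nb
console.log(languageForSite('fiwiki')); // fi
```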
- 2014-06-04T18:23:26+00:00
Waldir Pimenta
Issue ~~#71~~ was marked as a duplicate of this issue.
- 2014-06-05T15:51:05+00:00
TMg_
It seems some languages add additional characters to their month names when they are used as born/died dates, e.g. "8. elokuuta 1979" in fiwiki.
- 2014-06-05T17:02:49+00:00
Waldir Pimenta
Issue ~~#100~~ was marked as a duplicate of this issue.
- 2014-06-07T13:14:56+00:00
Waldir Pimenta
- changed title to Date game: add month names in more languages
- 2014-06-08T14:28:40+00:00
Waldir Pimenta
Issue ~~#102~~ was marked as a duplicate of this issue.
- 2014-06-08T14:28:52+00:00
Lokal Profil
Norwegian (Bokmål) resolved through 15eaa83, which should also solve some rarer languages (incl. simple.wp).
- 2014-06-09T07:36:11+00:00
Jan Dudík
In Czech, dates are often written in the genitive form:
01 - leden / ledna
02 - únor / února
03 - březen / března
04 - duben / dubna
05 - květen / května
06 - červen / června
07 - červenec / července
08 - srpen / srpna
09 - září (both nominative and genitive)
10 - říjen / října
11 - listopad / listopadu
12 - prosinec / prosince

In Slovak, the situation with the genitive is similar:
01 - január / januára
02 - február / februára
03 - marec / marca
04 - apríl / apríla
05 - máj / mája
06 - jún / júna
07 - júl / júla
08 - august / augusta
09 - september / septembra
10 - október / októbra
11 - november / novembra
12 - december / decembra
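
A sketch of how the Czech forms listed above could be fed into a single pattern per month, by joining nominative and genitive with `|`. `matchCzechDate` and `CS_MONTH_FORMS` are hypothetical names; the Slovak list could be handled with the same shape of table.

```javascript
// Nominative and genitive forms per month, as listed in this comment.
const CS_MONTH_FORMS = [
  ['leden', 'ledna'], ['únor', 'února'], ['březen', 'března'], ['duben', 'dubna'],
  ['květen', 'května'], ['červen', 'června'], ['červenec', 'července'], ['srpen', 'srpna'],
  ['září'], ['říjen', 'října'], ['listopad', 'listopadu'], ['prosinec', 'prosince']
];

// Hypothetical helper: returns {day, month, year} for a Czech date in text, or null.
function matchCzechDate(text) {
  for (let i = 0; i < CS_MONTH_FORMS.length; i++) {
    // Either form may appear; the space after the group rejects partial matches
    // such as "červen" inside "července".
    const forms = CS_MONTH_FORMS[i].join('|');
    const r = new RegExp('\\b(\\d{1,2})\\. (' + forms + ') (\\d{3,4})', 'i');
    const m = r.exec(text);
    if (m) return { day: Number(m[1]), month: i + 1, year: Number(m[3]) };
  }
  return null;
}

console.log(matchCzechDate('1. prosince 1888'));
```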
- 2014-06-09T11:11:46+00:00
Hsarrazin
I had trouble with Polish, Finnish, Hungarian... but also Spanish (day numbers such as "31 de", "27 de"...), Ukrainian...

There is also another type of problem with the birth year on some items (like Q16651397), where the dates are written as (1845—1910) within Russian text... :(
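
For the parenthesized year range, a pattern along these lines might work. It is only a sketch: `matchLifespan` is a hypothetical name, and reading the two years as birth and death is an assumption that will not hold for every such range.

```javascript
// Two years in parentheses joined by an em dash (\u2014), en dash (\u2013) or hyphen.
const LIFESPAN = /\((\d{3,4})\s*[\u2014\u2013-]\s*(\d{3,4})\)/;

// Hypothetical helper: returns {born, died} for the first range in text, or null.
function matchLifespan(text) {
  const m = LIFESPAN.exec(text);
  return m ? { born: Number(m[1]), died: Number(m[2]) } : null;
}

console.log(matchLifespan('Name (1845\u20141910) text'));
```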
- 2014-06-15T17:08:30+00:00
Hsarrazin
Oh, and could you please filter out the ISBN numbers? ;) They always begin with ISBN and are 10 digits with dots, or 13 digits with dots?
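
A rough sketch of such a filter, which would run before the date patterns. `stripIsbns` and `ISBN_PATTERN` are hypothetical names; the pattern assumes the shape described above (the word ISBN, then 10 or 13 digits separated by dots or hyphens, with a possible X as the last character) and does not validate check digits.

```javascript
// "ISBN", optional colon or whitespace, then a digit, 8 to 15 digits/dots/hyphens,
// and a final digit or X. That covers 10 digits unseparated up to 13 digits with 4 separators.
const ISBN_PATTERN = /\bISBN[:\s]*[0-9][0-9.\-]{8,15}[0-9Xx]/g;

// Hypothetical helper: remove ISBNs so their digit groups are not mistaken for years.
function stripIsbns(text) {
  return text.replace(ISBN_PATTERN, '');
}

console.log(stripIsbns('Published 1987, ISBN 0-306-40615-2.'));
```

ISBNs whose groups are separated by spaces are deliberately not covered, since allowing spaces would risk swallowing a year that follows the number.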
- 2014-06-15T17:11:05+00:00
Paul Kaganer
Please add month recognition for the Lithuanian (lt) language.

Wikipedia articles use the following date format: YYYY m. MONTHNAME DD, with these month names:
1 - sausio
2 - vasario
3 - kovo
4 - balandžio
5 - gegužės
6 - birželio
7 - liepos
8 - rugpjūčio
9 - rugsėjo
10 - spalio
11 - lapkričio
12 - gruodžio
- 2015-09-18T11:18:16+00:00
Paul Kaganer
For the Armenian (hy) language, Wikipedia articles use the following date format: YYYY, MONTHNAME DD, with these month names:
1 - հունվարի
2 - փետրվարի
3 - մարտի
4 - ապրիլի
5 - մայիսի
6 - հունիսի
7 - հուլիսի
8 - օգոստոսի
9 - սեպտեմբերի
10 - հոկտեմբերի
11 - նոյեմբերի
12 - դեկտեմբերի
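
The Lithuanian and Armenian formats differ only in the separator after the year, so one table-driven pattern could cover both. The following is a sketch under that assumption; `matchYearFirstDate` and `YEAR_FIRST_FORMATS` are hypothetical names, and the month names are the ones listed in this comment and the previous one.

```javascript
// Per-language separator (as a regex fragment) between year and month name.
const YEAR_FIRST_FORMATS = new Map([
  ['lt', { separator: ' m\\. ', months: ['sausio', 'vasario', 'kovo', 'balandžio', 'gegužės',
    'birželio', 'liepos', 'rugpjūčio', 'rugsėjo', 'spalio', 'lapkričio', 'gruodžio'] }],
  ['hy', { separator: ', ', months: ['հունվարի', 'փետրվարի', 'մարտի', 'ապրիլի', 'մայիսի',
    'հունիսի', 'հուլիսի', 'օգոստոսի', 'սեպտեմբերի', 'հոկտեմբերի', 'նոյեմբերի', 'դեկտեմբերի'] }]
]);

// Hypothetical helper: returns {day, month, year} for a "YYYY<sep>MONTHNAME DD" date, or null.
function matchYearFirstDate(text, lang) {
  const format = YEAR_FIRST_FORMATS.get(lang);
  if (!format) return null;
  for (let i = 0; i < format.months.length; i++) {
    const r = new RegExp('\\b(\\d{3,4})' + format.separator + '(' + format.months[i] + ') (\\d{1,2})');
    const m = r.exec(text);
    if (m) return { day: Number(m[3]), month: i + 1, year: Number(m[1]) };
  }
  return null;
}

console.log(matchYearFirstDate('1918 m. vasario 16', 'lt'));
console.log(matchYearFirstDate('1869, փետրվարի 19', 'hy'));
```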
- 2015-09-18T11:25:41+00:00

96187

Cantonese, Gan and Wu use a different character for the day, 號 (yue and gan-hant) and 号 (gan-hans and wuu) instead of 日. I would suggest changing the first two lines of the Chinese section to:

r = new RegExp('\\b(\\d{3,4})年(\\d{1,2})月(\\d{1,2})(日|號|号)','g') ;
h = h.replace ( r , "<span class='highlight' day='\$3' month='\$2' year='\$1'>\$1年<span/>\$2月<span/>\$3\$4</span>" ) ;

Korean uses a format very similar to Chinese/Japanese, except there are spaces between each part and Hangul is used instead of Chinese characters (년 instead of 年, 월 instead of 月 and 일 instead of 日). I would suggest adding the following under the Chinese section:

r = new RegExp('\\b(\\d{3,4})년 (\\d{1,2})월 (\\d{1,2})일','g') ;
h = h.replace ( r , "<span class='highlight' day='\$3' month='\$2' year='\$1'>\$1년 <span/>\$2월 <span/>\$3일</span>" ) ;
r = new RegExp('\\b(\\d{3,4})년 (\\d{1,2})월','g') ;
h = h.replace ( r , "<span class='highlight' day='' month='\$2' year='\$1'>\$1년 <span/>\$2월</span>" ) ;
r = new RegExp('\\b(\\d{3,4})년([^<])','g') ;
h = h.replace ( r , "<span class='highlight' day='' month='' year='\$1'>\$1년</span>\$2" ) ;

I'm not sure what the <span/>s do, so I left them there. It would probably be possible to merge the two sections if someone really wanted to, but I imagine it's more readable if it's left separate.
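
For what it's worth, a merged version of the full-date pattern could look roughly like the sketch below. It uses character classes for the year, month and day markers and an optional space for Korean. `findCjkDates` and `CJK_DATE` are hypothetical names, and it returns plain objects instead of producing the highlight markup.

```javascript
// One pattern for both styles: 年/년, 月/월 and 日/號/号/일, with an optional space
// after the year and month markers (Korean uses one, Chinese/Japanese do not).
const CJK_DATE = /\b(\d{3,4})[年년] ?(\d{1,2})[月월] ?(\d{1,2})[日號号일]/g;

// Hypothetical helper: collect every full date found in text.
function findCjkDates(text) {
  const found = [];
  for (const m of text.matchAll(CJK_DATE)) {
    found.push({ year: Number(m[1]), month: Number(m[2]), day: Number(m[3]) });
  }
  return found;
}

console.log(findCjkDates('1551年9月30日'));
console.log(findCjkDates('1919년 3월 1일'));
```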

2015-10-10T16:20:49+00:00

96187

Maltese has the word ta' (of) between the number and month name. Something like the following (copied from the Esperanto one) should work:

r = new RegExp('\\b(\\d{1,2}) ta\' ('+name+') (\\d{3,4})' ,'gi') ;
h = h.replace ( r , "<span class='highlight' day='\$1' month='"+num+"' year='\$3'>\$1 ta' \$2 \$3</span>" ) ;

2015-10-11T12:23:49+00:00

Assignee: –

Type: enhancement

Priority: major

Status: new

Votes: 7

Watchers: 12