Tokenizer giving wrong tokens for 짝짜꿍옹앙옹알달콩이
Issue #25
The tokens produced for the word 짝짜꿍옹앙옹알달콩이 are emitted out of order:
GET /soon_test/_analyze/?pretty
{
  "analyzer": "korean",
  "text": "짝짜꿍옹앙옹알달콩이",
  "explain": true
}
======================================================================================
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "seunjeon_default_tokenizer",
      "tokens" : [
        {
          "token" : "짝짜꿍/M",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "M",
          "position" : 0,
          "bytes" : "[ec a7 9d ec a7 9c ea bf 8d 2f 4d]",
          "positionLength" : 1
        },
        {
          "token" : "옹/N",
          "start_offset" : 3,
          "end_offset" : 4,
          "type" : "N",
          "position" : 1,
          "bytes" : "[ec 98 b9 2f 4e]",
          "positionLength" : 1
        },
        {
          "token" : "앙/N",
          "start_offset" : 4,
          "end_offset" : 5,
          "type" : "N",
          "position" : 2,
          "bytes" : "[ec 95 99 2f 4e]",
          "positionLength" : 1
        },
        {
          "token" : "옹/N",
          "start_offset" : 5,
          "end_offset" : 6,
          "type" : "N",
          "position" : 3,
          "bytes" : "[ec 98 b9 2f 4e]",
          "positionLength" : 1
        },
        {
          "token" : "알/V",
          "start_offset" : 6,
          "end_offset" : 7,
          "type" : "V",
          "position" : 4,
          "bytes" : "[ec 95 8c 2f 56]",
          "positionLength" : 1
        },
        {
          "token" : "하/V",
          "start_offset" : 9,
          "end_offset" : 10,
          "type" : "V",
          "position" : 5,
          "bytes" : "[ed 95 98 2f 56]",
          "positionLength" : 1
        },
        {
          "token" : "콩/N",
          "start_offset" : 8,
          "end_offset" : 9,
          "type" : "N",
          "position" : 6,
          "bytes" : "[ec bd a9 2f 4e]",
          "positionLength" : 1
        }
      ]
    },
    "tokenfilters" : [ ]
  }
}
The last two tokens are out of order: 하/V spans offsets 9–10 but is emitted before 콩/N, which spans offsets 8–9. On top of that, the surface form 하 does not appear anywhere in the input word (the character at offsets 9–10 is 이).
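To make the two defects concrete, here is a minimal sketch that replays the offsets from the response above against the input string. The `check` helper and the trimmed token list are mine, written for illustration; they are not part of seunjeon or Elasticsearch.

```python
# Hypothetical checker: flag tokens whose start offsets go backwards,
# and tokens whose surface form does not match the substring they claim
# to cover. Token offsets are copied from the _analyze response above.

text = "짝짜꿍옹앙옹알달콩이"

tokens = [
    {"token": "짝짜꿍/M", "start_offset": 0, "end_offset": 3},
    {"token": "옹/N",    "start_offset": 3, "end_offset": 4},
    {"token": "앙/N",    "start_offset": 4, "end_offset": 5},
    {"token": "옹/N",    "start_offset": 5, "end_offset": 6},
    {"token": "알/V",    "start_offset": 6, "end_offset": 7},
    {"token": "하/V",    "start_offset": 9, "end_offset": 10},
    {"token": "콩/N",    "start_offset": 8, "end_offset": 9},
]

def check(tokens, text):
    """Return a list of problems found in the token stream."""
    problems = []
    prev_start = -1
    for t in tokens:
        s, e = t["start_offset"], t["end_offset"]
        if s < prev_start:
            problems.append(
                f"out of order: {t['token']} starts at {s} after {prev_start}")
        prev_start = s
        surface = t["token"].split("/")[0]  # strip the POS tag suffix
        if text[s:e] != surface:
            problems.append(
                f"mismatch: {t['token']} covers {text[s:e]!r}")
    return problems

for p in check(tokens, text):
    print(p)
```

Running this reports exactly the two issues described: 하/V does not match the text it covers, and 콩/N starts before the previous token.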