Tokenizer giving wrong tokens for 짝짜꿍옹앙옹알달콩이

Issue #25 new
swarnim singhal created an issue

The tokens produced for the word 짝짜꿍옹앙옹알달콩이 are out of order:

GET /soon_test/_analyze/?pretty
{
  "analyzer":"korean",
  "text":"짝짜꿍옹앙옹알달콩이",
  "explain":true
}

======================================================================================

{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "seunjeon_default_tokenizer",
      "tokens" : [
        {
          "token" : "짝짜꿍/M",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "M",
          "position" : 0,
          "bytes" : "[ec a7 9d ec a7 9c ea bf 8d 2f 4d]",
          "positionLength" : 1
        },
        {
          "token" : "옹/N",
          "start_offset" : 3,
          "end_offset" : 4,
          "type" : "N",
          "position" : 1,
          "bytes" : "[ec 98 b9 2f 4e]",
          "positionLength" : 1
        },
        {
          "token" : "앙/N",
          "start_offset" : 4,
          "end_offset" : 5,
          "type" : "N",
          "position" : 2,
          "bytes" : "[ec 95 99 2f 4e]",
          "positionLength" : 1
        },
        {
          "token" : "옹/N",
          "start_offset" : 5,
          "end_offset" : 6,
          "type" : "N",
          "position" : 3,
          "bytes" : "[ec 98 b9 2f 4e]",
          "positionLength" : 1
        },
        {
          "token" : "알/V",
          "start_offset" : 6,
          "end_offset" : 7,
          "type" : "V",
          "position" : 4,
          "bytes" : "[ec 95 8c 2f 56]",
          "positionLength" : 1
        },
        {
          "token" : "하/V",
          "start_offset" : 9,
          "end_offset" : 10,
          "type" : "V",
          "position" : 5,
          "bytes" : "[ed 95 98 2f 56]",
          "positionLength" : 1
        },
        {
          "token" : "콩/N",
          "start_offset" : 8,
          "end_offset" : 9,
          "type" : "N",
          "position" : 6,
          "bytes" : "[ec bd a9 2f 4e]",
          "positionLength" : 1
        }
      ]
    },
    "tokenfilters" : [ ]
  }
}

The last two tokens are out of order (콩/N at offsets 8-9 comes after 하/V at offsets 9-10), and the second-to-last token, 하/V, does not even appear anywhere in the input word.
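
For reference, a minimal sketch of a script that reproduces the request and flags both problems. It assumes the cluster from the report is reachable at http://localhost:9200 and uses the Python requests library; neither is stated in the report.

# Reproduce the _analyze call and check the reported symptoms:
# non-decreasing start_offsets and token surfaces matching the input text.
import requests

TEXT = "짝짜꿍옹앙옹알달콩이"

resp = requests.get(
    "http://localhost:9200/soon_test/_analyze",  # assumed host/port
    json={"analyzer": "korean", "text": TEXT, "explain": True},
)
tokens = resp.json()["detail"]["tokenizer"]["tokens"]

prev_start = -1
for tok in tokens:
    start, end = tok["start_offset"], tok["end_offset"]
    # seunjeon appends the POS tag after a slash, e.g. "짝짜꿍/M"
    surface = tok["token"].rsplit("/", 1)[0]
    if start < prev_start:
        print(f"out of order: {tok['token']} starts at {start}, after {prev_start}")
    if TEXT[start:end] != surface:
        print(f"mismatch: {tok['token']} vs input substring {TEXT[start:end]!r}")
    prev_start = start

Against the output posted above, this prints an "out of order" line for 콩/N and a "mismatch" line for 하/V (the input substring at offsets 9-10 is 이, not 하).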
