We have run into an issue when indexing and searching multilingual Japanese/Arabic documents with the kuromoji analyzer. The problem is that the analyzer filters out all Arabic tokens, while tokens in other scripts are preserved. For example, analyzing mixed Japanese and Arabic text:
GET /_analyze
{
  "analyzer" : "kuromoji",
  "text" : "医療用. أعلنت الحكومة التشيكية أمس الإثنين، أنها اشترت من إسرائيل منظومة دفاع جوي من أربع"
}

returns only the two Japanese tokens; every Arabic token is dropped:

{
  "tokens" : [
    {
      "token" : "医療",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "用",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    }
  ]
}
For comparison, the same Japanese text followed by a Russian sentence:

GET /_analyze
{
  "analyzer" : "kuromoji",
  "text" : "医療用. Чешское правительство в понедельник купило у Израиля систему обороны"
}

keeps both the Japanese and the Cyrillic tokens:

{
  "tokens" : [
    {
      "token" : "医療",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "用",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "чешское",
      "start_offset" : 5,
      "end_offset" : 12,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "правительство",
      "start_offset" : 13,
      "end_offset" : 26,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "в",
      "start_offset" : 27,
      "end_offset" : 28,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "понедельник",
      "start_offset" : 29,
      "end_offset" : 40,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "купило",
      "start_offset" : 41,
      "end_offset" : 47,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "у",
      "start_offset" : 48,
      "end_offset" : 49,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "израиля",
      "start_offset" : 50,
      "end_offset" : 57,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "систему",
      "start_offset" : 58,
      "end_offset" : 65,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "обороны",
      "start_offset" : 66,
      "end_offset" : 73,
      "type" : "word",
      "position" : 10
    }
  ]
}
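To see which component removes the Arabic tokens, the analyzer can be decomposed in the _analyze request into its tokenizer and token filters (the chain below follows the components documented for the built-in kuromoji analyzer) and called with "explain": true, which reports the token stream after the tokenizer and after each filter in turn. A minimal sketch:

GET /_analyze
{
  "tokenizer" : "kuromoji_tokenizer",
  "filter" : ["kuromoji_baseform", "kuromoji_part_of_speech", "cjk_width", "ja_stop", "kuromoji_stemmer", "lowercase"],
  "text" : "医療用. أعلنت الحكومة التشيكية أمس الإثنين، أنها اشترت من إسرائيل منظومة دفاع جوي من أربع",
  "explain" : true
}

If the Arabic tokens are still present in the output of kuromoji_tokenizer but disappear after kuromoji_part_of_speech, then it is the part-of-speech filter that removes them.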
The problem lies in the specific behavior of the kuromoji_part_of_speech token filter for Arabic.
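If that diagnosis holds, one possible workaround is to define a custom analyzer that reproduces the kuromoji chain but leaves out kuromoji_part_of_speech (or configures it with a narrower stoptags list), so the Arabic tokens are no longer discarded. A sketch, where the index name my-index and the analyzer name kuromoji_keep_arabic are placeholders:

PUT /my-index
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "kuromoji_keep_arabic" : {
          "type" : "custom",
          "tokenizer" : "kuromoji_tokenizer",
          "filter" : [
            "kuromoji_baseform",
            "cjk_width",
            "ja_stop",
            "kuromoji_stemmer",
            "lowercase"
          ]
        }
      }
    }
  }
}

The trade-off is that, without the part-of-speech filter, Japanese tokens that kuromoji normally drops by part of speech (particles, for example) will also be indexed.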