We have run into an issue when indexing and searching multilingual Japanese/Arabic documents with the kuromoji analyzer. The problem is that the analyzer filters out all Arabic tokens, while tokens in other scripts are preserved. For example, analyzing mixed Japanese and Arabic text:
GET /_analyze
{
  "analyzer" : "kuromoji",
  "text" : "医療用. أعلنت الحكومة التشيكية أمس الإثنين، أنها اشترت من إسرائيل منظومة دفاع جوي من أربع"
}

returns only the two Japanese tokens; every Arabic token is dropped:

{
  "tokens" : [
    {
      "token" : "医療",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "用",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    }
  ]
}
For comparison, the same Japanese text followed by a Russian sentence:

GET /_analyze
{
  "analyzer" : "kuromoji",
  "text" : "医療用. Чешское правительство в понедельник купило у Израиля систему обороны"
}

keeps both the Japanese and the Cyrillic tokens:

{
  "tokens" : [
    {
      "token" : "医療",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "用",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "чешское",
      "start_offset" : 5,
      "end_offset" : 12,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "правительство",
      "start_offset" : 13,
      "end_offset" : 26,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "в",
      "start_offset" : 27,
      "end_offset" : 28,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "понедельник",
      "start_offset" : 29,
      "end_offset" : 40,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "купило",
      "start_offset" : 41,
      "end_offset" : 47,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "у",
      "start_offset" : 48,
      "end_offset" : 49,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "израиля",
      "start_offset" : 50,
      "end_offset" : 57,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "систему",
      "start_offset" : 58,
      "end_offset" : 65,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "обороны",
      "start_offset" : 66,
      "end_offset" : 73,
      "type" : "word",
      "position" : 10
    }
  ]
}
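To see which component removes the Arabic tokens, the analyzer can be decomposed in the _analyze request into its tokenizer and token filters (the chain below follows the components documented for the built-in kuromoji analyzer) and called with "explain": true, which reports the token stream after the tokenizer and after each filter in turn. A minimal sketch:

GET /_analyze
{
  "tokenizer" : "kuromoji_tokenizer",
  "filter" : ["kuromoji_baseform", "kuromoji_part_of_speech", "cjk_width", "ja_stop", "kuromoji_stemmer", "lowercase"],
  "text" : "医療用. أعلنت الحكومة التشيكية أمس الإثنين، أنها اشترت من إسرائيل منظومة دفاع جوي من أربع",
  "explain" : true
}

If the Arabic tokens are still present in the output of kuromoji_tokenizer but disappear after kuromoji_part_of_speech, then it is the part-of-speech filter that removes them.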
The problem lies in the specific behavior of the kuromoji_part_of_speech token filter for Arabic.
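If that diagnosis holds, one possible workaround is to define a custom analyzer that reproduces the kuromoji chain but leaves out kuromoji_part_of_speech (or configures it with a narrower stoptags list), so the Arabic tokens are no longer discarded. A sketch, where the index name my-index and the analyzer name kuromoji_keep_arabic are placeholders:

PUT /my-index
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "kuromoji_keep_arabic" : {
          "type" : "custom",
          "tokenizer" : "kuromoji_tokenizer",
          "filter" : [
            "kuromoji_baseform",
            "cjk_width",
            "ja_stop",
            "kuromoji_stemmer",
            "lowercase"
          ]
        }
      }
    }
  }
}

The trade-off is that, without the part-of-speech filter, Japanese tokens that kuromoji normally drops by part of speech (particles, for example) will also be indexed.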