Possible Issue with Kuromoji Tokenization when English/Japanese are present

Bryan_Warner · June 23, 2017, 7:27pm

The text Dior化粧品等の輸入総代理店で is indexed with the default Kuromoji analyzer settings and produces the following tokens:

dior
start: 0 end: 4 pos: 0
化粧
start: 4 end: 6 pos: 1
品等
start: 6 end: 8 pos: 2
輸入
start: 9 end: 11 pos: 4
総
start: 11 end: 12 pos: 5
代理
start: 12 end: 14 pos: 6
店
start: 14 end: 15 pos: 7

However, we noticed that when a user searched for the term Dior化粧品, it did not produce a match (using the same analyzer settings). The reason is that the search term is tokenized as follows:

dior
start: 0 end: 4 pos: 0
化粧
start: 4 end: 6 pos: 1
品
start: 6 end: 7 pos: 2

Since 化粧品 is the Japanese word for cosmetics, the query seems to have been analyzed correctly, but the indexed text was split unexpectedly into 化粧 and 品等. The query token 品 therefore has no counterpart in the index.
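To make the mismatch concrete, here is a minimal Python sketch that compares the two token lists above. The token values and positions are copied from this post; the helper names `missing_terms` and `phrase_matches` are hypothetical, and the phrase check is a simplified stand-in, not Elasticsearch's actual matching logic. It also assumes the query requires every term to match (for example a phrase query or an AND operator), which the post does not state.

```python
# Tokens copied from the post as (term, position) pairs.
indexed_tokens = [
    ("dior", 0), ("化粧", 1), ("品等", 2), ("輸入", 4),
    ("総", 5), ("代理", 6), ("店", 7),
]
query_tokens = [("dior", 0), ("化粧", 1), ("品", 2)]


def missing_terms(indexed, query):
    """Return query terms that never occur among the indexed terms."""
    indexed_terms = {term for term, _ in indexed}
    return [term for term, _ in query if term not in indexed_terms]


def phrase_matches(indexed, query):
    """Simplified phrase check (an assumption, not the real engine):
    every query term must occur in the indexed tokens at the same
    relative position as in the query."""
    positions = {}
    for term, pos in indexed:
        positions.setdefault(term, set()).add(pos)
    first_term, first_pos = query[0]
    for start in positions.get(first_term, ()):
        if all((start + pos - first_pos) in positions.get(term, set())
               for term, pos in query):
            return True
    return False


print(missing_terms(indexed_tokens, query_tokens))   # ['品']
print(phrase_matches(indexed_tokens, query_tokens))  # False
```

Under these assumptions, dior and 化粧 line up at positions 0 and 1, but 品 is never emitted for the indexed text (only 品等 is), so any query that needs all three terms cannot match.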

I'm not sure whether this is a valid issue caused by the mix of English and Japanese in the text, or whether my Japanese fundamentals are off here.

