The text Dior化粧品等の輸入総代理店で is indexed with the default Kuromoji analyzer settings and produces the following tokens:
dior
start: 0 end: 4 pos: 0
化粧
start: 4 end: 6 pos: 1
品等
start: 6 end: 8 pos: 2
輸入
start: 9 end: 11 pos: 4
総
start: 11 end: 12 pos: 5
代理
start: 12 end: 14 pos: 6
店
start: 14 end: 15 pos: 7
However, we noticed that when a user searched for the term Dior化粧品, it did not produce a match (using the same analyzer settings). The reason is that the search term is tokenized as follows:
dior
start: 0 end: 4 pos: 0
化粧
start: 4 end: 6 pos: 1
品
start: 6 end: 7 pos: 2
Since 化粧品 is the Japanese word for cosmetics, it seems the query was analyzed correctly, but the indexed text produced an unexpected pair of tokens, 化粧 and 品等, so the query token 品 has nothing to match.
I'm not sure whether this is a valid issue caused by the mix of English and Japanese in the text, or whether my Japanese fundamentals are off here.
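For reference, here is a minimal sketch of why the search finds nothing. It uses only the two token streams quoted above and does not call Kuromoji itself. It assumes the query requires every term to be present (for example a phrase query or an AND operator), which this post does not actually specify. The helper names `missing_terms` and `matches_all_terms` are hypothetical, made up for illustration.

```python
# Token streams copied from the analyzer output quoted above,
# as (term, start_offset, end_offset, position).
indexed_tokens = [
    ("dior", 0, 4, 0),
    ("化粧", 4, 6, 1),
    ("品等", 6, 8, 2),
    ("輸入", 9, 11, 4),
    ("総", 11, 12, 5),
    ("代理", 12, 14, 6),
    ("店", 14, 15, 7),
]
query_tokens = [
    ("dior", 0, 4, 0),
    ("化粧", 4, 6, 1),
    ("品", 6, 7, 2),
]


def missing_terms(query, indexed):
    """Return the query terms that never occur as an exact term in the indexed stream."""
    indexed_terms = {term for term, _start, _end, _pos in indexed}
    return [term for term, _start, _end, _pos in query if term not in indexed_terms]


def matches_all_terms(query, indexed):
    """Simplified stand-in for an 'all terms required' match (an assumption, not the real scorer)."""
    return not missing_terms(query, indexed)


print(missing_terms(query_tokens, indexed_tokens))      # ['品']
print(matches_all_terms(query_tokens, indexed_tokens))  # False
```

The only query term without a counterpart is 品: the index holds 品等 as a single token, and term matching is exact, so 品 does not match it.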