Elasticsearch version: 5.1.1
I am comparing match query results for ASCII text and multibyte text to decide which analyzer to use for our upcoming deployment. I would like to know whether the behavior below is the default. Both cases use the whitespace analyzer.
When plain ASCII text is indexed, a match query returns the documents whose indexed tokens match the search terms. So if I index
"This is a pen"
it is analyzed as follows:
{
  "tokens": [
    {
      "token": "This",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 5,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "a",
      "start_offset": 8,
      "end_offset": 9,
      "type": "word",
      "position": 2
    },
    {
      "token": "pen",
      "start_offset": 10,
      "end_offset": 13,
      "type": "word",
      "position": 3
    }
  ]
}
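As a rough illustration of the token list above, here is a minimal Python sketch that splits on runs of whitespace and records offsets and positions. It is only an approximation written for this post, not Elasticsearch or Lucene code; the function name `whitespace_tokens` is hypothetical, and Python's notion of whitespace may differ slightly from the one the real tokenizer uses.

```python
import re


def whitespace_tokens(text):
    """Approximate a whitespace tokenizer: one token per run of non-whitespace.

    Hypothetical helper for illustration only; not part of any Elasticsearch client.
    """
    return [
        {
            "token": match.group(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        }
        for position, match in enumerate(re.finditer(r"\S+", text))
    ]


if __name__ == "__main__":
    for token in whitespace_tokens("This is a pen"):
        print(token)
```

Note that the tokens keep their original case ("This", not "this"), which matches the output shown above.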
I can query it as follows, and the document is returned:
GET sample2/sample/_search
{
  "query" : {
    "match" : {
      "text" : "This"
    }
  }
}
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "sample2",
        "_type": "sample",
        "_id": "AVk6XNwSbyuCzhs6qfpG",
        "_score": 0.2876821,
        "_source": {
          "text": "This is a pen"
        }
      }
    ]
  }
}
But when I index Japanese multibyte text such as
関西国際空港に着陸しました。
the query seems to behave as a partial match by default:
GET sample/sample/_search
{
  "query" : {
    "match" : {
      "text" : "関西"
    }
  }
}
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.98382175,
    "hits": [
      {
        "_index": "sample",
        "_type": "sample",
        "_id": "AVk6WrlobyuCzhs6qfpE",
        "_score": 0.98382175,
        "_source": {
          "text": "関西国際空港に着陸しました。"
        }
      }
    ]
  }
}
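To make the puzzle concrete, here is a small Python sketch comparing two simplified tokenization strategies on the Japanese sentence. Everything in it is an assumption for illustration: `whitespace_terms` approximates splitting on whitespace only, `per_character_terms` is a hypothetical stand-in for an analyzer that emits one term per character, and `would_match` mimics a match query that hits when any query term equals any document term. None of this is Elasticsearch code, and it does not establish which analyzer the `sample` index actually applies.

```python
def whitespace_terms(text):
    # Approximation of whitespace-only tokenization: split on runs of whitespace.
    return text.split()


def per_character_terms(text):
    # Hypothetical stand-in for an analyzer that emits one term per character.
    return [ch for ch in text if not ch.isspace()]


def would_match(query, document, tokenize):
    # Simplified match semantics: a hit if any query term equals any document term.
    document_terms = set(tokenize(document))
    return any(term in document_terms for term in tokenize(query))


if __name__ == "__main__":
    doc = "関西国際空港に着陸しました。"
    print(would_match("関西", doc, whitespace_terms))
    print(would_match("関西", doc, per_character_terms))
```

Under these assumptions, whitespace-only splitting leaves the whole sentence as a single term, so the query term 関西 does not equal it, while per-character splitting produces a hit. That would be consistent with the `sample` index analyzing the field with something other than whitespace splitting, but this is a guess and the index mapping would need to be checked.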
Is this the default behavior for multibyte characters in general, or specifically for Japanese?