You are using the edge_ngram token filter. Let's see how your analyzer treats your query string "ho". Assuming your index is called my_index:
GET my_index/_analyze
{
  "text": "ho",
  "analyzer": "autocomplete"
}
The response shows you that the output of your analyzer would be two tokens at position 0:
{
  "tokens": [
    {
      "token": "h",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "ho",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    }
  ]
}
What does Elasticsearch do with a query that has two tokens at the same position? It treats the query as an "OR", even if you use type "phrase". You can see this in the output of the validate API, which shows you the Lucene query that your query is rewritten into:
GET my_index/_validate/query?rewrite=true
{
  "query": {
    "match": {
      "name": {
        "query": "ho",
        "type": "phrase"
      }
    }
  }
}
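For reference, the response to that validate request should contain an explanation along these lines (the exact response shape varies by Elasticsearch version; fields unrelated to the explanation are omitted here):

{
  "valid": true,
  "explanations": [
    {
      "index": "my_index",
      "valid": true,
      "explanation": "name:\"(h ho)\""
    }
  ]
}

The (h ho) part is Lucene's notation for a phrase query with two alternative terms at the same position, i.e. an OR between h and ho.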
Because both your query and your document have an h at position 0, the document is going to be a hit.
Now, how to solve this? Instead of the edge_ngram token filter, you could use the edge_ngram tokenizer. This tokenizer increments the position of every token it outputs.
So, if you create your index like this instead:
PUT my_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "autocomplete_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "autocomplete"
        }
      }
    }
  }
}
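You can confirm the difference by running the same _analyze request against the new index. Since the edge_ngram tokenizer increments positions, the two tokens now come out at consecutive positions (response abbreviated to the relevant fields):

GET my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "ho"
}

{
  "tokens": [
    { "token": "h",  "position": 0, ... },
    { "token": "ho", "position": 1, ... }
  ]
}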
You will see that this query is no longer a hit:
GET my_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "ho",
        "type": "phrase"
      }
    }
  }
}
But for example this one is:
GET my_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "he",
        "type": "phrase"
      }
    }
  }
}
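This is exactly the phrase behaviour you want. Assuming your document's name value is something like "hello" (I'm guessing at the actual value here), the tokenizer indexes it as h, he, hel, hell, hello at positions 0 through 4, which you can check with:

GET my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "hello"
}

The query "he" is analyzed into the phrase "h he" (positions 0 and 1), which lines up with the document's first two tokens, so it matches. The query "ho" becomes the phrase "h ho", and since the document has no "ho" token at position 1, it does not match.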