Problem Context
I am trying to use Elasticsearch to implement user search on a platform. However, searches are taking too long: usually around 100 ms to 200 ms. I am trying to get closer to 40 ms at most.
My cluster has 4 data nodes and 3 dedicated master nodes.
I created my users index (around 60 million records) with the default number of shards (5), 1 replica, and the following settings and mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "name_ar_analyzer": {
          "tokenizer": "standard",
          "filter": ["arabic_normalization", "trim"]
        },
        "name_full_ar_analyzer": {
          "tokenizer": "keyword",
          "filter": ["arabic_normalization", "trim"]
        },
        "name_en_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "trim"]
        },
        "name_full_en_analyzer": {
          "tokenizer": "keyword",
          "filter": ["lowercase", "asciifolding", "trim"]
        }
      }
    }
  },
  "mappings": {
    "dynamic": false,
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "full": {
            "type": "text",
            "analyzer": "name_full_en_analyzer"
          }
        },
        "analyzer": "name_en_analyzer"
      },
      "picture": {
        "type": "text",
        "index": false
      },
      "@timestamp": {
        "type": "date"
      }
    }
  }
}
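To make the difference between the two English analyzers concrete, here is a rough Python sketch of the tokens each one should produce. It is only an approximation and not Elasticsearch code: the standard tokenizer is imitated with a word-character regex, asciifolding is imitated by stripping combining accents, and the function names are made up to mirror the analyzer names in the settings.

```python
import re
import unicodedata
from typing import List


def ascii_fold(text: str) -> str:
    """Rough stand-in for asciifolding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_en_analyzer(text: str) -> List[str]:
    """Approximates: standard tokenizer + lowercase + asciifolding.

    The trim filter has nothing to do here because the tokens
    carry no surrounding whitespace.
    """
    return [ascii_fold(tok.lower()) for tok in re.findall(r"\w+", text)]


def name_full_en_analyzer(text: str) -> List[str]:
    """Approximates: keyword tokenizer (whole input is one token)
    + lowercase + asciifolding + trim."""
    return [ascii_fold(text.lower()).strip()]


print(name_en_analyzer("Jos\u00e9  Doe"))     # ['jose', 'doe']
print(name_full_en_analyzer(" John Doe "))   # ['john doe']
```

So "name" is searchable term by term, while "name.full" holds a single normalized token for the whole name, which is what the exact-match clause in the query relies on.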
I am using this search query:
{
  "query": {
    "bool": {
      "must": [
        {
          "constant_score": {
            "filter": {
              "match": {
                "name": {
                  "query": "John Do",
                  "fuzziness": "auto"
                }
              }
            },
            "boost": 1
          }
        }
      ],
      "should": [
        {
          "constant_score": {
            "filter": {
              "match": {
                "name.full": {
                  "query": "John Do"
                }
              }
            },
            "boost": 10
          }
        },
        {
          "constant_score": {
            "filter": {
              "match": {
                "name": {
                  "query": "John Do",
                  "fuzziness": "auto",
                  "operator": "and"
                }
              }
            },
            "boost": 5
          }
        },
        {
          "constant_score": {
            "filter": {
              "match_phrase_prefix": {
                "name": {
                  "query": "John Do",
                  "slop": 1,
                  "max_expansions": 5
                }
              }
            },
            "boost": 10
          }
        }
      ]
    }
  }
}
The reasoning behind this query is as follows:
- Everything is wrapped in "constant_score" to bypass TF-IDF, because when you search for "john", intuitively it should match people whose account names are "John", not "John John John".
- The first should clause boosts an exact match on the name.full field, because for the query "John do", "John do" is a better match than results named "john John" or "John".
- The second should clause is there because, for the query "John Doe", documents matching both terms are more important than documents matching only one of them, like "John smith" and "Jane doe".
- The third should clause (match_phrase_prefix) allows us to get results like "John Doe" for the query "John D" or "John Do".
So basically, full exact matches and matched phrases (even if incomplete) are more important than random documents that contain one matching term multiple times.
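The ranking this is meant to produce can be sketched in plain Python. Since every clause is a constant_score inside a bool query, a document's score is the sum of the boosts of the clauses it matches. The sketch below is a simplification under stated assumptions: it ignores fuzziness, slop, and max_expansions, uses whitespace splitting instead of the real analyzers, and the helper names are hypothetical.

```python
def tokens(text):
    """Simplified analysis: lowercase and split on whitespace."""
    return text.lower().split()


def phrase_prefix(q, n):
    """True if the query terms appear consecutively in the name,
    with the last query term matching only as a prefix."""
    for i in range(len(n) - len(q) + 1):
        window = n[i:i + len(q)]
        if window[:-1] == q[:-1] and window[-1].startswith(q[-1]):
            return True
    return False


def score(query, name):
    """Sum of the boosts of the matching clauses (0 = not returned)."""
    q, n = tokens(query), tokens(name)
    # must: match on "name" with the default OR operator (boost 1)
    if not any(t in n for t in q):
        return 0
    total = 1
    # should 1: whole-name match on "name.full" (boost 10)
    if query.lower().strip() == name.lower().strip():
        total += 10
    # should 2: all query terms present, operator "and" (boost 5)
    if all(t in n for t in q):
        total += 5
    # should 3: match_phrase_prefix on "name" (boost 10)
    if phrase_prefix(q, n):
        total += 10
    return total


for candidate in ["John Do", "John Doe", "Do John", "John John", "Jane Smith"]:
    print(candidate, score("John Do", candidate))
# John Do 26, John Doe 11, Do John 6, John John 1, Jane Smith 0
```

With fuzziness enabled, the real query would also let near-misses of "john" through the must clause, so the actual result set is wider than this sketch suggests.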
Any advice?
- Is the query itself a reason for the slow performance? Can it be written in a better way that still achieves the desired results?
- Is there anything in the index settings that should be changed for better performance?
- Is the cluster suited to such a performance expectation?
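For reference, here is the arithmetic behind the layout described above, as a small Python sketch. It uses only the figures from this post and assumes documents are spread evenly across shards, which is an approximation.

```python
docs = 60_000_000   # records in the users index
primaries = 5       # default number of primary shards
replicas = 1        # replica copies per primary
data_nodes = 4

docs_per_primary = docs // primaries           # documents in each primary shard
shard_copies = primaries * (1 + replicas)      # primaries plus replicas
copies_per_node = shard_copies / data_nodes    # average shard copies per data node

print(docs_per_primary)   # 12000000
print(shard_copies)       # 10
print(copies_per_node)    # 2.5
```

Each search fans out to one copy of each of the 5 shards, so a single query works through roughly 12 million documents per shard.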