Elasticsearch performance issues

Hi team
I have a production use case with a small index: a few hundred megabytes, around 10k docs. The setup has a single shard and many replicas.
The current bottleneck seems to be the fetch phase: the query itself executes in about 10-15ms, but the fetch phase (the _id translation) adds another ~30ms. When I run with profile: true, I can see that the fetch phase carries much higher overhead than the query phase. Any idea how to optimize this further?
I already have _source: false and don't request any stored_fields or fields. I just want to retrieve the docs that match the query. As soon as the number of matching docs reaches the 4k-8k range, the latency overhead is ~30ms even though the Lucene query execution is < 10ms. This is the overall Elasticsearch latency; the coordinating node is on the same instance as the data node, so we are not adding extra network hops, and our queries also use preference=_local.
Any ideas or recommendations to bring the latency down to close to raw Lucene query speed?
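For illustration, a query-only baseline can be approximated by running the same filters with "size": 0 (no hits get fetched) and comparing its took against the full-size request; a rough sketch with an abbreviated version of our filters:

POST /index-name/_search?request_cache=false&preference=_local
{
  "query": {
    "bool": {
      "filter": [
        { "terms": { "format_id": [4] } },
        { "range": { "start_time_ms": { "lte": 1752713100000 } } }
      ]
    }
  },
  "size": 0,
  "_source": false
}

The gap between the two took values is roughly the fetch-phase overhead I am describing.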

Hello @prateekp

Welcome to the community!!

Could you please share the query that is being executed? This will help the community review it and suggest whether it can be improved.

Also, one option is to use the Search Profiler in Kibana Dev Tools to understand which part of the query is taking the most time, and to see how that part can be improved.
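For reference, a minimal profiled request looks roughly like this (my-index and the match_all query are just placeholders to replace with your own):

POST /my-index/_search
{
  "profile": true,
  "query": {
    "match_all": {}
  }
}

The per-shard breakdown in the response (which the Search Profiler visualizes) shows where the time is being spent.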

Thanks!!

The query is an internal query, and the question is not about Lucene query optimization. I am already doing what you suggested and confirmed that the Lucene part is not the bottleneck; the fetch phase is the bottleneck. It is also not about the UI you showed, which is for detailed profiling of the Lucene query, although I am using the same Kibana console to get the metrics broken down.

Example query:

POST /index-name/_search?request_cache=false&preference=_local
{
  "profile": true,
  "query": {
    "bool": {
      "filter": [
        { "terms": { "format_id": [4] } },
        { "terms": { "d_enum_ids": [1, 2] } },
        { "terms": { "s_enum_ids": [1, 5] } },
        { "terms": { "d_ids": [1, 2] } },
        { "range": { "start_time_ms": { "lte": 1752713100000 } } },
        { "range": { "end_time_ms": { "gte": 1752713100000 } } }
      ],
      "must_not": [
        {
          "bool": {
            "filter": [
              { "terms": { "status_id": [2, 3] } },
              { "range": { "effective_timestamp_ms": { "gte": 1752713100000 } } }
            ]
          }
        }
      ]
    }
  },
  "size": 4000,
  "sort": [
    { "priority_id": { "order": "asc" } },
    { "score": { "order": "desc" } }
  ],
  "_source": false
}

So you have 1 primary shard and a number of replicas that ensures all data nodes have a copy. Is that correct?

Is that index being actively updated or modified? If so, can you try querying while pausing any modifications and see if that makes any difference?

Is this the only index in the cluster? If not, how much data does each data node hold and how much RAM and heap does each data node have?

If I remember correctly, the _local preference only applies to the node that receives the request, not to multiple nodes located on the same host. I would therefore recommend trying to send requests directly to the data nodes and see if this makes a difference.

Could you also elaborate a bit on the why, so I understand the reasoning.
Yes, we have 1 shard and many replicas to handle high QPS. We do have some writes, but those are not really relevant here and can't be changed, given our update and freshness requirements. The question is not the latency of the (Lucene) query but the latency of the fetch phase, the _id translation done after the query completes.
The above recommendations are good for overall optimization, but I want to focus on a very specific fetch phase optimization, not general guidelines on how to make a query performant.

By the way, most of my testing and performance work is against a somewhat static snapshot; it is being updated, but with a very low write volume.

Which version of Elasticsearch are you using?

Did sending queries directly to data nodes make any difference?

We are using 8.6, I think. In our case the coordinating and data roles reside on the same instance, so yes, we are sending queries directly to the data nodes. Or do you mean something else?

Do you have dedicated coordinating nodes that do not hold any data? If you do, try bypassing these and send requests directly to the data nodes.

No, we removed the dedicated coordinating nodes; coordination happens on the data nodes themselves, and we use preference=_local to avoid any network hops, so we do send queries directly to the data nodes.

The, perhaps unintended, implication is that the fetch cost isn't growing as you'd expect. Is that correct?

Can you test and share how the query/fetch times compare as you grow the size parameter? Maybe run tests with size in the 1k-10k range, in steps of 100 or 250, and graph the results; something like the sketch below.
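A rough sketch, run once per size value with everything else held constant (filters abbreviated from your query above):

POST /index-name/_search?request_cache=false&preference=_local
{
  "profile": true,
  "query": {
    "bool": {
      "filter": [
        { "terms": { "format_id": [4] } },
        { "range": { "start_time_ms": { "lte": 1752713100000 } } }
      ]
    }
  },
  "size": 1000,
  "_source": false
}

Plot took, or the profiled query vs fetch times, against size for each run.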

Not really; maybe I oversimplified. For smaller topK, the Lucene query is the dominating factor, so our latencies are within 20ms, with 8-12ms being Lucene. But when I sweep over queries where I increase topK, I do see an increase (close to sublinear, or sometimes linear when I disable caching).