Profiler - How to Identify Bottlenecks when Total Search Time is Bigger than Shard Times

Michael_Sander · March 15, 2021, 9:51pm

Hi,

I am running a complex search that takes a while to complete, and I am trying to figure out why. I ran the search through the profiler, and it looks like the total search time is much greater than the individual index times.

My setup:

Elasticsearch 7.10.1
The cluster has 40 nodes, each with an 8-core CPU and 64 GB of memory, 31 GB of which is devoted to heap.
Roughly 40 shards... most nodes only have a single shard.
The shards are big, ranging from 20GB to 60GB.
Searches tend to hit all of the shards across all nodes.
In the profiler output, the shard searches are relatively fast (<1 s), but the overall search is slow (>10 s).

A few questions:

What exactly is the cumulative time? Is it all of the individual times summed up, or does it represent the total time of the search?
Why would the total time be so high when the individual times are so low?
How does one debug slow searches (>10 s) when the individual shard searches are fast (<1 s)?
Possible solution: Am I limiting myself in any way by having a single shard on a node? Would splitting up the shards so there are more shards per node increase parallelism in any way?

Christian_Dahlqvist · March 18, 2021, 6:05am

What load was the cluster under when you ran the query you profiled? Is it possible that the search queue was building up on some of the nodes in the cluster?
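One way to check for a search queue building up is the cat thread pool API (`GET _cat/thread_pool/search`), which reports active, queued, and rejected tasks per node. The following is a minimal sketch, assuming the cluster is reachable over plain HTTP without authentication; the helper names `fetch_search_pool` and `nodes_with_backlog` and the base URL are hypothetical.

```python
import json
import urllib.request


def fetch_search_pool(base_url):
    """Fetch search thread pool stats for every node as a list of dicts.

    base_url is an assumption, e.g. "http://localhost:9200".
    """
    url = (base_url
           + "/_cat/thread_pool/search"
           + "?format=json&h=node_name,active,queue,rejected")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def nodes_with_backlog(rows):
    """Return (node_name, queue, rejected) for nodes with queued or rejected searches."""
    flagged = []
    for row in rows:
        # The cat API returns every value as a string, even with format=json.
        queue = int(row["queue"])
        rejected = int(row["rejected"])
        if queue > 0 or rejected > 0:
            flagged.append((row["node_name"], queue, rejected))
    # Largest queues first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

Calling `nodes_with_backlog(fetch_search_pool("http://localhost:9200"))` while the slow query is running shows whether any node has searches waiting. Note that `queue` is a point-in-time value while `rejected` is a running count, so it is worth sampling a few times.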

Each shard queried is processed in a single thread, so the shard size affects latency and the number of shards determines the level of parallelism possible. If you send just a single query, or possibly a few queries in parallel, to the cluster, increasing the number of shards could result in faster performance, as more threads could be involved. This, however, assumes that CPU and not disk performance is the limiting factor.
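The threading point above can be made concrete with a small back-of-the-envelope model. This is only an idealized sketch: it assumes one thread per shard per query, uses the documented 7.x default size of the `search` thread pool, and ignores disk, network, and coordination costs. The function names are hypothetical.

```python
def search_pool_size(allocated_processors):
    """Default size of the fixed `search` thread pool in 7.x:
    int((allocated processors * 3) / 2) + 1."""
    return (allocated_processors * 3) // 2 + 1


def busy_search_threads(shards_on_node, concurrent_queries, pool_size):
    """Upper bound on search threads a node can keep busy at once,
    assuming each query uses one thread per local shard."""
    return min(shards_on_node * concurrent_queries, pool_size)
```

For the 8-core nodes described in the question, `search_pool_size(8)` gives 13. With one shard per node and a single query in flight, `busy_search_threads(1, 1, 13)` is 1, so most of the pool sits idle; with four shards per node it would be 4. Whether the extra threads actually help depends on CPU being the limiting factor.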

It is, however, important to optimize while running the level of query concurrency you expect in production, as the optimum might change if requests start queueing up on the nodes due to high concurrency, or if other bottlenecks are encountered.
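To see how much of the request time the per-shard numbers account for, the `took` value of a search run with `"profile": true` can be compared against the slowest profiled shard. The sketch below assumes the 7.x profile response layout (`profile.shards[].searches[].query[].time_in_nanos` and `rewrite_time`); the function name `summarize_profile` is hypothetical, and collector times are left out for simplicity.

```python
def summarize_profile(response):
    """Compare overall `took` with the profiled query time of each shard.

    `response` is the parsed JSON body of a search run with "profile": true.
    """
    took_ms = response["took"]
    shard_ms = {}
    for shard in response["profile"]["shards"]:
        nanos = 0
        for search in shard["searches"]:
            nanos += search.get("rewrite_time", 0)
            # Top-level query entries already include their children's time.
            nanos += sum(q["time_in_nanos"] for q in search["query"])
        shard_ms[shard["id"]] = nanos / 1_000_000
    slowest = max(shard_ms.values(), default=0.0)
    return {
        "took_ms": took_ms,
        "slowest_shard_ms": slowest,
        # Time not explained by the slowest shard's profiled query work.
        "unaccounted_ms": took_ms - slowest,
        "shard_ms": shard_ms,
    }
```

A large `unaccounted_ms` suggests the time is going somewhere the profiler does not measure. According to the Profile API documentation for 7.x, that includes network latency, the fetch phase, time spent waiting in queues, and merging shard responses on the coordinating node.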

