Profiler - How to Identify Bottlenecks when Total Search Time is Bigger than Shard Times

Hi,

I am running a complex search that takes a while to complete, and I am trying to figure out why. I ran the search through the profiler, and it looks like the total search time is much greater than the individual index times.

My setup:

  • Elasticsearch 7.10.1
  • Cluster has 40 nodes, each with an 8-core CPU and 64 GB of memory, with 31 GB devoted to heap.
  • Roughly 40 shards... most nodes only have a single shard.
  • The shards are big, ranging from 20 GB to 60 GB.
  • Searches tend to hit all of the shards across all nodes.
  • In the image below, the shard searches are relatively fast (<1 s), but the overall search is slow (>10 s).

A few questions:

  • What exactly is the cumulative time? Is it all of the individual times summed up, or does it represent the total time of the search?
  • Why would the total time be so high when the individual times are so low?
  • How does one debug slow searches (>10 s) when the individual shard searches are fast (<1 s)?
  • Possible solution: Am I limiting myself in any way by having a single shard on a node? Would splitting up the shards so there are more shards per node increase parallelism in any way?

What load is the cluster under when you run the query you profiled? Is it possible the search queue was building up on some of the nodes in the cluster?
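
One way to check for queueing is to look at the search thread pool on every node while the slow search is running. The sketch below uses the `_cat/thread_pool/search` API with JSON output. The base URL (`http://localhost:9200`, no authentication) and the helper names `fetch_search_pool` and `nodes_with_backlog` are assumptions for illustration; adjust them for your cluster.

```python
import json
import urllib.request

# Assumed address of one node in the cluster; replace with your own.
ES_URL = "http://localhost:9200"


def fetch_search_pool(base_url=ES_URL):
    """Return one row per node describing its search thread pool."""
    url = (base_url + "/_cat/thread_pool/search"
           "?format=json&h=node_name,active,queue,rejected")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def nodes_with_backlog(rows):
    """Keep only nodes that have queued or rejected search tasks."""
    busy = []
    for row in rows:
        # _cat APIs return numbers as strings, so convert them first.
        queue = int(row["queue"])
        rejected = int(row["rejected"])
        if queue > 0 or rejected > 0:
            busy.append((row["node_name"], queue, rejected))
    return busy
```

Calling `nodes_with_backlog(fetch_search_pool())` a few times during the slow search should show whether any node has a non-zero queue. An empty list on every call would suggest the time is being lost somewhere other than the search queue.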

Each shard is searched by a single thread, so shard size affects latency and the number of shards determines how much parallelism is possible. If you send only a single query, or a few queries in parallel, to the cluster, increasing the number of shards may improve performance because more threads can be involved. This assumes, however, that CPU and not disk performance is the limiting factor.
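
As a rough illustration of the parallelism point, the toy calculation below assumes that each (query, shard) pair occupies exactly one thread and ignores disk, network and coordination costs. The function name is made up for this example, and the numbers in the usage note come from the setup described in the question (8 cores per node, one shard per node).

```python
def cores_used_per_node(shards_per_node, concurrent_queries, cores):
    """Upper bound on the cores one node can keep busy with shard searches.

    Simplifying assumption: one thread per shard per query, and the
    node can never use more cores than it has.
    """
    wanted = shards_per_node * concurrent_queries
    return min(wanted, cores)
```

With a single query and one shard per node, `cores_used_per_node(1, 1, 8)` gives 1, so seven of the eight cores sit idle for that query. With four shards per node, `cores_used_per_node(4, 1, 8)` gives 4. Once several queries run at the same time the cores fill up anyway, which is why the benefit depends on your concurrency.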

It is important, however, to optimize while running the level of query concurrency you expect in production, as the optimum may change if requests start queueing on the nodes because of high concurrency or if other bottlenecks are encountered.
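
To see how much of the total time the profiler does not account for, one could compare the `took` value of the search response with the slowest shard in the `profile` section. As far as I know, the profile output does not include time spent waiting in queues, network latency, the fetch phase or merging shard responses on the coordinating node, so a large gap points at one of those. The sketch below assumes `response` is the parsed JSON of a search sent with `"profile": true`; the function names are illustrative, and only query times are summed (collector and aggregation times are left out).

```python
def slowest_shard_ms(response):
    """Largest per-shard query time found in the profile output, in ms."""
    slowest = 0.0
    for shard in response["profile"]["shards"]:
        nanos = 0
        for search in shard["searches"]:
            # Top-level query entries already include their children.
            nanos += sum(q["time_in_nanos"] for q in search["query"])
        slowest = max(slowest, nanos / 1e6)
    return slowest


def unprofiled_ms(response):
    """Part of `took` (ms) not explained by the slowest shard's query time."""
    return response["took"] - slowest_shard_ms(response)
```

Because shards are searched in parallel, the slowest shard rather than the sum of all shards is the relevant comparison. This is only an approximation, but a result of several seconds from `unprofiled_ms` would suggest looking outside the query phase.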