ES 8.8.2 high query latency

I am seeing degraded query latency on v8. We are upgrading our cluster from 7.16.2 to 8.8.2 by standing up a duplicate cluster on the new version and reindexing the data into it. Latency is 500 ms to several seconds on v8 versus 20-200 ms on v7. To compare query performance, a subset of the queries that the v7 cluster receives is also executed on the v8 cluster.

Why is the query latency so much higher in the v8 cluster? Most of the slow queries are not reproducible (i.e. the latency does not persist on a subsequent run of the same query unless we clear the cache).

Some useful details about the clusters:

  1. 100+ indices, ranging in size from a few MB to tens of TB, with shard counts (number_of_shards) ranging from 5 to 2048 and 1 replica.

  2. Each v8 index has at least as many shards as its v7 counterpart, so that every shard in the v8 cluster is ~20 GB.

  3. Our ES use case is storage-bound: CPU and memory load on the ES nodes is low, and we don't see any load difference between the two clusters.

  4. All custom cluster settings are the same in both clusters, except that xpack.security.enabled is explicitly set to true in the v8 cluster (it is not explicitly set in the v7 cluster, but the default value is true).

  5. The following v8 defaults are overridden to match the v7 cluster:

action.destructive_requires_name: false
cluster.routing.allocation.enforce_default_tier_preference: false
cluster.routing.allocation.type: balanced
http.max_header_size: 8kb
indices.query.bool.max_clause_count: 1024
indices.query.bool.max_nested_depth: 20
search.max_async_search_response_size: -1b
thread_pool.get.size: 8
thread_pool.snapshot.max: 4
transport.compress: FALSE
transport.compression_scheme: DEFLATE
  6. Both clusters have cluster.routing.allocation.awareness.attributes: zone, and the v8 cluster sets es.search.ignore_awareness_attributes=false to override the default of true.

  7. Some v7 indices use the default index.codec and some use best_compression, whereas all v8 indices use best_compression.

  8. We've added index sorting on the creation-time field to the v8 indices. Some of the v7 indices did not have this sorting.

  9. Not sure whether this is a clue or just noise, but we're seeing significantly higher network traffic (bytes sent/received are 3-5x higher) in the v8 cluster, and more cross-AZ traffic than in the v7 cluster.
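One way to double-check that no setting differs unexpectedly between the two clusters is to diff their effective settings. Below is a minimal sketch, assuming each response was fetched with `GET _cluster/settings?include_defaults=true&flat_settings=true`. The function names and the sample values are illustrative only, not output from our clusters.

```python
def effective_settings(response):
    """Merge the sections of a flat cluster-settings response into one map."""
    merged = {}
    # Later sections win: defaults < persistent < transient.
    for section in ("defaults", "persistent", "transient"):
        merged.update(response.get(section, {}))
    return merged


def diff_settings(v7, v8):
    """Return {key: (v7_value, v8_value)} for every key whose values differ."""
    keys = set(v7) | set(v8)
    return {
        k: (v7.get(k), v8.get(k))
        for k in sorted(keys)
        if v7.get(k) != v8.get(k)
    }


# Illustrative sample data (hypothetical values, not real cluster output).
v7_resp = {
    "defaults": {"http.max_header_size": "8kb", "thread_pool.get.size": "8"},
    "persistent": {},
}
v8_resp = {
    "defaults": {"http.max_header_size": "16kb", "thread_pool.get.size": "8"},
    "persistent": {},
}

changed = diff_settings(effective_settings(v7_resp), effective_settings(v8_resp))
for key, (old, new) in changed.items():
    print(f"{key}: v7={old} v8={new}")
```

Any key that shows up in the diff and is not in the list above would be worth a closer look.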

I can provide any extra details that would help the investigation.

A few questions:

  1. What happens if you don't overwrite the default cluster settings in v8, any difference?
  2. What does the Search Profile API show when you run the same query on both clusters? Anything stand out as being different?
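For the profile comparison, it can help to reduce each response to a per-shard summary before looking at individual query nodes. A rough sketch, assuming the search was sent with `"profile": true` in the request body and the response was parsed into a dict; `summarize_profile` and `slowest_shards` are hypothetical helper names, and the sample response is made up for illustration.

```python
def summarize_profile(response):
    """Return {shard_id: {"query_ms", "rewrite_ms", "collect_ms"}} from a profiled search."""
    summary = {}
    for shard in response.get("profile", {}).get("shards", []):
        query_ns = rewrite_ns = collect_ns = 0
        for search in shard.get("searches", []):
            # A top-level time_in_nanos already includes the time of its children.
            query_ns += sum(q["time_in_nanos"] for q in search.get("query", []))
            rewrite_ns += search.get("rewrite_time", 0)
            collect_ns += sum(c["time_in_nanos"] for c in search.get("collector", []))
        summary[shard["id"]] = {
            "query_ms": query_ns / 1e6,
            "rewrite_ms": rewrite_ns / 1e6,
            "collect_ms": collect_ns / 1e6,
        }
    return summary


def slowest_shards(summary, n=3):
    """Return the n shard ids with the highest query time, slowest first."""
    return sorted(summary, key=lambda s: summary[s]["query_ms"], reverse=True)[:n]


# Made-up sample response, trimmed to the fields used above.
sample_response = {
    "profile": {
        "shards": [
            {
                "id": "[node-1][my-index][0]",
                "searches": [
                    {
                        "query": [
                            {
                                "type": "BooleanQuery",
                                "time_in_nanos": 4_000_000,
                                "children": [
                                    {"type": "TermQuery", "time_in_nanos": 1_500_000}
                                ],
                            }
                        ],
                        "rewrite_time": 500_000,
                        "collector": [
                            {"name": "SimpleTopScoreDocCollector", "time_in_nanos": 1_000_000}
                        ],
                    }
                ],
            }
        ]
    }
}

summary = summarize_profile(sample_response)
print(summary)
print(slowest_shards(summary))
```

Running this against the same query on both clusters and comparing the slowest shards side by side should show whether the extra time is in query execution, rewriting, or collection.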

For #2: when I run the same query on both clusters, both are equally fast (under 30 ms), I think because both hit the cache. When I manually clear all caches on the indices in both clusters, both are slow (sometimes over 5 s). Some queries run faster on ES7 and some run faster on ES8, but when I look at the overall latency metrics I can see an obvious latency increase.
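Since single queries are dominated by whether they hit a cache, comparing latency distributions over the same set of queries is probably more telling than comparing individual runs. A small sketch of that comparison, assuming the `took` value (milliseconds) from each search response has been collected into a list per cluster; the helper names and the sample numbers are made up for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def compare_took(v7_took, v8_took, pcts=(50, 95, 99)):
    """Return {pct: (v7_value, v8_value)} for the requested percentiles."""
    return {p: (percentile(v7_took, p), percentile(v8_took, p)) for p in pcts}


# Hypothetical `took` samples in milliseconds, not measurements from our clusters.
v7_took = [20, 25, 30, 40, 200]
v8_took = [25, 30, 35, 500, 3000]

report = compare_took(v7_took, v8_took)
for p, (v7_ms, v8_ms) in report.items():
    print(f"p{p}: v7={v7_ms} ms, v8={v8_ms} ms")
```

If the medians are close and only the tail differs, that would point at cold-cache reads rather than at the queries themselves.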

  1. What happens if you don't overwrite the default cluster settings in v8, any difference?

We initially did not override any default settings in v8 and saw the large latency difference. That is why we then changed every setting we could to the value used in our live v7 cluster.

  2. What does the Search Profile API show when you run the same query on both clusters? Anything stand out as being different?

Longsen has already answered this question above. Let us know if you need more detail.
