Elasticsearch cluster search performance degraded after upgrade from 7.17 to 8.8

Hello Elasticsearch Community,

We recently did an in-place upgrade from 7.17 to 8.8, after which we started to see degraded search performance/latency. We have 150 data nodes, and we observed that at most 10 of them have CPU usage near 100% while the remaining data nodes sit below 10% utilization. We are trying to understand why only a few nodes are under high load.
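For reference, the per-node CPU skew and the shard/disk spread can be seen with the cat APIs, for example:

GET _cat/nodes?v&h=name,cpu,load_1m,heap.percent&s=cpu:desc
GET _cat/allocation?v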

Current Cluster Details:

150 data nodes (25 CPU cores each), 10 coordinating nodes, and 5 master nodes

JVM args: -Xms30g -Xmx30g; node memory: 64 GiB

Running JVM 17.0.7 and Elasticsearch 8.8.0

Indices: 66 indices; the number is constant. We don't create new indices at all; instead we create/overwrite documents in the existing indices.

Shards: 2,222

Size: 11.54 TB

Docs: 20 billion

Our observations

  1. We checked the Elasticsearch hot threads and observed that all of the CPU threads are being used for search. Our search queries remained the same before and after the upgrade.

  2. We observed that index segment merges started to increase after the upgrade completed. We don't have any change in our write volume that would trigger more segment merges.
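For reference, hot threads and merge activity can be sampled roughly like this (the thread count and column selection are arbitrary examples):

GET _nodes/hot_threads?threads=3
GET _cat/nodes?v&h=name,cpu,merges.current,merges.total_time,segments.count&s=cpu:desc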

Suspect:

Because we are using pre-existing indices after the upgrade, we have an internal suspicion that the Lucene files associated with these indices might not have been rewritten to the newer version of Lucene (9.6). Could this be a possibility in this case, and could it be the reason behind this degradation?

However, we lack a means to confirm this assumption. Additionally, we're uncertain about the reasons behind the consistently high load observed on only a few nodes.
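(Presumably the per-segment Lucene version could be checked with something like the command below, where the version column is the Lucene version that wrote each segment, but we are not sure how to interpret the results:

GET _cat/segments?v&h=index,shard,segment,version,docs.count,size

Segments written before the upgrade would keep their old Lucene version until they are merged away or the index is force-merged/reindexed.)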

Can you please share your thoughts on this and help us resolve the issue?

This doesn't sound like an issue according to the upgrade guide.
Have you tried restarting the service on those high-CPU nodes?
I have seen high CPU get resolved simply by restarting the service in earlier releases. I haven't had to do that since 7.15, but it's worth a try if you haven't tried it yet.

Upgrade Elasticsearch

Elasticsearch clusters can usually be upgraded one node at a time so upgrading does not interrupt service. For upgrade instructions, refer to Upgrading to Elastic 8.8.2.

Upgrade from 7.x

To upgrade to 8.8.2 from 7.16 or an earlier version, you must first upgrade to 7.17, even if you opt to do a full-cluster restart instead of a rolling upgrade. This enables you to use the Upgrade Assistant to identify and resolve issues, reindex indices created before 7.0, and then perform a rolling upgrade. You must resolve all critical issues before proceeding with the upgrade. For instructions, refer to Prepare to upgrade from 7.x.

Index compatibility

Elasticsearch has full query and write support for indices created in the previous major version. If you have indices created in 6.x or earlier, you can use the archive functionality to import them into newer Elasticsearch versions, or you must reindex or delete them before upgrading to 8.8.2. Elasticsearch nodes will fail to start if incompatible indices are present. Snapshots of 6.x or earlier indices can only be restored to an 8.x cluster using the archive functionality, even if they were created by a 7.x cluster. The Upgrade Assistant in 7.17 identifies any indices that need to be reindexed or removed.

We tried restarting the nodes, but they are still running into high CPU, so we are still trying to figure out why the load is not being distributed properly.

Does the ES log show anything interesting?
Prolonged 100% CPU is not normal.
Try moving a few heavy shards to low-CPU nodes and see what happens.
Make sure your cluster is not actively moving shards first, maybe due to auto-balancing. There was an issue where the auto-balancing code was not right: it moved shards in and out of a new node. Not sure if it has been fixed, but that's what I would check first.
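A quick check for that (standard cluster health fields, filtered for brevity):

GET _cluster/health?filter_path=relocating_shards,initializing_shards,unassigned_shards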

  1. Ensure no active recovery is going on.
  2. Move one heavy shard out of one of the high-CPU nodes and observe (see the example commands below).
  3. Repeat step 2 until that node's CPU starts to drop. You might need to move some light shards onto that high-CPU node to balance out the shard count.

The CPU has to drop at some point for that node.

*Adding a few new data nodes might also achieve the same effect.
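A rough sketch of the relevant APIs for steps 1 and 2 (the index and node names in the reroute body are placeholders, not from your cluster):

GET _cat/recovery?v&active_only=true
GET _cat/shards?v&s=store:desc

POST _cluster/reroute
{
  "commands": [
    { "move": { "index": "my-index", "shard": 0, "from_node": "hot-node-1", "to_node": "cool-node-2" } }
  ]
}

The first two show in-flight recoveries and the largest shards per node; the reroute move command relocates a single shard manually.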

GET /_cat/tasks?v&detailed=true
The above command gives you a snapshot of active tasks. Issue it back to back a few times and observe the "running_time" column. Do you see any tasks taking multiple seconds to execute?
This could also potentially point you to the area to check.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.