Increased cluster latency when adding aliases with 30 nodes

We are running a 6.8.0 cluster with 30 nodes. All nodes are data nodes, and one of them is also master-eligible. Every node is sized with 32 CPUs and 64 GB RAM. The cluster runs in Azure and was deployed via the https://github.com/elastic/azure-marketplace deployment script. Total cluster data is about 450 GB and 350 million documents, across 31 primary shards and 231 replica shards in total. Heap size is set to 50% of the available memory on each node.

We load data into this cluster during the early morning hours, when our search traffic is very low. We have two indices and two aliases: a read alias pointing to the index we serve searches from, and a write alias pointing to the index we load data into. As part of our data load we delete and recreate one of the indices, load data into it, and then assign the read alias to the index with the fresh data.
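For context, the alias assignment step is equivalent to something like the sketch below (using the elasticsearch-py 6.x client; the endpoint, index, and alias names are just placeholders, not our real ones):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

OLD_INDEX = "data-index-a"   # index the read alias currently points to (placeholder)
NEW_INDEX = "data-index-b"   # freshly loaded index (placeholder)
READ_ALIAS = "data-read"     # placeholder alias name

# Move the read alias in a single _aliases call so the remove and add
# are applied atomically and readers never see a window with no alias.
es.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": OLD_INDEX, "alias": READ_ALIAS}},
        {"add": {"index": NEW_INDEX, "alias": READ_ALIAS}},
    ]
})
```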

When we assign the read alias to the new index, we have started to see very high latency, often greater than 30 seconds. This is a relatively new issue: the cluster had been running steadily without it for the past 6 months. The latency causes timeouts in our data load job, which I understand could be written to handle these failures, but that's a different discussion.
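What I mean by handling it client-side is roughly the sketch below (again elasticsearch-py; the 60-second timeout is an arbitrary illustration value, and the index/alias names are placeholders):

```python
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ConnectionTimeout

es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

def assign_read_alias(old_index, new_index, alias):
    actions = {"actions": [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]}
    try:
        # request_timeout raises the client-side timeout for this one call;
        # 60 seconds is an arbitrary value, not a recommendation.
        es.indices.update_aliases(body=actions, request_timeout=60)
    except ConnectionTimeout:
        # The master can still apply the change after the client gives up,
        # so check whether the alias actually landed before failing the job.
        if not es.indices.exists_alias(index=new_index, name=alias):
            raise
```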

Some things to make note of:

  • We typically see this issue with a higher number of nodes. It does not occur with only 10 active nodes, but it sporadically appears with 15 or more nodes.
  • The issue did not appear until about 6 months after the cluster was first deployed. Is there some old, stale data somewhere that is causing latency? (I don't know.)

My potential ideas/fixes:
1.) Add a longer delay between the time the data load finishes and the time we assign the read alias to the new index. We currently wait about 10 seconds between these steps. Does the cluster need some time to rebalance the data across nodes? (See the sketch below for what I mean.)
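What I have in mind for option 1 is replacing the fixed sleep with a wait on cluster health, roughly like this sketch (elasticsearch-py; the timeout values are arbitrary):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

# Instead of a fixed 10-second sleep, block until all shards are allocated
# and nothing is relocating before swapping the read alias.
es.cluster.health(
    wait_for_status="green",
    wait_for_no_relocating_shards=True,
    timeout="120s",       # how long the server will wait before returning timed_out
    request_timeout=130,  # client-side timeout, slightly longer than the server wait
)
```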

Can anyone shed some light on this? It's been a pain to debug and our team is looking to put it to rest.

Many thanks in advance.
