Indexation stopped during indices rollover

Josselin · April 16, 2025, 8:02am

Hello,

We currently have one large cluster (300+ nodes) with tiering.
We are running Elasticsearch version 8.4.3.

For the past few months we have been seeing issues during index rollover, where bulk index requests are delayed by more than 30s.
This causes drops in our indexing rate that are quite hard to recover from.

I assumed that during a rollover the new write index would be 100% operational before the alias is switched, so bulk requests should always be handled without downtime.

If anyone has an idea how to fix this issue, we would be grateful.

Thank you for your time

Christian_Dahlqvist · April 16, 2025, 8:22am

What is the full output of the cluster stats API?

Rollover requires new indices to be created and aliases to be changed so will result in a number of cluster state updates. In such a large cluster I would not be surprised if this takes some time to perform and propagate, especially if nodes are under heavy load.

How many dedicated master nodes do you have? Are these exempt from serving client requests? What is the specification of these nodes? Do you see anything in the logs about delays around cluster state update and propagation?
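For readers following along, the information requested above can be collected with a couple of read-only calls. The following is only a minimal sketch: it assumes a node reachable at a hypothetical `http://localhost:9200` with no authentication or TLS, and `fetch_json`, `summarize_pending_tasks` and `report` are illustrative helper names, not part of any Elasticsearch client. The `_cluster/stats` and `_cluster/pending_tasks` endpoints are the standard REST APIs.

```python
import json
import urllib.request

# Assumption: hypothetical address of a coordinating node, no auth or TLS.
ES_URL = "http://localhost:9200"


def fetch_json(path, base_url=ES_URL, timeout=30):
    """GET a JSON document from the cluster (read-only)."""
    url = f"{base_url}/{path.lstrip('/')}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize_pending_tasks(pending):
    """Summarize a `_cluster/pending_tasks` response.

    Returns how many cluster state update tasks are queued on the master
    and the longest time any of them has waited, in milliseconds.
    """
    tasks = pending.get("tasks", [])
    longest = max((t.get("time_in_queue_millis", 0) for t in tasks), default=0)
    return {"count": len(tasks), "max_time_in_queue_millis": longest}


def report():
    """Print the cluster stats and a summary of the pending task queue."""
    print(json.dumps(fetch_json("_cluster/stats"), indent=2))
    print(summarize_pending_tasks(fetch_json("_cluster/pending_tasks")))
```

Calling `report()` while a rollover is in progress would show whether cluster state update tasks are piling up on the elected master at that moment.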

Josselin · April 16, 2025, 8:46am

Thank you for your answer.

We currently have 3 dedicated master nodes.
Two of these nodes have these specs: AMD EPYC GENOA 9254 - 24c/48t - 2.9 GHz/3.9 GHz - 125 GB heap
One has these specs (we are in the process of replacing it with the same specs as the other two): AMD EPYC 7532 - 32c/64t - 2.4 GHz/3.3 GHz - 256 GB RAM - 125 GB heap
One node is considered a master but in a voting-only configuration. It handles metrics requests and its specs are: Intel Xeon-E 2388G - 8c/16t - 3.2 GHz/4.6 GHz - 64 GB RAM - 32 GB heap

The currently elected master is one of the servers with the best specs.

The master servers have only the master role and should not handle any client requests.
Client requests are sent to a pool of coordinating-only nodes.

Yes, we are seeing logs about delays around cluster state propagation:

{"@timestamp":"2025-04-16T08:43:13.684Z", "log.level": "INFO", "message":"after [10s] publication of cluster state version [42162703] is still waiting for

But when we switch some nodes to the DEBUG log level, they all seem to write the cluster state diff really quickly (200ms max).
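To line these warnings up against rollover times, the slow-publication messages can be pulled out of the master's log. This is a rough sketch that matches only the fragment visible in the log line quoted above; because that line is cut off, the code uses regular expressions rather than JSON parsing, and `parse_publication_lag` is a hypothetical helper name.

```python
import re

# Matches only the part of the message visible in the quoted log line.
LAG_PATTERN = re.compile(
    r"after \[(?P<elapsed>[^\]]+)\] publication of cluster state version "
    r"\[(?P<version>\d+)\] is still waiting for"
)
TIMESTAMP_PATTERN = re.compile(r'"@timestamp":"(?P<ts>[^"]+)"')


def parse_publication_lag(line):
    """Return (timestamp, elapsed, version) for a slow-publication line.

    Returns None when the line is not a slow-publication message. The
    timestamp is None if the line carries no "@timestamp" field.
    """
    lag = LAG_PATTERN.search(line)
    if lag is None:
        return None
    ts = TIMESTAMP_PATTERN.search(line)
    return (
        ts.group("ts") if ts else None,
        lag.group("elapsed"),
        int(lag.group("version")),
    )
```

Running this over the master log and comparing the timestamps with the times at which rollovers were triggered would show whether the publication delays coincide with the indexing stalls.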

I am collecting the cluster stats output and will send it to you by DM, since it may contain private information, if that's okay with you.

Josselin · April 16, 2025, 9:10am

@Christian_Dahlqvist here is the cluster stats file : Cluster stats - Pastes.io
Password : elastic2025
