Best strategy for fast recoveries after node restart

We have a 90-node cluster running in an Azure Service Fabric application. The application contains a few services (Logstash, a watchdog service, some compliance services, etc.), and we occasionally need to roll updates out to the SF app. Azure is not Elasticsearch-aware, and I don't want all the nodes going down at once, so I have a watchdog integrated with Azure Patch Orchestration: after the app is upgraded on a node, the watchdog monitors ES for both health and shard movement, and only proceeds to the next node when all is good. The issue is that since I can't pause ingestion, when a node goes down and comes back up (I do have an allocation delay configured), the changes that accumulated in the translog while it was down have to be replayed. This takes several minutes per node, so doing all 90 nodes one at a time with the stabilization waits means an upgrade can take 8+ hours. That long delay causes problems if we then find we need to fix-deploy-verify.
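For context, the "only proceed when all is good" check in my watchdog is roughly the following sketch. The `_cluster/health` response fields are standard Elasticsearch, but the helper names and the exact thresholds are my own; this is a simplified illustration, not the production code.

```python
def delayed_allocation_settings(timeout="5m"):
    """Index settings body that delays replica reallocation after a node
    leaves, so a quick restart doesn't trigger a full shard rebuild.
    (Applied via PUT /_all/_settings before the rolling restart.)"""
    return {"settings": {"index.unassigned.node_left.delayed_timeout": timeout}}


def safe_to_proceed(health):
    """Given a parsed _cluster/health response, decide whether the next
    node can be restarted: cluster is green and no shards are moving,
    initializing, or waiting in the pending-task queue."""
    return (
        health.get("status") == "green"
        and health.get("relocating_shards", 1) == 0
        and health.get("initializing_shards", 1) == 0
        and health.get("number_of_pending_tasks", 1) == 0
    )


# Example: a settled cluster passes the gate; one still replaying does not.
settled = {
    "status": "green",
    "relocating_shards": 0,
    "initializing_shards": 0,
    "number_of_pending_tasks": 0,
}
recovering = dict(settled, status="yellow", initializing_shards=3)
print(safe_to_proceed(settled))     # True
print(safe_to_proceed(recovering))  # False
```

The watchdog polls this check on a timer between node restarts; the several-minute translog replay per node is exactly the window where `safe_to_proceed` returns False.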

Are there any suggestions on how to best deal with rolling restarts when you can't pause ingestion?

