Best strategy for fast recoveries after node restart

We have a 90-node cluster running in an Azure Service Fabric application. The application contains a few services (Logstash, a watchdog service, some compliance services, etc.), and we occasionally need to roll updates out to the SF app. Azure is not Elasticsearch-aware, and I don't want all the nodes going down at once, so I have a watchdog integrated with Azure Patch Orchestration: after the app is upgraded on a node, the watchdog monitors ES for both health and shard movement, and only proceeds to the next node when all is good. The issue is that since I can't pause ingestion, when a node goes down and comes back up (I do have an allocation delay configured), the changes that accumulated in the translog while it was down have to be replayed. This takes several minutes per node, so doing all 90 nodes one at a time with the stabilization waits means an upgrade can take 8+ hours. That long delay causes problems if we then find we need to fix-deploy-verify.
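For context, the "only proceed when all is good" check in my watchdog is roughly the following sketch. The `_cluster/health` response fields are standard Elasticsearch, but the helper names and the exact thresholds are my own; this is a simplified illustration, not the production code.

```python
def delayed_allocation_settings(timeout="5m"):
    """Index settings body that delays replica reallocation after a node
    leaves, so a quick restart doesn't trigger a full shard rebuild.
    (Applied via PUT /_all/_settings before the rolling restart.)"""
    return {"settings": {"index.unassigned.node_left.delayed_timeout": timeout}}


def safe_to_proceed(health):
    """Given a parsed _cluster/health response, decide whether the next
    node can be restarted: cluster is green and no shards are moving,
    initializing, or waiting in the pending-task queue."""
    return (
        health.get("status") == "green"
        and health.get("relocating_shards", 1) == 0
        and health.get("initializing_shards", 1) == 0
        and health.get("number_of_pending_tasks", 1) == 0
    )


# Example: a settled cluster passes the gate; one still replaying does not.
settled = {
    "status": "green",
    "relocating_shards": 0,
    "initializing_shards": 0,
    "number_of_pending_tasks": 0,
}
recovering = dict(settled, status="yellow", initializing_shards=3)
print(safe_to_proceed(settled))     # True
print(safe_to_proceed(recovering))  # False
```

The watchdog polls this check on a timer between node restarts; the several-minute translog replay per node is exactly the window where `safe_to_proceed` returns False.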

Are there any suggestions on how to best deal with rolling restarts when you can't pause ingestion?

