20 seconds downtime when swapping alias

Hi all. I have a service that requires zero downtime. Because the data needs to be versioned, our pipeline ingests data every day into a new index. Once the new index is "green" and searchable, we swap the alias over from the old index. But after the swap there is a period of about 20 seconds during which Elasticsearch hangs on queries, which leads to a latency of 10-25 seconds. Any idea how we can avoid that?

For the swap, we tried adding the new index to the alias and then removing the old one. We also tried putting the add and the remove in the same request so that the operation is atomic. Neither works.
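For reference, the atomic variant mentioned above could look roughly like the sketch below. It sends a single `POST /_aliases` request with a `remove` and an `add` action. The alias and index names, the base URL, and the helper names `build_alias_swap` and `swap_alias` are placeholders for illustration, not taken from the actual pipeline.

```python
import json
import urllib.request


def build_alias_swap(alias, old_index, new_index):
    """Build an _aliases body that moves `alias` from old_index to new_index.

    Both actions travel in one request, so the alias never points at
    zero indices or at both indices from a searcher's point of view.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def swap_alias(base_url, alias, old_index, new_index):
    """POST the swap to the cluster and return the parsed JSON response.

    base_url is an assumption, e.g. "http://localhost:9200".
    """
    body = json.dumps(build_alias_swap(alias, old_index, new_index)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/_aliases",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `swap_alias("http://localhost:9200", "data", "data-v1", "data-v2")` would perform the switch in one step; all of those values are hypothetical.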

Welcome to our community! :smiley:

What version are you on?

Where are you seeing this?

Doesn't work in what way?

The version is 7.13.2.
The services calling Elasticsearch were waiting for 20 s. There is also a huge gap in our Kibana:

Sounds like the cluster state update might be taking a long time. What is the full output of the cluster stats API? What is the specification of your hardware and storage? What type of load is your cluster under?

The index being swapped in contains 10,000,000 to 20,000,000 docs. The cluster itself is hosted on Azure's k8s with sufficiently high-tier resources (CPU, memory, and disk load are all in a reasonable range). I assume there is some warm-up issue after switching the alias, and I just want to know how to measure it correctly. Currently I poll the index until it's green and switch the alias after that. I even test with a small query to make sure it's actually working. But queries against the target alias still freeze when the worker requests the swap.
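As an illustration of the "wait until green" step described above, here is a minimal sketch that asks the cluster health API to block until the index is green instead of polling in a loop. It assumes an unauthenticated cluster at a placeholder base URL; the helper names are hypothetical. In 7.x a health request that times out comes back as HTTP 408, which the sketch treats as "not ready".

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def health_url(base_url, index, status="green", timeout="30s"):
    """Build the index-level cluster health URL with wait_for_status set."""
    query = urllib.parse.urlencode({"wait_for_status": status, "timeout": timeout})
    return f"{base_url}/_cluster/health/{urllib.parse.quote(index)}?{query}"


def is_ready(health):
    """True when the health response reports green and did not time out."""
    return health.get("status") == "green" and not health.get("timed_out", False)


def wait_for_green(base_url, index, timeout="30s"):
    """Block until the index is green or the server-side timeout expires."""
    try:
        with urllib.request.urlopen(health_url(base_url, index, timeout=timeout)) as response:
            return is_ready(json.load(response))
    except urllib.error.HTTPError as err:
        if err.code == 408:  # wait_for_status was not reached in time
            return False
        raise
```

Note that a green status only says the shards are allocated. It says nothing about whether caches are warm, so it cannot rule out slow first queries on its own.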

What is the output from the _cluster/stats?pretty&human API?

It could be caused by a large cluster state, e.g. due to a very high number of shards in the cluster, slow disk I/O or just high load on the cluster.

I would suggest calling GET _nodes/hot_threads?threads=9999 a few times during the 20-second pause to find out what Elasticsearch is actually doing at that time. One possible explanation is that the caches on this new index are cold, so the first few queries after the switchover have to do a lot of extra work. If so, you can warm it up by doing some realistic queries before the switchover.
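Both suggestions can be scripted. The sketch below is only an illustration under stated assumptions: the base URL, index name, queries, sample count, and interval are placeholders, and the helper names are made up. `sample_hot_threads` fetches the hot threads output several times in a row, and `warm_up` runs a list of queries against the new index directly (not the alias) while timing each one, which also gives a rough measure of how cold the index is before the switchover.

```python
import json
import time
import urllib.request


def hot_threads_url(base_url, threads=9999):
    """URL for the hot threads API with the thread count suggested above."""
    return f"{base_url}/_nodes/hot_threads?threads={threads}"


def _get_text(url):
    # Hot threads output is plain text, not JSON.
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def _post_json(url, body):
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def sample_hot_threads(base_url, samples=5, interval=2.0, fetch=_get_text):
    """Collect several hot threads dumps, `interval` seconds apart."""
    dumps = []
    for i in range(samples):
        dumps.append(fetch(hot_threads_url(base_url)))
        if i < samples - 1:
            time.sleep(interval)
    return dumps


def warm_up(base_url, index, queries, send=_post_json):
    """Run each query body against the index and return seconds per query."""
    timings = []
    for query in queries:
        start = time.perf_counter()
        send(f"{base_url}/{index}/_search", query)
        timings.append(time.perf_counter() - start)
    return timings
```

Running `sample_hot_threads` from a separate process right as the worker requests the swap would capture what the nodes are doing during the pause. If the timings from `warm_up` drop sharply after the first few queries, that would point toward cold caches rather than a slow cluster state update.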
