Data stream resilience

Hello

I'm migrating from indices to data streams in a relatively large Elasticsearch cluster (15 data nodes, 80k docs per second).

In our current index-based setup we monitor whether the index currently being written to turns red and, if it does, create a new index so that no data is lost. (This solution was implemented back in the ES 2.0 days and carried over to v5 and now v8.) It has been very helpful during server patching, when nodes are shut down one by one for some time.

Now we're migrating to data streams (on v8, of course) and I wonder whether that strategy still makes sense. I planned to change the mechanism so that the data stream's health is monitored and a rollover is forced if it turns red. This mechanism seems to work on my test cluster, but when I checked the underlying indices after restarting a few nodes, all of them had shards on all nodes, including the ones that were temporarily down.
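For reference, the mechanism described above could look roughly like the following minimal sketch. It assumes an unsecured test cluster at `http://localhost:9200` and uses the cluster health API (`GET /_cluster/health/<target>`) and the rollover API (`POST /<target>/_rollover`). The function names, `ES_URL`, and the data stream name in the usage comment are hypothetical, and authentication, retries, and error handling are left out.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local, unsecured test cluster


def es_request(method, path, base_url=ES_URL):
    """Send a body-less request to Elasticsearch and return the decoded JSON."""
    req = urllib.request.Request(f"{base_url}{path}", method=method)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def needs_rollover(health):
    """Return True when the health response reports a red status."""
    return health.get("status") == "red"


def rollover_if_red(data_stream, request=es_request):
    """Check the data stream's health and force a rollover if it is red.

    `request` is injectable so the decision logic can be exercised
    without a running cluster.
    """
    health = request("GET", f"/_cluster/health/{data_stream}")
    if needs_rollover(health):
        request("POST", f"/{data_stream}/_rollover")
        return True
    return False


# Hypothetical usage against a live cluster:
#   rollover_if_red("logs-myapp-default")
```

Note that this only moves new writes to a fresh backing index. Any shards of the previous write index that sat on the stopped node stay unavailable until that node returns.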

Are there any new mechanisms in data streams that improve resiliency and, in particular, handle nodes being temporarily down?

Also, what good practices do you recommend for making data streams resilient in such a use case?

Thanks,
Lukasz

Welcome to our community! :smiley:

The best approach would be to work out why they are turning red and fix that, rather than forcing a rollover on red.

How many primary and replica shards are you indexing into? Do you have enough resiliency built in? Do you have any settings that limit the number of shards for an index per node in place?

It is recommended to have a replica shard configured and ensure the nodes have enough capacity to relocate shards from a failed node if necessary.
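As a rough illustration of the replica recommendation, the sketch below builds the body of an index template (for `PUT _index_template/<name>`) that gives every backing index of a data stream one replica. The pattern `logs-myapp-*`, the priority of 200, and the helper name are assumptions made for the example, not values from this thread.

```python
import json


def data_stream_template(pattern, replicas=1):
    """Build an index template body that enables a data stream with replicas."""
    return {
        "index_patterns": [pattern],
        "data_stream": {},  # marks matching names as data streams
        "priority": 200,    # assumption: high enough to win over other templates
        "template": {
            "settings": {
                "index.number_of_replicas": replicas,
            }
        },
    }


# Print the JSON body that would be sent to PUT _index_template/<name>.
print(json.dumps(data_stream_template("logs-myapp-*"), indent=2))
```

With one replica, each shard has a copy on a second node, so stopping a single node for patching should not turn the write index red.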

I have enough capacity and each index has enough shards to be distributed over all nodes.

The problem is that, due to cost, I do not have replicas. That is why it was crucial to create a new active index when a node went down, so that the shards of the new index are located on running nodes.

Are you suggesting that a replica is more or less necessary? And a second question: do you think it would be enough to keep a replica only for the few hours of hot indices and leave the remaining warm indices without one?

If you want resilience, yes.

Indexing will apply backpressure, but if you suffer any type of corruption or storage loss you will lose data, so having a replica in place is recommended.

The recommended practices in this area are all covered at length in the high availability section of the manual.

For resilience you definitely need replicas on indices to which you're writing. For read-only indices you should either use replicas or searchable snapshots. If you're ok with some downtime and manual recovery steps after a failure you might consider removing replicas from read-only indices after they have been snapshotted instead.
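One way to express "replicas while writing, none afterwards" is an ILM policy. The sketch below builds a body for `PUT _ilm/policy/<name>` that rolls over in the hot phase and drops replicas in the warm phase through the `allocate` action. The 4h and 50gb thresholds and the helper name are illustrative assumptions. The policy does not check that a snapshot exists before replicas are removed, so that has to be ensured separately.

```python
import json


def ilm_policy(hot_max_age="4h", warm_replicas=0):
    """Build an ILM policy body: roll over while hot, drop replicas when warm."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_age": hot_max_age,            # assumed threshold
                            "max_primary_shard_size": "50gb",  # assumed threshold
                        }
                    }
                },
                "warm": {
                    "min_age": "0ms",  # enter warm right after rollover
                    "actions": {
                        # Only safe if the index has already been snapshotted.
                        "allocate": {"number_of_replicas": warm_replicas}
                    },
                },
            }
        }
    }


# Print the JSON body that would be sent to PUT _ilm/policy/<name>.
print(json.dumps(ilm_policy(), indent=2))
```

The replica count for the hot phase comes from the index template, so only the write index and other indices still in the hot phase pay the extra storage cost.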
