What can go wrong in an ES cluster without replicas

Hi, the typical paradigm in ES is that sharding involves replica shards, which serve two main purposes. First, they act as a backup: if a shard or a node goes down, its replica continues to serve the same data until the primary shard is restored. Second, an incoming query may land on a machine where some of the matching data is available right there in a replica copy, instead of travelling to some other node to find it in the primary shard, which saves time.
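For context, this is the kind of setup I mean, as a rough sketch with the Python client (index name and shard count are placeholders). One thing in our favour is that `number_of_replicas` is a dynamic setting, so replicas could be added back later without reindexing:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create an index with 3 primary shards and no replicas (the scenario in question)
es.indices.create(
    index="my_index",
    body={"settings": {"number_of_shards": 3, "number_of_replicas": 0}},
)

# number_of_replicas is dynamic, so replicas can be added back later
# without reindexing, once we move past the prototype stage
es.indices.put_settings(index="my_index", body={"index": {"number_of_replicas": 1}})
```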

How well can ES perform in the absence of replicas?

  1. Obviously there will be no backup, but let's assume it's a prototype with a limited release, and the data center is Tier 4 certified with 99.99999% uptime. How often, and with what probability, will the lack of replicas hurt, whether regularly or sporadically? Of course I know this is not a permanent solution, just a temporary one.

  2. Imagine there are 1000 nodes, and once in a while an incoming query to one node will find matching data on that node itself, in a replica copy. Without replicas that local hit goes away, so the query has to travel across the cluster every time (which it largely has to do anyway, to gather results from all nodes involved and produce a good overall top-10 result). How much of an average performance cost will that incur?
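For illustration, here is roughly how I picture inspecting that fan-out, using the search shards API via the Python client (index name is a placeholder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Which shard copies could serve a search against this index, and on which nodes?
# With replicas there may be several candidate copies per shard; with none,
# every query must reach the single primary, wherever it happens to live.
resp = es.search_shards(index="my_index")
for group in resp["shards"]:
    for copy in group:
        role = "primary" if copy["primary"] else "replica"
        print(copy["index"], copy["shard"], role, copy["node"])
```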

Any other problems that may come up?

Technically, the replica is promoted to primary and will remain the primary from that point on, so the original primary is not "restored". Not having replicas in the cluster will make you notice cluster and/or node issues more frequently.
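You will see this in the cluster status, too: with replicas, losing a node typically leaves the cluster yellow, while without them it goes straight to red. A minimal check with the Python client, as a sketch:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

health = es.cluster.health()
# "yellow" = all primaries assigned but some replicas missing;
# "red" = at least one primary unassigned, i.e. part of the data is unavailable.
# With number_of_replicas: 0, any lost shard takes you directly to red.
print(health["status"], health["unassigned_shards"])
```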

It is not only node failures that can lead to issues. Nodes may get overloaded or suffer from long GC pauses, which can make queries return partial results and/or time out.
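Partial results are easy to miss, as the request itself still succeeds. As a sketch with the Python client (index name is a placeholder), you can check the response for them:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(index="my_index", body={"query": {"match_all": {}}, "timeout": "2s"})

# With no replicas there is no other shard copy to fall back on, so a failed
# or timed-out shard means those documents are simply absent from the result.
if resp["timed_out"] or resp["_shards"]["failed"] > 0:
    print("partial results:", resp["_shards"])
```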

I have never seen a cluster that large. Usually you hit limits before that point and instead deploy multiple clusters and use cross-cluster search.
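For reference, a rough sketch of cross-cluster search with the Python client (the remote alias, seed address and index names are all placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Register a remote cluster under the alias "cluster_two" (seed address is a placeholder)
es.cluster.put_settings(body={
    "persistent": {
        "cluster": {"remote": {"cluster_two": {"seeds": ["10.0.0.2:9300"]}}}
    }
})

# One search spanning a local index and the same index on the remote cluster
resp = es.search(index="my_index,cluster_two:my_index",
                 body={"query": {"match_all": {}}})
print(resp["hits"]["total"])
```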

The more nodes you have in the cluster, the more nodes are likely to be involved in each query (unless you have very targeted queries), which also means more network traffic.

If you go down this path, make sure you also take frequent snapshots using the snapshot API, as any corruption or disk issue could force you to restore and potentially lose data.
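As a sketch with the Python client (repository name, path and snapshot name are placeholders, and the location must be listed under `path.repo` on every node):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Register a shared-filesystem snapshot repository
es.snapshot.create_repository(
    repository="my_backup",
    body={"type": "fs", "settings": {"location": "/mnt/backups/my_backup"}},
)

# Snapshot all indices and wait until the snapshot completes
es.snapshot.create(
    repository="my_backup",
    snapshot="snapshot_1",
    wait_for_completion=True,
)
```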

Thanks Christian. Yes, by "cluster" I did not literally mean a single cluster spanning all 1000 nodes; I was just loosely referring to the entire collection of machines as one cluster.

Are there any statistics, or approximations, of how much of a performance hit to expect when there is no backup? Could queries that take 10 ms end up taking 5000 ms?

And secondly, is there any metric or stat on how often nodes or shards go down? If there are a couple of on-call people monitoring node health, can't they get a signal that some shard on some node is down and try to fix it ASAP, minimizing the potential repercussions?
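For example, I imagine the on-call folks could run something like this on a schedule, as a purely illustrative sketch with the Python client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# List every shard with its state and location; with no replicas, any
# non-STARTED entry here means that part of the index is offline right now.
for shard in es.cat.shards(format="json"):
    if shard["state"] != "STARTED":
        print(f"ALERT: {shard['index']} shard {shard['shard']} "
              f"is {shard['state']} (node: {shard.get('node')})")
```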

That is impossible to tell as it depends on data, queries, configuration and hardware.

I would expect this to vary depending on load and use case.

Thanks Christian. Of course it will depend on a bunch of parameters; what I am looking for is folks speaking from their own experience, even though the conditions may vary totally in our case. That will, at the very least, give me some sort of idea, as crude as it may be.

I would not necessarily expect it to affect performance much, but rather resiliency and stability.

Of course. Got it. But I guess with HDDs the performance will take a massive hit, as you said in an earlier thread. :frowning: The thing is, HDDs are cheap and SSDs are freaking expensive. :frowning:
