ES 2.2 cluster - how to deal with data loss?

mattb · March 7, 2016, 11:49am

Hello,

I recently started some cluster testing with ES:

3 nodes for ES (2.2)
all nodes are master- and data nodes
client: logstash, with ES output plugin

Test scenario:

Continuous, significant load via logstash
shut down master node (Ctrl C)
-> new master node is elected
start node previously shut down again
when status of ES cluster is still yellow:
shutdown newly selected master node

With this test scenario, I can easily create data loss.
I've read through quite some documents, result:
In status "yellow", data loss is "acceptable" when problems occur.
Thus, this situation is treated as "OK" from ES point of view.
However, from a users point of view, it is not:
An administrator doesn't know if data loss occurred or not: Important error log information could have been lost.

Basically, I would expect that I can determine if data loss occurred.
Any proposal how we could deal with this situation?
I think increasing number of nodes/replicas will improve the situation.
But still, it could occur.
Any best practice out there?

Many thanks in advance for your help.

bleskes · March 8, 2016, 10:24am

I need to understand the test case better. When you shut down the second node, is that the node that has the only single "good" copy of the data (i.e., the primary)?

If so my guess is that you run into https://github.com/elastic/elasticsearch/issues/14671 , fixed for the coming 5.0 release by https://github.com/elastic/elasticsearch/issues/14739 .

When the single copy is lost, data is lost (no way around that, but we report that number of successful shards in the response of write operation so you can monitor this per write as well). The main problem is that ES automatically decides that having some stale data is better than having no data at all and when the stale copy comes back it chooses it as a new primary. The fixes I linked above prevent that and requires manual intervention.

mattb · March 8, 2016, 6:03pm

Many thanks for your answer.
I cannot 100% confirm that the situation you described is the one I have.
But it's what I assume that is going on.
My major problem is not having the data loss but not recognizing the data loss.
With a required manual interaction this problem would be gone.
Is there any documentation available about how ES is internally working, mainly how error situations are handled?
Thus, I could understand much better what I can expect.

I'd like to check the number of shards written.
But this is done via logstash+ES-output-plugin.
As far as I've seen there's no way to configure the min. number of shards to be written.

bleskes · March 9, 2016, 7:55am

Agreed. This is why with ES 5.0 the cluster will stay RED and require a user ack for the data loss. Also, the shard header in write response tells you how close you are to this situation.

Working on it. "how ES is internally working" is a big topic

This sounds to me like a good feature request for the logstash team - stop indexing and do not ack documents that were indexed to <X copies. I suggest you open a ticket on the github project for logstash...

Topic		Replies	Views
Data loss on shutdown of node in a 3 node cluster Elasticsearch	2	347	July 6, 2017
Data is lost after elasticsearch restart Elasticsearch	4	503	April 25, 2023
Detecting Data Loss Elasticsearch	6	1613	January 21, 2018
Elasticsearch as a primary store Elasticsearch	1	369	June 18, 2018
Cluster partition resulted in loss of data Elasticsearch	5	426	July 6, 2017

ES 2.2 cluster - how to deal with data loss?

Related topics