ES 2.2 cluster - how to deal with data loss?


#1

Hello,

I recently started some cluster testing with ES:

  • 3 nodes for ES (2.2)
  • all nodes are master- and data nodes
  • client: logstash, with ES output plugin

Test scenario:

  • Continuous, significant load via logstash
  • shut down the master node (Ctrl+C)
    -> a new master node is elected
  • restart the node that was shut down
  • while the ES cluster status is still yellow:
    shut down the newly elected master node
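
To script the "still yellow" step in the scenario above, the cluster health API (`GET _cluster/health`) returns a JSON body with a `status` field and, when `wait_for_status` is used, a `timed_out` flag. The following is a minimal sketch of interpreting that body. The function name `safe_to_stop_node` and the sample values are made up for illustration and are not taken from the test setup:

```python
import json

def safe_to_stop_node(health):
    """Return True only if the cluster reported green and the wait did not time out.

    `health` is the parsed JSON body of GET _cluster/health
    (for example with ?wait_for_status=green&timeout=30s).
    """
    return health.get("status") == "green" and not health.get("timed_out", False)

# Made-up sample bodies, for illustration only.
yellow = json.loads('{"status": "yellow", "timed_out": true, "unassigned_shards": 5}')
green = json.loads('{"status": "green", "timed_out": false, "unassigned_shards": 0}')

print(safe_to_stop_node(yellow))
print(safe_to_stop_node(green))
```

Stopping a second node only when this check passes avoids the window in which a restarted node still holds stale shard copies.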

With this test scenario, I can easily cause data loss.
I've read through quite a few documents, and my conclusion is:
In status "yellow", data loss is "acceptable" when problems occur.
Thus, this situation is treated as "OK" from ES's point of view.
However, from a user's point of view, it is not:
An administrator doesn't know whether data loss occurred: important error log information could have been lost.

Basically, I would expect to be able to determine whether data loss occurred.
Any suggestions for how we could deal with this situation?
I think increasing the number of nodes/replicas will improve the situation.
But still, it could occur.
Are there any best practices out there?

Many thanks in advance for your help.


(Boaz Leskes) #2

I need to understand the test case better. When you shut down the second node, is that the node that holds the only "good" copy of the data (i.e., the primary)?

If so, my guess is that you ran into https://github.com/elastic/elasticsearch/issues/14671 , fixed for the coming 5.0 release by https://github.com/elastic/elasticsearch/issues/14739 .

When the single copy is lost, data is lost (there is no way around that, but we report the number of successful shards in the response of each write operation, so you can monitor this per write as well). The main problem is that ES automatically decides that having some stale data is better than having no data at all, and when the stale copy comes back it chooses it as the new primary. The fixes I linked above prevent that and require manual intervention.
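
As a rough illustration of monitoring this per write: in ES 2.x the index response carries a `_shards` object with `total`, `successful` and `failed` counts. The sketch below flags a write that reached fewer copies than the index is configured for. The helper name `under_replicated` and the sample response values are assumptions for illustration:

```python
import json

def under_replicated(write_response):
    """Return True if the write reached fewer shard copies than configured.

    `write_response` is the parsed JSON body of an index request.
    """
    shards = write_response.get("_shards", {})
    total = shards.get("total", 0)
    successful = shards.get("successful", 0)
    return successful < total

# Made-up response: one primary plus one replica configured, only one copy written.
sample = json.loads(
    '{"_index": "logs", "_type": "event", "_id": "1", "_version": 1,'
    ' "_shards": {"total": 2, "successful": 1, "failed": 0}, "created": true}'
)

if under_replicated(sample):
    print("warning: document %s exists on only %d of %d copies"
          % (sample["_id"], sample["_shards"]["successful"], sample["_shards"]["total"]))
```

A document with `successful` equal to 1 lives on a single copy, so losing that node before the replica recovers means losing the document.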


#3

Many thanks for your answer.
I cannot 100% confirm that the situation you described is the one I have.
But it's what I assume is going on.
My major problem is not the data loss itself but not being able to recognize it.
With required manual intervention, this problem would be gone.
Is there any documentation available about how ES works internally, mainly how error situations are handled?
That would help me understand much better what I can expect.

I'd like to check the number of shards written.
But the writes are done via logstash and its ES output plugin.
As far as I've seen there's no way to configure the min. number of shards to be written.


(Boaz Leskes) #4

Agreed. This is why with ES 5.0 the cluster will stay RED and require a user ack for the data loss. Also, the shard header in the write response tells you how close you are to this situation.

Working on it. "how ES is internally working" is a big topic :wink:

This sounds to me like a good feature request for the logstash team: stop indexing and do not ack documents that were indexed to fewer than X copies. I suggest you open a ticket on the GitHub project for logstash...
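
To make the idea concrete, here is a sketch of what such a client-side policy could look like for a bulk response, where each item reports its own `status` and `_shards` counts. This is not how logstash behaves. The function `split_by_copies`, the `min_copies` threshold and the sample response are hypothetical:

```python
def split_by_copies(bulk_response, min_copies):
    """Split bulk items into acknowledged ids and ids that should not be acked.

    An item is acknowledged only if its HTTP status is 2xx and it was
    written to at least `min_copies` shard copies.
    """
    acked, not_acked = [], []
    for item in bulk_response.get("items", []):
        # Each item is a one-key dict; the key names the action ("index", "create", ...).
        _action, result = next(iter(item.items()))
        ok = 200 <= result.get("status", 500) < 300
        successful = result.get("_shards", {}).get("successful", 0)
        target = acked if ok and successful >= min_copies else not_acked
        target.append(result.get("_id"))
    return acked, not_acked

# Made-up bulk response for illustration.
sample_bulk = {
    "errors": False,
    "items": [
        {"index": {"_id": "a", "status": 201,
                   "_shards": {"total": 2, "successful": 2, "failed": 0}}},
        {"index": {"_id": "b", "status": 201,
                   "_shards": {"total": 2, "successful": 1, "failed": 0}}},
    ],
}

acked, not_acked = split_by_copies(sample_bulk, min_copies=2)
print(acked, not_acked)
```

Documents in the second list were indexed, just not to enough copies, so a client that resends them should use explicit document ids to avoid duplicates.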

