Elastic Node failure observations

We recently conducted a node failure test in our test environment. We have:

Nodes: 3
Indices: 14
Total Shards: 94
Unassigned Shards: 0

Questions:

  1. When one of the nodes, which was the master node, was brought down, we started seeing failures for 5 minutes. Is that expected behaviour?

  2. What are the factors we need to look at for node failure testing?

  3. When the underlying JVM process was killed using kill -9 on one of the nodes, we saw 100% errors and no transactions were being processed, even though the other nodes in the cluster were available. What does this mean?

  4. When I bring down one node from a cluster of 3, it takes time for the master role to transition to another node, which caused a lot of failures. Ideally, when the master node fails, data should be retrieved from the replicas. I am not sure what the fate of the in-flight transactions is in this case; will a retry happen? The time taken to recover is 5 minutes here. If a new node doesn't come up, does the cluster re-adjust the shards and replicas?

Which version are you using? How is the cluster configured? Are you directing requests across all nodes in the cluster?

Which version are you using?
Version: 6.6.0

How is the cluster configured?
Our Elasticsearch cluster is deployed in an AWS VPC, and we have cluster.name set in elasticsearch.yml.

Are you directing requests across all nodes in the cluster?
The elasticsearch.yml just has cluster.name, thread pool settings, node.name, and some security pack settings.

As all nodes are correctly master-eligible, have you set discovery.zen.minimum_master_nodes to 2 according to these guidelines? Are you using instance types with a constant CPU supply, e.g. not t2/t3 instances?

We don't have discovery.zen.minimum_master_nodes set, which means it is using the default. However, I can recommend adding that. Having said that, I want to know:

  1. When one of the nodes, a data node, was brought down, we saw failures for 5 minutes before a new data node recovered. Is that expected behaviour?
  2. When one of the nodes, a data node, was brought down with kill -9, we saw 100% failures continuously for a longer duration. Is that expected behaviour?

Are you using instance types with constant CPU supply, e.g. not t2/t3 instances?
How do I figure this out?

This means that your cluster is incorrectly configured and likely to suffer network partitions.

In a properly configured cluster this is not expected. In an incorrectly configured cluster I am not sure what to expect as anything can happen. Note that this has been improved in Elasticsearch 7.x, making mistakes like this far less likely.

I would recommend correcting your configuration and retrying your test. I would also recommend upgrading to Elasticsearch 7.3 if possible.
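For reference, a minimal sketch of what the relevant part of elasticsearch.yml could look like on each of your three master-eligible nodes in 6.x; the hostnames and cluster name below are placeholders, so substitute your own values:

```yaml
# elasticsearch.yml (Elasticsearch 6.x) - sketch only, adjust names/hosts to your environment
cluster.name: my-cluster                                        # your existing cluster name
node.name: node1                                                # unique per node
# With 3 master-eligible nodes, the required quorum is (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
# All three nodes should be able to discover each other
discovery.zen.ping.unicast.hosts: ["node1", "node2", "node3"]
```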

I would check with whoever provisioned the instances for the cluster.
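If you have shell access to the nodes, one way to check this yourself is to query the EC2 instance metadata service from each node. This is a sketch that assumes the legacy IMDSv1 metadata access is still enabled on your instances; otherwise use the AWS console or ask whoever provisioned them:

```python
# Sketch: print the EC2 instance type of the machine this runs on.
# Only works from inside an EC2 instance with IMDSv1 metadata access enabled.
import urllib.request

URL = "http://169.254.169.254/latest/meta-data/instance-type"
with urllib.request.urlopen(URL, timeout=2) as resp:
    instance_type = resp.read().decode()

print(instance_type)  # e.g. "t3.medium" (burstable CPU) vs "m5.large" (constant CPU)
```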

Thanks for your inputs; much appreciated.

Today I ran a test:

  • Bring all 3 nodes (node1, node2 & node3) up and shut down the master node using service elasticsearch stop; in this case node3 is the master.

I see that node2 was elected as master and there were now two nodes in the cluster.

I see a lot of failures, and the failure rate did not go down even after 15 minutes. When I checked manually, the first request to the page works, but when I refresh I get a page-not-found error. The first navigation always works, and a refresh of the same page fails.

Are the failures expected in this scenario? If not, what needs to be done?

  • Now bring back node3, which was shut down, so all 3 nodes are up.
    I see node2 is still the elected master and everything is working fine.

Where do you see the failure rate? How are you querying the cluster?

The failures are in read transactions from Elasticsearch; the end user sees a page-not-found error. I am running a load test on my application, and internally the transactions read records from Elasticsearch and display them on the web page.

What I think is that when I do "service elasticsearch stop" on one node, the transactions are expected to fail, since the underlying Elasticsearch service on that node is not available. Please comment?

On the other side, if I bring down the node through AWS by selecting the EC2 instance and clicking stop, the failures are fewer because the ASG kicks in to spawn another node in the cluster.

I guess it depends on how your client or load balancer handles failures.
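For example, if the application used the official Python client, a configuration along these lines would spread requests across all nodes, drop a dead node from rotation, and retry on another node, so stopping one node would mostly go unnoticed by the application. This is a sketch only; the hostnames and index name are placeholders, not your actual setup:

```python
# Sketch: a fault-tolerant elasticsearch-py client for a 3-node cluster.
from elasticsearch import Elasticsearch

es = Elasticsearch(
    ["http://node1:9200", "http://node2:9200", "http://node3:9200"],  # placeholder hosts
    sniff_on_start=True,            # discover the live nodes at startup
    sniff_on_connection_fail=True,  # re-discover nodes when one stops responding
    sniffer_timeout=60,             # refresh the node list every 60 seconds
    retry_on_timeout=True,          # retry a timed-out request on another node
    max_retries=3,
)

# Reads are served by whichever nodes are still up.
resp = es.search(index="my-index", body={"query": {"match_all": {}}})
```

A load balancer with health checks in front of the nodes achieves the same effect for clients that only take a single endpoint.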
