Elastic Node failure observations

We recently conducted a node failure test in our test environment. We have:

Nodes: 3
Indices: 14
Total Shards: 94
Unassigned Shards: 0

Questions:

  1. When one of the nodes, which was the master node, was brought down, we started seeing failures for 5 minutes. Is that expected behaviour?

  2. What are the factors we need to look at for node failure testing?

  3. When the underlying JVM process was killed using kill -9 on one of the nodes, we saw 100% errors and no transactions were being processed, even though the other nodes in the cluster were available. What does this mean?

  4. When I bring down one node from a cluster of 3, it takes time for the master role to transition to another node, which caused a lot of failures. Ideally, when the master node fails, data should be retrieved from the replicas. I am not sure what the fate of the in-flight transactions is in this case; will a retry happen? The time taken to recover is 5 minutes here. If a new node doesn't come up, does the cluster re-adjust the shards and replicas?

Which version are you using? How is the cluster configured? Are you directing requests across all nodes in the cluster?

Which version are you using?
Version: 6.6.0

How is the cluster configured?
Our Elasticsearch cluster is deployed in an AWS VPC, and we have cluster.name set in elasticsearch.yml.

Are you directing requests across all nodes in the cluster?
The elasticsearch.yml just has cluster.name, thread pool settings, node.name, and some security pack settings.

As all nodes are correctly master-eligible, have you set discovery.zen.minimum_master_nodes to 2 according to these guidelines? Are you using instance types with a constant CPU supply, e.g. not t2/t3 instances?

We don't have discovery.zen.minimum_master_nodes set, which means it is using the default. However, I can recommend adding that. Having said that, I want to know:

  1. When one of the nodes, a data node, was brought down, we saw failures for 5 minutes before a new data node recovered. Is that expected behaviour?
  2. When one of the nodes, a data node, was brought down with kill -9, we saw 100% failures continuously for a longer duration. Is that expected behaviour?

Are you using instance types with constant CPU supply, e.g. not t2/t3 instances?
How do I figure this out?

This means that your cluster is incorrectly configured and likely to suffer network partitions.

In a properly configured cluster this is not expected. In an incorrectly configured cluster I am not sure what to expect as anything can happen. Note that this has been improved in Elasticsearch 7.x, making mistakes like this far less likely.

I would recommend correcting your configuration and retrying your test. I would also recommend upgrading to Elasticsearch 7.3 if possible.
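For reference, a minimal sketch of what the relevant part of elasticsearch.yml could look like on each of your three master-eligible nodes in 6.x; the hostnames and cluster name below are placeholders, so substitute your own values:

```yaml
# elasticsearch.yml (Elasticsearch 6.x) - sketch only, adjust names/hosts to your environment
cluster.name: my-cluster                                        # your existing cluster name
node.name: node1                                                # unique per node
# With 3 master-eligible nodes, the required quorum is (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
# All three nodes should be able to discover each other
discovery.zen.ping.unicast.hosts: ["node1", "node2", "node3"]
```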

I would check with whoever provisioned the instances for the cluster.
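If you have shell access to the nodes, one way to check this yourself is to query the EC2 instance metadata service from each node. This is a sketch that assumes the legacy IMDSv1 metadata access is still enabled on your instances; otherwise use the AWS console or ask whoever provisioned them:

```python
# Sketch: print the EC2 instance type of the machine this runs on.
# Only works from inside an EC2 instance with IMDSv1 metadata access enabled.
import urllib.request

URL = "http://169.254.169.254/latest/meta-data/instance-type"
with urllib.request.urlopen(URL, timeout=2) as resp:
    instance_type = resp.read().decode()

print(instance_type)  # e.g. "t3.medium" (burstable CPU) vs "m5.large" (constant CPU)
```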

Thanks for your inputs; much appreciated.

Today I ran a test:

  • Bring all 3 nodes (node1, node2 & node3) up and shut down the master node using service elasticsearch stop; in this case node3 is the master.

I see that node2 was elected as master and there were now two nodes in the cluster.

I see a lot of failures, and the failure rate did not go down even after 15 minutes. When I checked manually, the first request to the page works, but when I refresh I get a page-not-found error. The first navigation always works, and a refresh of the same page fails.

Are the failures expected in this scenario? If not, what needs to be done?

  • Now bring back node3, which was shut down, so all 3 nodes are up.
    I see node2 is still the elected master and everything is working fine.

Where do you see the failure rate? How are you querying the cluster?

The failures are in read transactions from Elasticsearch; the end user sees a page-not-found error. I am running a load test on my application, and internally the transactions read records from Elasticsearch and display them on the web page.

What I think is that when I do "service elasticsearch stop" on one node, the transactions are expected to fail, since the underlying Elasticsearch service on that node is not available. Please comment?

On the other side, if I bring down the node through AWS by selecting the EC2 instance and clicking stop, the failures are fewer because the ASG kicks in to spawn another node in the cluster.

I guess it depends on how your client or load balancer handles failures.
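For example, if the application used the official Python client, a configuration along these lines would spread requests across all nodes, drop a dead node from rotation, and retry on another node, so stopping one node would mostly go unnoticed by the application. This is a sketch only; the hostnames and index name are placeholders, not your actual setup:

```python
# Sketch: a fault-tolerant elasticsearch-py client for a 3-node cluster.
from elasticsearch import Elasticsearch

es = Elasticsearch(
    ["http://node1:9200", "http://node2:9200", "http://node3:9200"],  # placeholder hosts
    sniff_on_start=True,            # discover the live nodes at startup
    sniff_on_connection_fail=True,  # re-discover nodes when one stops responding
    sniffer_timeout=60,             # refresh the node list every 60 seconds
    retry_on_timeout=True,          # retry a timed-out request on another node
    max_retries=3,
)

# Reads are served by whichever nodes are still up.
resp = es.search(index="my-index", body={"query": {"match_all": {}}})
```

A load balancer with health checks in front of the nodes achieves the same effect for clients that only take a single endpoint.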
