2-node ES cluster becomes unavailable for 2-3 mins if one node (master) goes down


(Gaurav gupta) #1

One of our customers has a 2-node cluster (and doesn't want to add a 3rd node) which becomes inaccessible for 2-3 minutes if the first node (the master) goes down. Below is the error the user sees:

SearchPhaseExecutionException: Failed to execute phase [query_fetch], all shards failed

After a few minutes (2-3) the second node takes charge and starts responding to incoming requests.

Since we can't force the user to add a 3rd node to the cluster, and they need the second node purely for fault tolerance, can we suggest that the user wait when he gets a "SearchPhaseExecutionException" (or any such exception), and resume sending requests once the other node signals that a master is alive?
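That wait-and-resume idea could be wrapped in a small client-side helper. A rough sketch in Python; `retry_on_failure` and the wrapped search callable are hypothetical names, not part of any Elasticsearch client API:

```python
import time

def retry_on_failure(operation, max_wait_seconds=180, interval_seconds=5):
    """Keep retrying `operation` until it succeeds or the wait budget runs out.

    The idea is to ride out the 2-3 minute window while the surviving
    node is being elected master, instead of failing the request.
    """
    deadline = time.monotonic() + max_wait_seconds
    while True:
        try:
            return operation()
        except Exception:
            if time.monotonic() >= deadline:
                raise  # still failing after the budget is spent: give up
            time.sleep(interval_seconds)
```

In practice you would catch only the specific transport/search exceptions (e.g. the SearchPhaseExecutionException surfaced by your client) rather than bare `Exception`.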

Thoughts ?

Thanks
Gaurav


(Mark Walkom) #2

You need to find out why the shards are failing.

Are you monitoring ES? Check your logs too.


(Gaurav gupta) #3

Below is the exception message, which says that the first node (the master) has left or is not connected, because the user shut it down for testing purposes. Until the cluster elects and promotes Node2 as the new master, it throws the exception below for around 2-3 minutes:

SearchPhaseExecutionException: Failed to execute phase [query_fetch], all shards failed; shardFailures {[wvn04kFMTYCqNMW_9cKd1A][qlpanoramasearchindex2][0]: SendRequestTransportException[[node1][inet[/10.2.10.185:9300]][indices:data/read/search[phase/query+fetch]]]; nested: NodeNotConnectedException[[node1][inet[/10.2.10.185:9300]] Node not connected]; }


(Mark Walkom) #4

So why did the node leave? That's what you need to answer.
Check your network, firewalls etc. What does your config look like?


(Gaurav gupta) #5

Actually, the user is doing user acceptance testing, and one of the scenarios is to manually shut down or remove one of the nodes (i.e. remove the first node, which is currently the master). Since he is manually removing the master node, incoming requests fail with the exception for 2-3 minutes; after that, things work fine. Please also note that this occurs during load testing, when the user manually shuts down the master node. Shouldn't the 2nd node become master immediately? Is it accepted behaviour that electing a new master takes 2-3 minutes?

Should we try tweaking the settings below so that a new master is elected with less delay?

discovery.zen.fd.ping_timeout: 10s
discovery.zen.fd.ping_retries: 2
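If I read the zen discovery fault-detection settings correctly (on 1.x the defaults are ping_interval 1s, ping_timeout 30s, ping_retries 3), it can take roughly 30s × 3 = 90s just to declare the master dead before an election even starts, which lines up with the observed 2-3 minute outage. A tightened sketch (values are illustrative, not a recommendation):

```yaml
# defaults: ping_interval 1s, ping_timeout 30s, ping_retries 3
discovery.zen.fd.ping_interval: 1s   # how often nodes ping each other
discovery.zen.fd.ping_timeout: 10s   # how long to wait for each ping
discovery.zen.fd.ping_retries: 2     # failed pings before declaring a node dead
```

Note that lowering these too aggressively risks false master failures under heavy load or long GC pauses.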

Thanks
Gaurav


(Mark Walkom) #6

That would help.

Can I ask why you are doing this sort of testing?


(Gaurav gupta) #7

We are doing this type of testing because any node might go down in production too (due to a network issue, overheating, or some other reason). This test scenario is just to make sure that ES works reliably with minimal or no downtime in such cases, i.e. to verify the high-availability, fault-tolerant behaviour of an ES cluster even in the worst case.

Thanks
Gaurav


(Mark Walkom) #8

Are you running dedicated masters?


(Gaurav gupta) #9

No, we just list the IPs, as: discovery.zen.ping.unicast.hosts=152.144.226.42,152.144.226.12. Generally, we start the first node (152.144.226.42) first, and once it is up we start the 2nd node (152.144.226.12).

Note: we are using unicast instead of multicast.

Thanks


(Mark Walkom) #10

If you want the most fault-tolerant cluster then you will want dedicated masters.
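For reference, a dedicated master is just a node that is master-eligible but holds no data. A sketch using the standard node role settings (assuming a 1.x-era elasticsearch.yml):

```yaml
# elasticsearch.yml on a dedicated master node
node.master: true
node.data: false

# elasticsearch.yml on a data-only node
node.master: false
node.data: true
```

Be aware that with only two master-eligible nodes there is no safe quorum: `discovery.zen.minimum_master_nodes: 2` avoids split-brain but means no master can be elected while either node is down, which is why three master-eligible nodes are usually recommended.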


(Gaurav gupta) #11

After discussing with the user, I have come to know that they are using unicast with nodes like: discovery.zen.ping.unicast.hosts=10.2.10.185,10.9.10.185

And when they submit 5000 requests, they observe that after around 2000 requests are processed, all further requests fail with the error "SearchPhaseExecutionException: Failed to execute phase [query_fetch], all shards failed; shardFailures".

