2-node ES cluster becomes unavailable for 2-3 mins if one node (master) goes down


(Gaurav gupta) #1

One of our customers has a 2-node cluster (and doesn't want to add a 3rd node) which becomes inaccessible for 2-3 minutes if the first node (the master) goes down. Below is the error the user sees:

SearchPhaseExecutionException: Failed to execute phase [query_fetch], all shards failed

After a few minutes (2-3) the second node takes charge and starts responding to incoming requests.

Since we can't force the user to add a 3rd node to the cluster, and they need the second node purely for fault tolerance, can we suggest that the user wait when he gets a "SearchPhaseExecutionException" (or any such exception), and resume sending requests once the other node signals that a master is alive?
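That wait-and-resume idea could be wrapped in a small client-side helper. A rough sketch in Python; `retry_on_failure` and the wrapped search callable are hypothetical names, not part of any Elasticsearch client API:

```python
import time

def retry_on_failure(operation, max_wait_seconds=180, interval_seconds=5):
    """Keep retrying `operation` until it succeeds or the wait budget runs out.

    The idea is to ride out the 2-3 minute window while the surviving
    node is being elected master, instead of failing the request.
    """
    deadline = time.monotonic() + max_wait_seconds
    while True:
        try:
            return operation()
        except Exception:
            if time.monotonic() >= deadline:
                raise  # still failing after the budget is spent: give up
            time.sleep(interval_seconds)
```

In practice you would catch only the specific transport/search exceptions (e.g. the SearchPhaseExecutionException surfaced by your client) rather than bare `Exception`.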

Thoughts ?

Thanks
Gaurav


(Mark Walkom) #2

You need to find out why the shards are failing.

Are you monitoring ES? Check your logs too.


(Gaurav gupta) #3

Below is the exception message, which says that the first node (the master) has left or is not connected, because the user shut it down for testing purposes. Until the cluster elects and promotes Node2 as the new master, it throws the exception below for around 2-3 minutes:

SearchPhaseExecutionException: Failed to execute phase [query_fetch], all shards failed; shardFailures {[wvn04kFMTYCqNMW_9cKd1A][qlpanoramasearchindex2][0]: SendRequestTransportException[[node1][inet[/10.2.10.185:9300]][indices:data/read/search[phase/query+fetch]]]; nested: NodeNotConnectedException[[node1][inet[/10.2.10.185:9300]] Node not connected]; }


(Mark Walkom) #4

So why did the node leave? That's what you need to answer.
Check your network, firewalls etc. What does your config look like?


(Gaurav gupta) #5

Actually, the user is doing user acceptance testing, and one of the scenarios is to manually shut down or remove one of the nodes (i.e. remove the first node, which is currently the master). Since he is manually removing the master node, incoming requests fail with the exception for 2-3 minutes; after that, things work fine. Please also note that this occurs during load testing, when the user manually shuts down the master node. Shouldn't the 2nd node become master immediately? Is it accepted behaviour that electing a new master takes 2-3 minutes?

Should we try tweaking the settings below so that a new master is elected with less delay?

discovery.zen.fd.ping_timeout: 10s
discovery.zen.fd.ping_retries: 2
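If I read the zen discovery fault-detection settings correctly (on 1.x the defaults are ping_interval 1s, ping_timeout 30s, ping_retries 3), it can take roughly 30s × 3 = 90s just to declare the master dead before an election even starts, which lines up with the observed 2-3 minute outage. A tightened sketch (values are illustrative, not a recommendation):

```yaml
# defaults: ping_interval 1s, ping_timeout 30s, ping_retries 3
discovery.zen.fd.ping_interval: 1s   # how often nodes ping each other
discovery.zen.fd.ping_timeout: 10s   # how long to wait for each ping
discovery.zen.fd.ping_retries: 2     # failed pings before declaring a node dead
```

Note that lowering these too aggressively risks false master failures under heavy load or long GC pauses.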

Thanks
Gaurav


(Mark Walkom) #6

That would help.

Can I ask why you are doing this sort of testing?


(Gaurav gupta) #7

We are doing this type of testing because any node might go down in production too (due to a network issue, overheating, or some other reason). This test scenario is just to make sure that ES works reliably with minimal or no downtime in such cases, i.e. to verify the high-availability, fault-tolerant behaviour of an ES cluster even in the worst case.

Thanks
Gaurav


(Mark Walkom) #8

Are you running dedicated masters?


(Gaurav gupta) #9

No, we just list the IPs, as: discovery.zen.ping.unicast.hosts=152.144.226.42,152.144.226.12. Generally, we start the first node (152.144.226.42) first, and once it is up we start the 2nd node (152.144.226.12).

Note: we are using unicast instead of multicast.

Thanks


(Mark Walkom) #10

If you want the most fault-tolerant cluster then you will want dedicated masters.
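For reference, a dedicated master is just a node that is master-eligible but holds no data. A sketch using the standard node role settings (assuming a 1.x-era elasticsearch.yml):

```yaml
# elasticsearch.yml on a dedicated master node
node.master: true
node.data: false

# elasticsearch.yml on a data-only node
node.master: false
node.data: true
```

Be aware that with only two master-eligible nodes there is no safe quorum: `discovery.zen.minimum_master_nodes: 2` avoids split-brain but means no master can be elected while either node is down, which is why three master-eligible nodes are usually recommended.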


(Gaurav gupta) #11

After discussing with the user, I have come to know that they are using unicast with nodes like: discovery.zen.ping.unicast.hosts=10.2.10.185,10.9.10.185

And when they submit 5000 requests, they observe that after around 2000 requests are processed, all further requests fail with the error "SearchPhaseExecutionException: Failed to execute phase [query_fetch], all shards failed; shardFailures".

