Elastic cluster HA setup

jbacic · April 18, 2018, 7:05am

I have a question about Elasticsearch cluster settings. I managed to set up an Elasticsearch cluster on the Rancher platform, but I came across one weird (at least to me) issue. When one node goes down, the cluster seems to be unavailable for over 1.5 minutes. To be honest, I was hoping for faster node failure detection. As far as I understand, this can be sped up by changing:

discovery.zen.fd.ping_retries: 3 
discovery.zen.fd.ping_interval: 1s 
discovery.zen.fd.ping_timeout: 30s

Is there any reason I should leave such high values, or is it safe to set ping_timeout to, let's say, 5s? My health checks in Rancher are set to the defaults (interval: 2s, timeout: 2s, retries: 3) and they work just fine.

Thanks in advance for help!
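For reference, the values quoted above roughly account for the delay described. The sketch below is a simplified model, not Elasticsearch's actual fault-detection code: it assumes the pings are retried one after another, that each one waits the full ping_timeout before counting as failed, and that one ping_interval passes before the first ping. The function name is hypothetical.

```python
def worst_case_detection_seconds(ping_retries, ping_timeout_s, ping_interval_s):
    """Rough upper bound on the time to declare a silent node failed.

    Assumption (simplified model): one interval elapses before the first
    ping, then every retry waits the full timeout before it is counted
    as a failure.
    """
    return ping_interval_s + ping_retries * ping_timeout_s


# Values quoted in the post: retries=3, timeout=30s, interval=1s
quoted = worst_case_detection_seconds(3, 30, 1)

# Proposed change: same retries and interval, timeout reduced to 5s
proposed = worst_case_detection_seconds(3, 5, 1)

print(f"quoted settings:   ~{quoted}s")
print(f"proposed settings: ~{proposed}s")
```

Under these assumptions the quoted settings give about 91 seconds, which is in line with the "over 1.5 minutes" observed, and a 5s timeout would bring that down to about 16 seconds.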

DavidTurner · April 18, 2018, 7:29am

When you say "down" do you mean that Elasticsearch is shutting down cleanly or just that its container is being terminated? On a clean shutdown the settings you quote should have little effect.

The risk with reducing the fault detection values is that they will pick up more false positives. On a node failure Elasticsearch must rebalance itself, perhaps electing a new master, promoting some new shards to primaries, and recovering any missing shard copies, all of which can be quite a lot of work, so it's worth trying to avoid false positives.

jbacic · April 18, 2018, 8:21am

By “down” I meant host/network failure, not a clean shutdown.

I understand that rebalancing can be time-consuming, so how about increasing the recovery delay so it won’t kick in right after a node failure (I think there were some settings for that as well)? I assume that the failed node will be replaced (most probably without data loss) within a short period of time, so there’s no point in starting the rebalancing procedure.

DavidTurner · April 18, 2018, 3:59pm

Ok, this makes it indistinguishable from packet loss, and a temporary network failure is the most likely cause of that. This shouldn't make the cluster unavailable, but indexing that's trying to hit the lost node will wait for the shards on the lost node to either respond or time out and be failed, which might be what you're seeing.

Yes, I think you mean index.unassigned.node_left.delayed_timeout. That stops shards from being reallocated elsewhere when the master decides that a node has failed, but doesn't stop another master election, nor any replicas being promoted to primaries.
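As a rough illustration of changing that setting, the sketch below builds (but does not send) an index settings update request using only the Python standard library. The base URL http://localhost:9200, the "5m" value, and the function name are assumptions made for the example; pick a delay that matches how quickly a failed node is expected to come back.

```python
import json
import urllib.request


def build_delayed_timeout_request(base_url, delay):
    """Build a PUT request that sets the delayed-allocation timeout on all indices.

    base_url and delay are illustrative assumptions, e.g.
    "http://localhost:9200" and "5m".
    """
    body = {"settings": {"index.unassigned.node_left.delayed_timeout": delay}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_all/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_delayed_timeout_request("http://localhost:9200", "5m")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# Passing req to urllib.request.urlopen would send it; that step is left
# out so the sketch has no side effects on a real cluster.
```

As noted above, this only delays reallocation of the lost shard copies; it does not affect master election or the promotion of replicas to primaries.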

