Cluster stalls when nodes are removed (or the true meaning of expected_nodes)

One of my clusters has 8 nodes running 0.20.0.RC1. Most settings are the
default except for:

bootstrap.mlockall: true
transport.tcp.connect_timeout: 5s
gateway.expected_nodes: 8 <-- we'll get to this in a second
discovery.zen.minimum_master_nodes: 5
discovery.zen.ping.multicast.enabled: true

As you can see, the number of expected nodes is equal to the total number
of nodes in the cluster. If one of the nodes disappears from the cluster,
the cluster stalls completely for about 1-2 minutes.
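For context, the minimum_master_nodes value in the config above matches the usual majority-quorum rule for an 8-node cluster. A quick sanity check (a sketch of the arithmetic, not Elasticsearch code):

```python
def quorum(total_nodes: int) -> int:
    """Majority quorum: strictly more than half the nodes."""
    return total_nodes // 2 + 1

# For the 8-node cluster above:
print(quorum(8))  # -> 5, matching discovery.zen.minimum_master_nodes
```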

As a test, gateway.expected_nodes was reduced to 6. After changing the
setting, the cluster no longer stalls if a node disappears.

The general consensus is to set gateway.expected_nodes to the number of
nodes in the cluster. Is gateway recovery what is affecting the cluster? If
the cluster does not respond to requests during recovery, shouldn't
the gateway.expected_nodes value be set to something lower in case a node
goes down?

Ivan

--

This setting should have no effect after initial recovery. Are you sure
that the effect that you observed wasn't a coincidence?

On Thursday, January 10, 2013 8:35:17 PM UTC-5, Ivan Brusic wrote:

--

Expected nodes was a red herring. The true issue might be the ping timeout
for zen discovery. If a node is no longer ping-able, the cluster stalls.
Running some tests now; I will write more later with concrete details.

If a node disappears completely, should the cluster stall?

--
Ivan

On Mon, Jan 14, 2013 at 4:23 PM, Igor Motov imotov@gmail.com wrote:

--

Yeah, more information would be useful. It might also help to set logging
level for "discovery" to TRACE to see what's actually going on with pings
and connections between nodes. I would suspect that when a node disappears,
elasticsearch might not detect it quickly enough and during this time some
of the requests are getting directed to the disappeared node. How do you
simulate node disappearance by the way?
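One way to see whether requests are hanging during that window is to poll cluster health from outside the cluster with short timeouts. A minimal sketch, assuming the default HTTP port 9200; the host name and polling interval are made up:

```python
import time
import urllib.error
import urllib.request

def health_url(host: str = "localhost", port: int = 9200) -> str:
    # Ask the server to answer within 3s (it returns timed_out=true
    # instead of blocking); the client-side timeout below is a backstop.
    return f"http://{host}:{port}/_cluster/health?timeout=3s"

def poll_health(host: str = "localhost", attempts: int = 5) -> None:
    """Print health once a second; 'no response' lines mark a stall."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(health_url(host), timeout=5) as resp:
                print(resp.read().decode())
        except (urllib.error.URLError, OSError) as exc:
            print(f"no response: {exc}")
        time.sleep(1)
```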

On Monday, January 14, 2013 8:03:45 PM UTC-5, Ivan Brusic wrote:

--

Back at home, so I don't have much info. Already set the levels to TRACE
(uncommented the line in logging.yml). What is the difference between
discovery.zen.ping.timeout and the ping_timeout setting referenced here:
https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/discovery/zen/fd/NodesFaultDetection.java#L86

Once again, I don't have my notes right now, but after a cluster stalls,
the master logs (after the cluster comes back) something about retrying [3]
times for [30s]. Logging was set to INFO for that test, so I don't have
finer details (right now). The cluster was unresponsive during that time.

The nodes are on VMs, so the nodes disappeared when we took down the entire
VM host (2 ES nodes per host).

Discovery is done via multicast. I am assuming that the pings are
multicast pings, correct? I'm not a networking guru, but are these pings
different from "normal" pings? If so, is there a command-line utility that
does multicast ping?

Cheers,

Ivan

On Mon, Jan 14, 2013 at 6:20 PM, Igor Motov imotov@gmail.com wrote:

--

Discovery is done via multicast, but when nodes join the cluster they
establish connections that are used for all other communication including
pings.
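So the inter-node pings travel over the established TCP transport connections (port 9300 by default), not over multicast. A rough way to check those connections is to filter `netstat` output for established sessions on the transport port; a small helper with made-up sample data (the addresses are hypothetical):

```python
def transport_connections(netstat_output: str, port: int = 9300):
    """Return ESTABLISHED netstat lines mentioning the transport port."""
    needle = f":{port}"
    return [line for line in netstat_output.splitlines()
            if needle in line and "ESTABLISHED" in line]

sample = """\
tcp 0 0 10.0.0.1:9300 10.0.0.2:41234 ESTABLISHED
tcp 0 0 10.0.0.1:9200 10.0.0.9:55123 ESTABLISHED
tcp 0 0 10.0.0.1:9300 10.0.0.3:41235 TIME_WAIT
"""
# Only the first sample line is an established :9300 session.
print(transport_connections(sample))
```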

On Monday, January 14, 2013 10:25:23 PM UTC-5, Ivan Brusic wrote:

--

Updated the config with the following timeouts:

transport.tcp.connect_timeout: 5s
discovery.zen.ping.timeout: 1s <- ignored due to precedence
discovery.zen.ping_timeout: 2s
discovery.zen.fd.ping_timeout: 2s
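The precedence note above can be expressed as a lookup order. This is a sketch of the behavior as observed (the ping_timeout spelling winning over ping.timeout when both are set), not of the actual Elasticsearch source:

```python
def resolve_ping_timeout(settings: dict, default: str = "3s") -> str:
    # Observed behavior: discovery.zen.ping_timeout takes precedence
    # over discovery.zen.ping.timeout when both are present.
    return settings.get("discovery.zen.ping_timeout",
                        settings.get("discovery.zen.ping.timeout", default))

cfg = {"discovery.zen.ping.timeout": "1s",
       "discovery.zen.ping_timeout": "2s"}
print(resolve_ping_timeout(cfg))  # -> 2s: the 1s value is ignored
```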

A VM host containing nodes search8 and search11 was taken offline. Node
search6 was the master. All nodes took over a minute between node-removed
messages.

[2013-01-15 11:09:34,696][INFO ][cluster.service ] [search12]
removed {[search8][P7UNCh9oTE623RI8w_zsPw][inet[/:9300]],}, reason:
zen-disco-receive(from master
[[search6][7d_0aK3XTiiWh_OGLSffig][inet[/:9300]]])
[2013-01-15 11:10:52,616][INFO ][cluster.service ] [search12]
removed {[srch-lv111.corp.shop.com][OMMr4k2DRgSsvMRU8vE-eQ][inet[/:9300]],},
reason: zen-disco-receive(from master
[[search6][7d_0aK3XTiiWh_OGLSffig][inet[/:9300]]])

The log for the master is here: https://gist.github.com/60bdfafd6273c05a3417

The cluster was in a red state and unresponsive during this time. These
outages are part of our testing of the failover capabilities of both the
VMs and the cluster. Having the cluster go offline completely is not a good
situation to be in, but elevated search times would be acceptable.

Let me know what else I can provide to help narrow down the issue.

Ivan

On Tue, Jan 15, 2013 at 6:35 AM, Igor Motov imotov@gmail.com wrote:

--

Are you running any plugins that listen to cluster state changes on your
nodes?

On Tuesday, January 15, 2013 2:38:42 PM UTC-5, Ivan Brusic wrote:

--

None.

[INFO ][plugins ] [search8] loaded [], sites [bigdesk,
head]

The timeouts proved to be too low, and the cluster was removing nodes
too quickly (duh!). Upping the timeouts for now.

--
Ivan

On Tue, Jan 15, 2013 at 3:03 PM, Igor Motov imotov@gmail.com wrote:

--

Looking into the various timeouts. The first 30-second pause occurs here:

[2013-01-15 11:09:35,108][DEBUG][discovery.zen.fd ] [search6] [node ]
failed to ping [[search11][OMMr4k2DRgSsvMRU8vE-eQ][inet[/:9300]]], tried
[3] times, each with maximum [2s] timeout

[2013-01-15 11:10:04,693][DEBUG][indices.store ] [search6]
failed to execute on node [OMMr4k2DRgSsvMRU8vE-eQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException:
[search11][inet[/:9300]][/cluster/nodes/indices/shard/store/n]
request_id [289991] timed out after [30000ms]
at
org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:342)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)

So TransportService blocked for 30 seconds. I can't find where this
timeout is set. The closest I can find is the ping_timeout set
in TransportClientNodesService. I am assuming it is
client.transport.ping_timeout. Transport is threaded, so I am unsure why
the cluster stalls during fault detection.
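For reference, the fault-detection line in the log above ("tried [3] times, each with maximum [2s] timeout") is consistent with detection taking roughly retries × per-ping timeout. Rough arithmetic, ignoring any pause between ping rounds:

```python
def fd_worst_case_seconds(retries: int, ping_timeout_s: float) -> float:
    """Worst-case time for zen fault detection to give up on a node,
    ignoring the interval between ping rounds."""
    return retries * ping_timeout_s

print(fd_worst_case_seconds(3, 2.0))  # -> 6.0 s with the settings above
```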

--
Ivan

On Tue, Jan 15, 2013 at 3:13 PM, Ivan Brusic ivan@brusic.com wrote:

--