Cluster stalls when nodes are removed (or the true meaning of expected_nodes)

One of my clusters has 8 nodes running 0.20.0.RC1. Most settings are the
default except for:

bootstrap.mlockall: true
transport.tcp.connect_timeout: 5s
gateway.expected_nodes: 8 <-- we'll get to this in a second
discovery.zen.minimum_master_nodes: 5
discovery.zen.ping.multicast.enabled: true

As you can see, the number of expected nodes is equal to the total number
of nodes in the cluster. If one of the nodes disappears from the cluster,
the cluster stalls completely for about 1-2 minutes.
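For context, the minimum_master_nodes value in the config above matches the usual majority-quorum rule for an 8-node cluster. A quick sanity check (a sketch of the arithmetic, not Elasticsearch code):

```python
def quorum(total_nodes: int) -> int:
    """Majority quorum: strictly more than half the nodes."""
    return total_nodes // 2 + 1

# For the 8-node cluster above:
print(quorum(8))  # -> 5, matching discovery.zen.minimum_master_nodes
```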

As a test, gateway.expected_nodes was reduced to 6. After changing the
setting, the cluster no longer stalls if a node disappears.

The general consensus is to set gateway.expected_nodes to the number of
nodes in the cluster. Is gateway recovery what is affecting the cluster? If
the cluster does not respond to requests during recovery, shouldn't
the gateway.expected_nodes value be set to something lower in case a node
goes down?

Ivan

--

This setting should have no effect after initial recovery. Are you sure
that the effect that you observed wasn't a coincidence?

On Thursday, January 10, 2013 8:35:17 PM UTC-5, Ivan Brusic wrote:

--

Expected nodes was a red herring. The true issue might be the ping timeout
for zen discovery. If a node is no longer ping-able, the cluster stalls.
Running some tests now; I will write more later with concrete details.

If a node disappears completely, should the cluster stall?

--
Ivan

On Mon, Jan 14, 2013 at 4:23 PM, Igor Motov imotov@gmail.com wrote:

--

Yeah, more information would be useful. It might also help to set logging
level for "discovery" to TRACE to see what's actually going on with pings
and connections between nodes. I would suspect that when a node disappears,
elasticsearch might not detect it quickly enough and during this time some
of the requests are getting directed to the disappeared node. How do you
simulate node disappearance by the way?
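One way to see whether requests are hanging during that window is to poll cluster health from outside the cluster with short timeouts. A minimal sketch, assuming the default HTTP port 9200; the host name and polling interval are made up:

```python
import time
import urllib.error
import urllib.request

def health_url(host: str = "localhost", port: int = 9200) -> str:
    # Ask the server to answer within 3s (it returns timed_out=true
    # instead of blocking); the client-side timeout below is a backstop.
    return f"http://{host}:{port}/_cluster/health?timeout=3s"

def poll_health(host: str = "localhost", attempts: int = 5) -> None:
    """Print health once a second; 'no response' lines mark a stall."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(health_url(host), timeout=5) as resp:
                print(resp.read().decode())
        except (urllib.error.URLError, OSError) as exc:
            print(f"no response: {exc}")
        time.sleep(1)
```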

On Monday, January 14, 2013 8:03:45 PM UTC-5, Ivan Brusic wrote:

--

Back at home, so I don't have much info. Already set the levels to TRACE
(uncommented the line in logging.yml). What is the difference between
discovery.zen.ping.timeout and the ping_timeout setting referenced here:
https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/discovery/zen/fd/NodesFaultDetection.java#L86

Once again, I don't have my notes right now, but after a cluster stalls,
the master logs (after the cluster comes back) something about retrying [3]
times for [30s]. Logging was set to INFO for that test, so I don't have
finer details (right now). The cluster was unresponsive during that time.

The nodes are on VMs, so the nodes disappeared when we took down the entire
VM host (2 ES nodes per host).

Discovery is done via multicast. I am assuming that the pings are
multicast pings, correct? I'm not a networking guru, but are these pings
different from "normal" pings? If so, is there a command-line utility that
does multicast ping?

Cheers,

Ivan

On Mon, Jan 14, 2013 at 6:20 PM, Igor Motov imotov@gmail.com wrote:

--

Discovery is done via multicast, but when nodes join the cluster they
establish connections that are used for all other communication including
pings.
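So the inter-node pings travel over the established TCP transport connections (port 9300 by default), not over multicast. A rough way to check those connections is to filter `netstat` output for established sessions on the transport port; a small helper with made-up sample data (the addresses are hypothetical):

```python
def transport_connections(netstat_output: str, port: int = 9300):
    """Return ESTABLISHED netstat lines mentioning the transport port."""
    needle = f":{port}"
    return [line for line in netstat_output.splitlines()
            if needle in line and "ESTABLISHED" in line]

sample = """\
tcp 0 0 10.0.0.1:9300 10.0.0.2:41234 ESTABLISHED
tcp 0 0 10.0.0.1:9200 10.0.0.9:55123 ESTABLISHED
tcp 0 0 10.0.0.1:9300 10.0.0.3:41235 TIME_WAIT
"""
# Only the first sample line is an established :9300 session.
print(transport_connections(sample))
```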

On Monday, January 14, 2013 10:25:23 PM UTC-5, Ivan Brusic wrote:

--

Updated the config with the following timeouts:

transport.tcp.connect_timeout: 5s
discovery.zen.ping.timeout: 1s <- ignored due to precedence
discovery.zen.ping_timeout: 2s
discovery.zen.fd.ping_timeout: 2s
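The precedence note above can be expressed as a lookup order. This is a sketch of the behavior as observed (the ping_timeout spelling winning over ping.timeout when both are set), not of the actual Elasticsearch source:

```python
def resolve_ping_timeout(settings: dict, default: str = "3s") -> str:
    # Observed behavior: discovery.zen.ping_timeout takes precedence
    # over discovery.zen.ping.timeout when both are present.
    return settings.get("discovery.zen.ping_timeout",
                        settings.get("discovery.zen.ping.timeout", default))

cfg = {"discovery.zen.ping.timeout": "1s",
       "discovery.zen.ping_timeout": "2s"}
print(resolve_ping_timeout(cfg))  # -> 2s: the 1s value is ignored
```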

A VM host containing nodes search8 and search11 was taken offline. Node
search6 was the master. All nodes took over a minute between node-removed
messages.

[2013-01-15 11:09:34,696][INFO ][cluster.service ] [search12]
removed {[search8][P7UNCh9oTE623RI8w_zsPw][inet[/:9300]],}, reason:
zen-disco-receive(from master
[[search6][7d_0aK3XTiiWh_OGLSffig][inet[/:9300]]])
[2013-01-15 11:10:52,616][INFO ][cluster.service ] [search12]
removed {[srch-lv111.corp.shop.com][OMMr4k2DRgSsvMRU8vE-eQ][inet[/:9300]],},
reason: zen-disco-receive(from master
[[search6][7d_0aK3XTiiWh_OGLSffig][inet[/:9300]]])

The log for the master is here: https://gist.github.com/60bdfafd6273c05a3417

The cluster was in a red state and unresponsive during this time. These
outages are part of our testing of the failover capabilities of both the
VMs and the cluster. Having the cluster go offline completely is not a good
situation to be in, but elevated search times would be acceptable.

Let me know what else I can provide to help narrow down the issue.

Ivan

On Tue, Jan 15, 2013 at 6:35 AM, Igor Motov imotov@gmail.com wrote:

--

Are you running any plugins that listen to cluster state changes on your
nodes?

On Tuesday, January 15, 2013 2:38:42 PM UTC-5, Ivan Brusic wrote:

--

None.

[INFO ][plugins ] [search8] loaded [], sites [bigdesk,
head]

The timeouts proved to be too low, and the cluster was removing nodes
too quickly (duh!). Upping the timeouts for now.

--
Ivan

On Tue, Jan 15, 2013 at 3:03 PM, Igor Motov imotov@gmail.com wrote:

--

Looking into the various timeouts. The first 30-second pause occurs here:

[2013-01-15 11:09:35,108][DEBUG][discovery.zen.fd ] [search6] [node ]
failed to ping [[search11][OMMr4k2DRgSsvMRU8vE-eQ][inet[/:9300]]], tried
[3] times, each with maximum [2s] timeout

[2013-01-15 11:10:04,693][DEBUG][indices.store ] [search6]
failed to execute on node [OMMr4k2DRgSsvMRU8vE-eQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException:
[search11][inet[/:9300]][/cluster/nodes/indices/shard/store/n]
request_id [289991] timed out after [30000ms]
at
org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:342)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)

So TransportService blocked for 30 seconds. I can't find where this
timeout is set. The closest I can find is the ping_timeout set
in TransportClientNodesService. I am assuming it is
client.transport.ping_timeout. Transport is threaded, so I am unsure why
the cluster stalls during fault detection.
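For reference, the fault-detection line in the log above ("tried [3] times, each with maximum [2s] timeout") is consistent with detection taking roughly retries × per-ping timeout. Rough arithmetic, ignoring any pause between ping rounds:

```python
def fd_worst_case_seconds(retries: int, ping_timeout_s: float) -> float:
    """Worst-case time for zen fault detection to give up on a node,
    ignoring the interval between ping rounds."""
    return retries * ping_timeout_s

print(fd_worst_case_seconds(3, 2.0))  # -> 6.0 s with the settings above
```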

--
Ivan

On Tue, Jan 15, 2013 at 3:13 PM, Ivan Brusic ivan@brusic.com wrote:

--