Elasticsearch cluster timeout when node dies

jjheinon · December 16, 2013, 4:39pm

I have an Elasticsearch cluster with three servers (testnode00, testnode01,
testnode02), with two Elasticsearch instances running on each server (ports
9300 and 9301). Total 6 instances.
The instances have been configured with
cluster.routing.allocation.awareness.attributes=zone,tag setting so that
instances running on the same server can both die and the cluster still
works properly.

Config file in https://gist.github.com/jjheinon/7989423

This works in practice too: I can shut down both instances on the same
server and everything still works.

Everything works fine until I actually shut down one of the servers (e.g.
testnode01).

Then the whole cluster becomes unresponsive.

The basic status requests do work:

curl 'http://testnode00:9200/'
->
{
  "ok" : true,
  "status" : 200,
  "name" : "testnode00_ebs",
  "version" : {
    "number" : "0.90.5",
    "build_hash" : "c8714e8e0620b62638f660f6144831792b9dedee",
    "build_timestamp" : "2013-09-17T13:09:46Z",
    "build_snapshot" : false,
    "lucene_version" : "4.4"
  },
  "tagline" : "You Know, for Search"
}

Cluster health request also works:

curl 'http://testnode00:9200/_cluster/health'
->
{
  "active_primary_shards": 120,
  "active_shards": 240,
  "cluster_name": "test_cluster",
  "initializing_shards": 0,
  "number_of_data_nodes": 6,
  "number_of_nodes": 6,
  "relocating_shards": 2,
  "status": "green",
  "timed_out": false,
  "unassigned_shards": 0
}

But the node stats request times out:

curl 'http://testnode00:9200/_nodes/stats'

-> Timeout

Search requests no longer work either:

curl 'http://testnode00:9200/_search/?q=name:test'

-> Timeout
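
When reproducing this, it can help to give every request an explicit client-side timeout, so that a hung endpoint is reported instead of blocking the shell. The following is a minimal Python sketch and not part of the original setup: the `probe` and `probe_all` helpers and the 5-second timeout are assumptions chosen for illustration; only the host name and the paths come from the requests above.

```python
import socket
import urllib.error
import urllib.request

# Paths taken from the requests shown in this post.
PATHS = ["/", "/_cluster/health", "/_nodes/stats", "/_search/?q=name:test"]


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return a short status string for one GET request.

    `opener` is injectable so the function can be exercised without a cluster.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return "HTTP %d" % resp.status
    except socket.timeout:
        # Raised directly when the request was sent but the response stalls.
        return "timeout"
    except urllib.error.URLError as exc:
        # urllib wraps connect-phase failures (including timeouts) in URLError.
        if isinstance(exc.reason, socket.timeout):
            return "timeout"
        return "error: %s" % exc.reason


def probe_all(base_url, paths=PATHS, timeout=5.0, opener=urllib.request.urlopen):
    """Probe each path under base_url and return {path: status string}."""
    return {p: probe(base_url + p, timeout, opener) for p in paths}
```

Calling `probe_all("http://testnode00:9200")` would return one status string per path, which makes it easy to see which endpoints answer and which hang.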

Nothing appears in the Elasticsearch log when the server itself is shut down.
If I manually shut down both Elasticsearch instances on the server, I get the
node disconnect messages in the log, everything fails over properly, and all
of the above requests work.

[2013-12-16 15:57:59,945][DEBUG][action.admin.cluster.node.stats] [testnode00_ebs] failed to execute on node [Th4-MYtTTdGh3wZFh3W4vA]
org.elasticsearch.transport.NodeDisconnectedException: [testnode01_ebs][inet[/10.43.129.161:9300]][cluster/nodes/stats/n] disconnected

Any ideas why unicast discovery doesn't detect the missing server?
discovery.zen.ping.timeout does not seem to help. And why does the
_nodes/stats request fail if one of the nodes is unresponsive?
Is there a way to tune TTL values for requests between Elasticsearch nodes?
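
For context on the detection question: in zen discovery, `discovery.zen.ping.timeout` applies to the discovery and election pings, while ongoing failure detection is configured separately through `discovery.zen.fd.ping_interval`, `discovery.zen.fd.ping_timeout` and `discovery.zen.fd.ping_retries` (documented defaults 1s, 30s and 3). A server that is powered off also sends no TCP close, so its peers only notice once those pings time out. The sketch below is a back-of-the-envelope estimate in Python, not the exact logic Elasticsearch uses; the function name and the formula (one interval plus every retry waiting out the full timeout) are assumptions.

```python
def rough_detection_seconds(ping_interval=1.0, ping_timeout=30.0, ping_retries=3):
    """Rough upper bound on how long a silently dead node may go unnoticed.

    Assumes the node dies right after a successful ping, so one full interval
    passes before the next ping, and that every retry then waits out the
    full timeout. This is an approximation, not Elasticsearch's exact logic.
    """
    return ping_interval + ping_timeout * ping_retries


# Documented defaults versus a hypothetical, more aggressive tuning.
default_estimate = rough_detection_seconds()
tuned_estimate = rough_detection_seconds(ping_interval=1.0, ping_timeout=5.0, ping_retries=3)
```

Under these assumptions the defaults allow roughly a minute and a half before a dead server is dropped, which may be worth ruling out before looking elsewhere.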

Additional question:
Is there a way to tell the cloud-aws EC2 discovery plugin to find two
instances on a single server, or does it detect only the first one (on port
9300, not the one on 9301)?

Regards,
// Janne

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/72c88b17-8c26-4d0c-b8e6-3ef034614c96%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jjheinon · December 16, 2013, 9:45pm

Actually figured out a solution.

For some reason, the node was running out of threads. There were plenty of
resources available, but the default Elasticsearch thread pool settings were
too low.

Setting the thread pool sizes manually fixed the issue; server failure
detection and timeouts now work again:

https://gist.github.com/jjheinon/7994672

cluster settings

curl -XPUT testnode00:9200/_cluster/settings -d '
{
"persistent":{
"threadpool.get.queue_size":"1000",
"threadpool.search.queue_size":"100",
"threadpool.index.size":"50",
"threadpool.search.type":"fixed",
"threadpool.index.queue_size":"100",
"threadpool.search.size":"600",
"threadpool.search.reject_policy":"caller",

[The gist preview is truncated here.]

I'm not yet sure exactly which setting caused the problem.
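
One way to narrow down which setting matters would be to apply the settings one at a time. The sketch below is only an illustration in Python: it reuses the seven keys visible in the truncated preview above (the rest of the gist is not reproduced), and the helper name `single_setting_payloads` is hypothetical. It only builds request bodies for the same `_cluster/settings` endpoint; it does not send anything.

```python
import json

# The seven settings visible in the truncated gist preview; anything beyond
# that point in the gist is deliberately not guessed at here.
VISIBLE_SETTINGS = {
    "threadpool.get.queue_size": "1000",
    "threadpool.search.queue_size": "100",
    "threadpool.index.size": "50",
    "threadpool.search.type": "fixed",
    "threadpool.index.queue_size": "100",
    "threadpool.search.size": "600",
    "threadpool.search.reject_policy": "caller",
}


def single_setting_payloads(settings, scope="persistent"):
    """Yield (key, JSON body) pairs, one cluster-settings body per key."""
    for key, value in settings.items():
        yield key, json.dumps({scope: {key: value}})


# Build the bodies up front so they can be applied and tested one by one.
payloads = list(single_setting_payloads(VISIBLE_SETTINGS))
```

Each body could then be sent with the same `curl -XPUT testnode00:9200/_cluster/settings -d '...'` call used above, repeating the server shutdown test after each one. Settings that belong together, such as `threadpool.search.type` and `threadpool.search.size`, may need to be applied as a pair.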

// Janne

