Elasticsearch cluster timeout when node dies


(jjheinon) #1

I have an Elasticsearch cluster with three servers (testnode00, testnode01,
testnode02), with two Elasticsearch instances running on each server (ports
9300 and 9301). Total 6 instances.
The instances have been configured with
cluster.routing.allocation.awareness.attributes=zone,tag setting so that
instances running on the same server can both die and the cluster still
works properly.

Config file in https://gist.github.com/jjheinon/7989423

This works in practice too: I can shut down both instances on the same
server and everything still works.

Everything works fine until I actually shut down one of the servers (e.g.
testnode01).

Then the whole cluster becomes unresponsive.

The basic status requests do work:

curl 'http://testnode00:9200/'
->
{
"ok" : true,
"status" : 200,
"name" : "testnode00_ebs",
"version" : {
"number" : "0.90.5",
"build_hash" : "c8714e8e0620b62638f660f6144831792b9dedee",
"build_timestamp" : "2013-09-17T13:09:46Z",
"build_snapshot" : false,
"lucene_version" : "4.4"
},
"tagline" : "You Know, for Search"
}

Cluster health request also works:

curl 'http://testnode00:9200/_cluster/health'
->
{
"active_primary_shards" : 120,
"active_shards" : 240,
"cluster_name" : "test_cluster",
"initializing_shards" : 0,
"number_of_data_nodes" : 6,
"number_of_nodes" : 6,
"relocating_shards" : 2,
"status" : "green",
"timed_out" : false,
"unassigned_shards" : 0
}
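
One detail worth noting, assuming this output was captured after testnode01 had been shut down: it still reports 6 nodes and a green status, which would suggest the master had not yet noticed that two instances were gone. A minimal Python sketch of that check follows. The function name `looks_stale` and the `live_nodes` count are illustrative, not part of Elasticsearch.

```python
import json

def looks_stale(health: dict, live_nodes: int) -> bool:
    """True when cluster health still counts more nodes than are actually alive."""
    return health["number_of_nodes"] > live_nodes

# Health response copied from the post above.
health = json.loads(
    '{"active_primary_shards":120,"active_shards":240,'
    '"cluster_name":"test_cluster","initializing_shards":0,'
    '"number_of_data_nodes":6,"number_of_nodes":6,'
    '"relocating_shards":2,"status":"green",'
    '"timed_out":false,"unassigned_shards":0}'
)

# Assumption: with testnode01 powered off, at most 4 of the 6 instances are alive.
print(looks_stale(health, live_nodes=4))
```

If the check returns True long after the server went down, the cluster state has not caught up with reality, and requests that fan out to every node can be expected to hang on the dead ones.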

But the node stats request times out:

curl 'http://testnode00:9200/_nodes/stats'

-> Timeout

Search requests no longer work either:

curl 'http://testnode00:9200/_search/?q=name:test'

-> Timeout
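
To make the difference between the working and the hanging endpoints easy to reproduce, the requests above can be run with an explicit client-side timeout. This is a minimal sketch using only the Python standard library. The helper name `probe` and the result strings are assumptions made for illustration.

```python
import socket
import urllib.error
import urllib.request

def probe(url: str, timeout: float = 5.0) -> str:
    """Issue one GET and return 'ok', 'timeout', or 'error'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
        return "ok"
    except urllib.error.URLError as exc:
        # Timeouts during connect arrive wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            return "timeout"
        return "error"
    except socket.timeout:
        # Timeouts while reading the response are raised directly.
        return "timeout"
    except OSError:
        return "error"
```

For example, calling `probe("http://testnode00:9200/_nodes/stats", timeout=10)` would be expected to return `timeout` in the situation described here, while the same call against `/_cluster/health` returns `ok`.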

There's nothing visible in the Elasticsearch log when the server is shut
down. If I manually shut down both Elasticsearch instances on the server, I
get the node disconnect messages in the log, everything fails over
properly, and all of the above requests work.

[2013-12-16 15:57:59,945][DEBUG][action.admin.cluster.node.stats] [testnode00_ebs] failed to execute on node [Th4-MYtTTdGh3wZFh3W4vA]
org.elasticsearch.transport.NodeDisconnectedException: [testnode01_ebs][inet[/10.43.129.161:9300]][cluster/nodes/stats/n] disconnected

Any ideas why the unicast discovery won't detect missing servers?
discovery.zen.ping.timeout does not seem to help. And why doesn't the
_nodes/stats request work if one of the nodes is unresponsive?
Is there a way to tune TTL values for requests between Elasticsearch nodes?
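
On the detection question: as far as I know, discovery.zen.ping.timeout only affects the discovery pings, while detection of dead nodes is controlled by the discovery.zen.fd.* settings (ping_interval, ping_timeout, ping_retries, with defaults of 1s, 30s and 3 in this release line, to my knowledge). When a whole server disappears, its TCP connections are not closed cleanly, so the other nodes have to wait for those pings to fail. A rough estimate of that wait, assuming retries are sent back to back, is sketched below. It is a simplification for illustration, not the exact algorithm Elasticsearch uses.

```python
def worst_case_detection_seconds(ping_interval: float,
                                 ping_timeout: float,
                                 ping_retries: int) -> float:
    """Rough upper bound on the time before a silently dead node is dropped.

    Assumes one interval passes before the first ping is sent and that every
    retry waits for the full timeout.
    """
    return ping_interval + ping_timeout * ping_retries

# Assumed defaults: 1s interval, 30s timeout, 3 retries.
print(worst_case_detection_seconds(1, 30, 3))  # 91
```

Under these assumptions a dead server could go unnoticed for around a minute and a half with the default values, which is consistent with clean process shutdowns being detected immediately and a powered-off server not being detected.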

Additional question:
Is there a way to tell the cloud-aws EC2 discovery plugin to find two
instances on a single server, or does it detect only the first one (on port
9300) and not the one on 9301?

Regards,
// Janne

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/72c88b17-8c26-4d0c-b8e6-3ef034614c96%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(jjheinon) #2

Actually figured out a solution.

For some reason, the node was running out of threads. There were plenty of
resources available, but the default Elasticsearch threadpool settings were
too low.

Setting threadpool sizes manually fixed the issue, now the server failure
detection and timeouts work again:

I'm not yet sure exactly which setting caused the problem.
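
One way to narrow down which pool was the bottleneck is to look at the thread pool counters in a node stats response. The sketch below assumes the payload has the shape nodes -> node id -> thread_pool -> pool name -> counters, with `queue` and `rejected` fields. That shape, the function name `saturated_pools`, and the sample data are all assumptions for illustration, not taken from this thread.

```python
def saturated_pools(node_stats: dict) -> list:
    """Return sorted (node_name, pool_name) pairs that show queued or rejected work."""
    hits = []
    for node in node_stats.get("nodes", {}).values():
        for pool, counters in node.get("thread_pool", {}).items():
            if counters.get("queue", 0) > 0 or counters.get("rejected", 0) > 0:
                hits.append((node.get("name"), pool))
    return sorted(hits)

# Hypothetical sample payload, not taken from the thread.
sample = {"nodes": {"abc": {"name": "testnode00_ebs", "thread_pool": {
    "management": {"threads": 5, "active": 5, "queue": 12, "rejected": 0},
    "search": {"threads": 24, "active": 1, "queue": 0, "rejected": 0},
}}}}

print(saturated_pools(sample))  # [('testnode00_ebs', 'management')]
```

Since _nodes/stats itself fans out to every node, it may be the very call that hangs. The input here is whatever stats payload can still be fetched.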

// Janne


