Cluster hanging on node failure

Max_Charas · February 18, 2015, 7:30pm

Hello all of you bright people,

We’re currently running a smallish 300 GB cluster in production on 5 nodes
with around 30 million docs. Everything works flawlessly except when a node
really goes down (I mean network failure, hardware failure, or kill -9).

When we lose a node the cluster becomes more or less completely
unresponsive for a few minutes, for both indexing and querying. This
is, of course, less than ideal as we have load 24/7.

I would really appreciate some help with understanding best practice
settings to have a robust cluster.

Our first goal is for the cluster not to become unresponsive in the
event of a node crash. After reading everything I could find on the web, I
still can't tell whether ES is designed to be unresponsive for
ping_retries * ping_timeout seconds or whether the cluster will continue to
serve query requests during this time. Could anyone help shed light on
this?
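
To put a number on the ping_retries * ping_timeout window mentioned above, here is a rough Python sketch of the arithmetic. The values 30 seconds and 3 retries are assumptions (they are what I understand the zen fault-detection defaults to be in ES 1.x; they are not set in the config below), and the function name is made up for illustration. It only computes the window; it says nothing about whether queries are served during it.

```python
def detection_window_seconds(ping_timeout_s: float, ping_retries: int) -> float:
    """Rough worst-case time before a silent node is declared failed.

    Each fault-detection ping may wait up to ping_timeout_s, and the node
    is only dropped after ping_retries consecutive failures. The pause
    between pings (ping_interval) is ignored here for simplicity.
    """
    return ping_timeout_s * ping_retries


# Assumed values, not taken from the posted config.
ASSUMED_PING_TIMEOUT_S = 30.0
ASSUMED_PING_RETRIES = 3

window = detection_window_seconds(ASSUMED_PING_TIMEOUT_S, ASSUMED_PING_RETRIES)
print(f"Node declared failed after roughly {window:.0f} seconds")
```

With those assumed values the window is on the order of a minute and a half, which is at least in the same range as the "few minutes" of unresponsiveness described above.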

Secondly, in the event of an even worse failure where the cluster goes into
red state, would it be possible to allow the cluster to still serve
read/query requests?

I would be ever so grateful for anyone willing to help me understand how
this works or what we would need to change to make our ES installation more
robust.

I’ve included our config here:

cluster.name: clustername
node.name: nodename
path.data: /index
node.master: true
node.data: true
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.multicast.ping.enabled: false
discovery.zen.ping.unicast.enabled: true
discovery.zen.ping.unicast.hosts: ["host1","host2","host3"]
bootstrap.mlockall: true
index.number_of_shards: 10
action.disable_delete_all_indices: true
marvel.agent.exporter.es.hosts: ["marvel:9200"]
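
As a sanity check on discovery.zen.minimum_master_nodes: 3, the usual majority rule (master-eligible nodes / 2, rounded down, plus 1) can be sketched in Python. This assumes all 5 nodes run with node.master: true as in the config above; the function names are made up for illustration.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Majority of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def tolerated_failures(master_eligible_nodes: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible_nodes - quorum(master_eligible_nodes)


# Assumption: all 5 nodes in the cluster are master-eligible.
nodes = 5
print(f"quorum for {nodes} nodes: {quorum(nodes)}")
print(f"failures tolerated: {tolerated_failures(nodes)}")
```

Under that assumption the value 3 matches the majority rule, so losing a single node should still leave enough master-eligible nodes to elect a master.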

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/bb1d307b-8c00-469d-81fb-8067942d02ad%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Max_Charas · February 19, 2015, 6:21pm

I posted here too:

Would love to get some help with this.

Best,
Max

