Cluster hanging on node failure

Hello all of you bright people,

We’re currently running a smallish 300 GB cluster in production on 5 nodes
with around 30 million docs. Everything works flawlessly except when a node
really goes down (network failure, hardware failure, or kill -9).

When we lose a node, the cluster becomes more or less completely
unresponsive for a few minutes, for both indexing and querying. This
is, of course, less than ideal as we have load 24/7.

I would really appreciate some help with understanding best practice
settings to have a robust cluster.

Our first goal is for the cluster to not become unresponsive in the
event of a node crash. After reading everything I could find on the web, I
still can't tell whether ES is designed to be unresponsive for
ping_retries*ping_timeout seconds or whether the cluster will continue to
serve query requests during this time. Could anyone help shed light on
this?
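
To make the ping_retries*ping_timeout figure concrete, here is a rough sketch of the arithmetic. The values 30 seconds and 3 retries are an assumption on my part (I believe they are the defaults for discovery.zen.fd.ping_timeout and discovery.zen.fd.ping_retries in ES 1.x, since our config does not override them), and the function name is made up for illustration. It ignores the ping interval and any time spent on master election or shard reallocation.

```python
def detection_window_seconds(ping_timeout_s: float, ping_retries: int) -> float:
    """Rough upper bound on how long a dead node goes unnoticed.

    Simply multiplies the per-ping timeout by the number of retries;
    it does not model the ping interval or anything after detection.
    """
    if ping_timeout_s < 0 or ping_retries < 1:
        raise ValueError("timeout must be >= 0 and retries >= 1")
    return ping_timeout_s * ping_retries


# Assumed values (not taken from our config): 30 s timeout, 3 retries.
window = detection_window_seconds(ping_timeout_s=30, ping_retries=3)
print(f"Worst-case detection window: {window} s")
```

If those assumed values are right, the window alone would be on the order of a minute and a half, which is in the same range as the outage we observe.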

Secondly, in the event of an even worse failure where the cluster goes into
red state, would it be possible to allow the cluster to still serve
read/query requests?

I would be ever so grateful for anyone willing to help me understand how
this works or what we would need to change to make our ES installation more
robust.

I’ve included our config here:

cluster.name: clustername
node.name: nodename
path.data: /index
node.master: true
node.data: true
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.multicast.ping.enabled: false
discovery.zen.ping.unicast.enabled: true
discovery.zen.ping.unicast.hosts: ["host1","host2","host3"]
bootstrap.mlockall: true
index.number_of_shards: 10
action.disable_delete_all_indices: true
marvel.agent.exporter.es.hosts: ["marvel:9200"]
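
As a sanity check on discovery.zen.minimum_master_nodes, the usual quorum rule of (master-eligible nodes / 2) + 1 can be sketched as below. This assumes all 5 nodes use the config above and are therefore master-eligible; the function names are hypothetical and only illustrate the arithmetic.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Majority of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def can_elect_master(alive_master_eligible: int, minimum_master_nodes: int) -> bool:
    """True if enough master-eligible nodes remain to elect a master."""
    return alive_master_eligible >= minimum_master_nodes


# Assumption: all 5 nodes are master-eligible (node.master: true).
required = quorum(5)
print(f"quorum for 5 nodes: {required}")
for lost in range(0, 4):
    alive = 5 - lost
    print(f"lost {lost} node(s): can elect master = {can_elect_master(alive, required)}")
```

Under that assumption the configured value of 3 matches the quorum, so losing a single node should still leave enough nodes to elect a master.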

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/bb1d307b-8c00-469d-81fb-8067942d02ad%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

I posted here too:

Would love to get some help with this.

Best,
Max

On Wednesday, 18 February 2015 at 20:30:46 UTC+1, Max Charas wrote:
