How to handle system failures in an Elasticsearch cluster

Hi,

Yesterday, the hard disks on one of our nodes went bad and we had to bring down
the physical machine, which was also running another 2 Elasticsearch nodes.
We have hourly indices with replication 2 and 50 shards per index. Each
shard is currently 5 - 6 GB in size. It has been more than 24 hours and the
cluster is still trying to assign the unassigned shards. While the cluster is
in RED status our search is broken. Any recommendations on how to handle such
situations?

Darsh

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/85a4986e-e5bb-403f-95f0-80cd4be8287e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

How many nodes did/do you have? What do your logs show?

You should look at using shard allocation awareness (rack awareness)
if you are running multiple nodes per physical machine.


Hi Mark,

Thank you for your reply. Here is our cluster info:

40 physical machines with 200 GB RAM each.
Each machine runs 3 ES data nodes with 30 GB RAM each, so 120 data nodes in total.
5 dedicated master nodes.
We are using 32 RAID and 22 RAID on each physical machine.
I didn't find much in the logs other than entries related to initializing shards.
We do have cluster.routing.allocation.same_shard.host: true but nothing
related to rack awareness. That is something we will look into.

Thanks

Darsh
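For readers following along, here is a minimal sketch of what rack (or host) awareness could look like for the layout described above, with 3 data nodes on each physical machine. It only generates the settings each node would carry. The attribute name `host_id`, the node naming scheme and the machine name are made up for illustration. The `node.<attribute>` form is the 1.x-style way of declaring a custom node attribute; newer releases use `node.attr.<attribute>` instead.

```python
# Sketch only: build per-node settings so that all nodes on one physical
# machine share an awareness attribute value.
# Assumptions: "host_id", the node names and "machine-01" are hypothetical.

def awareness_settings(host_name, nodes_per_host=3, attribute="host_id"):
    """Return one settings dict per Elasticsearch node on the given machine."""
    settings = []
    for i in range(nodes_per_host):
        settings.append({
            "node.name": f"{host_name}-data-{i + 1}",
            # Custom node attribute: identical for every node on this machine.
            f"node.{attribute}": host_name,
            # Ask the allocator to spread copies of a shard across attribute values.
            "cluster.routing.allocation.awareness.attributes": attribute,
        })
    return settings


# Print the settings in elasticsearch.yml style for one machine.
for node in awareness_settings("machine-01"):
    for key, value in node.items():
        print(f"{key}: {value}")
    print()
```

With settings along these lines, the primary and replicas of a shard should not all end up on nodes that share one machine, so losing a single machine would be less likely to turn the cluster red.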


What version of ES and Java are you on?
Is your cluster still red? Check the _cat/allocation, _cat/indices and
_cat/recovery endpoints for info on the status of things.
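A small sketch of how those three endpoints could be polled from a script. The base URL `http://localhost:9200` is an assumption, so point it at any node that serves HTTP in your cluster. The `?v` parameter asks the cat API to include a header row.

```python
import urllib.request

# The three cat endpoints mentioned above.
CAT_ENDPOINTS = ("allocation", "indices", "recovery")


def cat_urls(base_url="http://localhost:9200"):
    """Build the _cat URLs for a node; base_url is an assumed address."""
    base = base_url.rstrip("/")
    return {name: f"{base}/_cat/{name}?v" for name in CAT_ENDPOINTS}


def fetch(url, timeout=10):
    """Return the plain-text response of a cat endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Show which URLs would be queried (no network access needed for this part).
for name, url in cat_urls().items():
    print(name, "->", url)
```

Calling `fetch(url)` on each of these against a live cluster returns the same tables you would get from curl: disk usage and shard counts per node, health per index, and the progress of each shard recovery.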


We are using ES version 1.4.1. The cluster is in green status now.

I think these settings, as you pointed out, will help. Since our cluster is
huge, I think the default of 2 will be very slow.

cluster.routing.allocation.cluster_concurrent_rebalance
cluster.routing.allocation.node_concurrent_recoveries

Thanks

Darsh
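As an illustration of raising the two settings named above, the sketch below builds a request body for the cluster update settings API. The values 8 and 4 are placeholders and not recommendations. Whether both settings can be changed at runtime on a given version is an assumption to verify against the documentation for that release.

```python
import json


def recovery_settings_body(concurrent_rebalance, node_concurrent_recoveries,
                           persistent=False):
    """Build a body for the cluster update settings API.

    Transient settings are dropped after a full cluster restart;
    persistent settings survive it.
    """
    scope = "persistent" if persistent else "transient"
    return {
        scope: {
            "cluster.routing.allocation.cluster_concurrent_rebalance":
                concurrent_rebalance,
            "cluster.routing.allocation.node_concurrent_recoveries":
                node_concurrent_recoveries,
        }
    }


# Placeholder values, chosen only to differ from the default of 2.
body = recovery_settings_body(concurrent_rebalance=8,
                              node_concurrent_recoveries=4)
print(json.dumps(body, indent=2))
```

The printed JSON would be sent with a PUT request to `/_cluster/settings`. Higher values let more shards move or recover in parallel, at the cost of extra disk and network load on every node, so it is probably safer to raise them in small steps while watching `_cat/recovery`.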
