How to handle system failures in Elasticsearch cluster

ddp · January 13, 2015, 9:22pm

Hi,

Yesterday, the hard disks on one of our nodes went bad, and we had to bring
down the physical machine, which was also running another 2 Elasticsearch nodes.
We have hourly indices with replication set to 2 and 50 shards per index. Each
shard is currently 5 - 6 GB in size. It has been more than 24 hours and the
cluster is still trying to assign the unassigned shards. While the cluster is in
RED status our search is broken. Any recommendations on how to handle such situations?

Darsh

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/85a4986e-e5bb-403f-95f0-80cd4be8287e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

warkolm · January 13, 2015, 10:12pm

How many nodes did/do you have? What do your logs show?

You should look at using the cluster-level shard allocation settings (such as
allocation awareness) if you are running multiple nodes per physical machine.

ddp · January 13, 2015, 11:51pm

Hi Mark,

Thank you for your reply. Here is our cluster info

40 Physical machines with 200 GB RAM
Each machine has 3 data nodes of ES with 30 GB RAM so total 120 data nodes.
5 dedicated master nodes.
We are using 32 RAID and 22 RAID on each physical machine,
I didn't find much in the logs other than entries related to initializing shards.
We do have cluster.routing.allocation.same_shard.host: true but nothing
related to rack awareness. That is something we will look into.
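
To illustrate why host-level placement matters when several nodes share one physical machine, here is a minimal Python sketch. The node names, host names, and shard layout are all hypothetical and are not taken from this cluster. It only models the idea behind settings like cluster.routing.allocation.same_shard.host: if every copy of a shard sits on nodes of the same machine, losing that machine loses the shard.

```python
from collections import defaultdict


def hosts_per_shard(assignment, node_to_host):
    """Map each shard to the set of physical hosts holding one of its copies."""
    hosts = defaultdict(set)
    for shard, nodes in assignment.items():
        for node in nodes:
            hosts[shard].add(node_to_host[node])
    return dict(hosts)


def shards_lost_with_host(assignment, node_to_host, failed_host):
    """Return the shards whose every copy lives on the failed host."""
    per_shard = hosts_per_shard(assignment, node_to_host)
    return sorted(s for s, hosts in per_shard.items() if hosts == {failed_host})


# Hypothetical layout: two machines, three Elasticsearch nodes on each.
node_to_host = {
    "n1": "hostA", "n2": "hostA", "n3": "hostA",
    "n4": "hostB", "n5": "hostB", "n6": "hostB",
}

# One primary plus two replicas per shard (three copies in total).
assignment = {
    "shard0": ["n1", "n2", "n3"],  # all copies on hostA: lost if hostA dies
    "shard1": ["n1", "n4", "n5"],  # spread over both hosts: survives
}

lost = shards_lost_with_host(assignment, node_to_host, "hostA")
print(lost)
```

In this made-up layout only shard0 would go missing when hostA fails, which is the kind of placement that host-aware allocation is meant to avoid.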

--
Thanks

Darsh

warkolm · January 14, 2015, 12:58am

What versions of ES and Java are you on?
Is your cluster still red? Check the _cat/allocation, _cat/indices and
_cat/recovery endpoints for info on the status of things.
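
As a rough sketch of how those three endpoints could be polled from a script, the Python below builds the _cat URLs and fetches them with the standard library. The base URL (localhost:9200) and the helper names are assumptions for illustration; point it at any node in your cluster.

```python
import urllib.request

# Assumed address of a node in the cluster; adjust for your environment.
BASE_URL = "http://localhost:9200"

# The _cat endpoints mentioned above.
CAT_ENDPOINTS = ("allocation", "indices", "recovery")


def cat_url(endpoint, base_url=BASE_URL):
    """Build a _cat URL; the ?v flag asks for a header row in the output."""
    return f"{base_url.rstrip('/')}/_cat/{endpoint}?v"


def fetch_cat(endpoint, base_url=BASE_URL, timeout=10):
    """Fetch the plain-text table returned by one _cat endpoint."""
    with urllib.request.urlopen(cat_url(endpoint, base_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def print_cluster_status(base_url=BASE_URL):
    """Print all three tables one after another (needs a reachable cluster)."""
    for name in CAT_ENDPOINTS:
        print(f"== _cat/{name} ==")
        print(fetch_cat(name, base_url))
```

Calling print_cluster_status() against a live node prints the three tables in turn; nothing is contacted until that function is called.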

ddp · January 14, 2015, 1:31am

We are using ES version 1.4.1. The cluster is in green status now.

I think these settings, as you pointed out, will help. Since our cluster is
large, I think the default of 2 will be very slow:

cluster.routing.allocation.cluster_concurrent_rebalance
cluster.routing.allocation.node_concurrent_recoveries
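
A minimal sketch of how these two settings might be raised through the cluster update settings API, assuming both can be changed dynamically on your version. The base URL, the helper names, and the value 4 are placeholders for illustration, not recommendations; suitable values depend on disk and network capacity.

```python
import json
import urllib.request

# Assumed address of a node in the cluster; adjust for your environment.
BASE_URL = "http://localhost:9200"


def recovery_settings_body(concurrent_rebalance, node_recoveries):
    """Build a transient cluster settings body for the two settings above."""
    return {
        "transient": {
            "cluster.routing.allocation.cluster_concurrent_rebalance": concurrent_rebalance,
            "cluster.routing.allocation.node_concurrent_recoveries": node_recoveries,
        }
    }


def build_request(body, base_url=BASE_URL):
    """Prepare (but do not send) the PUT request to _cluster/settings."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def apply_settings(body, base_url=BASE_URL, timeout=10):
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(body, base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Illustrative values only.
body = recovery_settings_body(concurrent_rebalance=4, node_recoveries=4)
print(json.dumps(body, indent=2))
```

Transient settings are used here so that the change does not survive a full cluster restart; apply_settings(body) would send it to a live cluster.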

--
Thanks

Darsh
