Replace failing disks on a single node

I have a situation where I need to replace disks that are failing on a
single node in my 4-node Elasticsearch cluster. As a result I'd like to
back up the Elasticsearch data on that node only, replace the disks, and then
restore the data to the new (empty) disks. I've tried shutting down the
node in question, but the remaining 3 nodes can only get to a "yellow"
state. I'm using 5 primary shards and 1 replica shard per index. I
considered using a snapshot for the single node, but it seems Elasticsearch
does not support snapshot and restore for a single node; it must be done
for the whole cluster.

Is it possible to just manually copy the data from the failing disk to
another disk, replace the failing disk, then copy the data back to the new
disk (stopping and restarting Elasticsearch before and after this whole
process, of course)?
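For reference, this is roughly how I've been checking the cluster while it's in this state. It's a minimal sketch, assuming any node is reachable at localhost:9200 and using the Python requests library (both are assumptions, not part of the setup above):

```python
import requests

ES = "http://localhost:9200"  # assumed: any reachable node

# Cluster-level view: status stays "yellow" while replicas that lived
# on the stopped node remain unassigned.
health = requests.get(f"{ES}/_cluster/health").json()
print(health["status"], "unassigned:", health["unassigned_shards"])

# Per-shard view: shows exactly which shard copies are UNASSIGNED.
shards = requests.get(f"{ES}/_cat/shards",
                      params={"h": "index,shard,prirep,state,node"})
print(shards.text)
```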

-- vic



Is your cluster still yellow?
It should go green at some point unless you have explicitly changed some settings.

If your cluster is not indexing anymore, you could manually copy the files in the data directory onto your new disk. But I wonder how you can copy from a failing disk?

I'd probably let Elasticsearch do it over the wire.
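A minimal sketch of the over-the-wire approach, using the allocation-exclusion cluster setting. The node addresses are hypothetical (localhost:9200 for a healthy node, 10.0.0.4 for the failing one):

```python
import requests

ES = "http://localhost:9200"   # assumed: any healthy node in the cluster
FAILING_IP = "10.0.0.4"        # hypothetical IP of the node with failing disks

# Ask the cluster to move every shard off the failing node; Elasticsearch
# relocates them to the other nodes over the network.
requests.put(f"{ES}/_cluster/settings", json={
    "transient": {"cluster.routing.allocation.exclude._ip": FAILING_IP}
})

# Watch /_cat/shards until nothing is left on that node, then stop it,
# replace the disks, bring it back, and clear the exclusion:
requests.put(f"{ES}/_cluster/settings", json={
    "transient": {"cluster.routing.allocation.exclude._ip": ""}
})
```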

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


The cluster goes to yellow fairly quickly but never reaches a green state.
If I knew that new replicas would be generated from the primaries when I
add fresh disks, I would just go ahead and replace the failing disks at
that point.

When I say "failing" disks, I mean the indicator lights on the disks in the
system chassis show that they are exhibiting errors. I can see that this
affects the ingestion rate of the cluster, so I want to replace them before
they fail completely. I have had this happen before with another system:
when disks start to go bad, Elasticsearch has trouble getting the cluster
status from the node with the failing disk and slows down to a crawl. It is
best to replace disks before they fail completely when Elasticsearch is
involved.

Anyhow, I think the Elasticsearch dev folks should think about this failure
scenario. It would be great if they added the capability to snapshot a
single node after disabling shard reallocation. As it stands now, replacing
a failing or failed disk in a node is a troublesome prospect.


Disk issues are not really something Elasticsearch should have to worry
about; you should either run redundancy at the physical layer or accept
that situations like this will occur if you don't.

If you remove the node and the cluster is yellow, then just replace the
disk. Yellow indicates unallocated replica shards, which means your primary
shards are still OK. You can confirm this using the _cat API or a visual
tool like kopf.
Then, when you add the node back, the cluster will rebalance and you should
reach green status again.
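For example, a quick check along those lines with the _cat API (a sketch assuming a node at localhost:9200; the columns are the standard _cat/shards headers):

```python
import requests

ES = "http://localhost:9200"  # assumed address of a reachable node

# One line per shard copy: index, shard number, p(rimary)/r(eplica),
# state, and the node holding it.
resp = requests.get(f"{ES}/_cat/shards",
                    params={"h": "index,shard,prirep,state,node"})

# In a yellow cluster every primary ("p") should still be STARTED;
# only replicas ("r") should show up as UNASSIGNED.
for line in resp.text.splitlines():
    index, shard, prirep, state = line.split()[:4]
    if prirep == "p" and state != "STARTED":
        print("primary not started:", line)
```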

The node snapshot does sound interesting though and might be useful; if
you want this functionality, it'd be worth creating a GitHub issue with
the request.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com
