How to replace a failed disk when using multiple data path entries?

(John Ouellette) #1

Hi -- I have a small ES cluster with two data nodes. Each data node runs one ES instance with multiple data path entries, one for each data disk on that node (six per node). One disk on one node failed, and I am trying to work out a procedure for replacing it.

What I did:

  1. Set cluster.routing.allocation.enable to none.
  2. Set 'replicas: 0' on each index on that disk (this might have been an unnecessary step).
  3. Shut down ES on the node with the failed disk.
  4. Removed the disk, replaced it with a new disk, partitioned, remounted, etc, identically to the failed disk.
  5. Restarted ES on that node.
  6. Set replicas:1 for the affected indexes.
  7. Set cluster.routing.allocation.enable: all
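
Steps 2 and 6 above map onto the index settings API (`PUT /<index>/_settings`). A minimal sketch, assuming the cluster is reachable over HTTP: `replica_request` is a hypothetical helper that only builds the request and sends nothing, and the base URL is a placeholder. The index name is the one that appears in the error log later in this thread.

```python
import json


def replica_request(index, replicas, base_url="http://localhost:9200"):
    """Build (method, url, body) to change number_of_replicas on one index.

    base_url is an assumed placeholder; nothing is sent to the cluster here.
    """
    if replicas < 0:
        raise ValueError("replicas must be >= 0")
    body = json.dumps({"index": {"number_of_replicas": replicas}})
    return "PUT", f"{base_url}/{index}/_settings", body


# Step 2 (drop replicas) and step 6 (restore them) for one affected index.
drop = replica_request("tomcat-2015.11", 0)
restore = replica_request("tomcat-2015.11", 1)
```

The returned tuple can be handed to any HTTP client; keeping the request construction separate makes it easy to review what will be sent before touching a degraded cluster.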

I expected the replicas to be assigned to another disk and data to reach the new disk eventually. My cluster is still allocating shards (after two days), and I am not yet sure whether data will ever go to the replaced disk.

What I am hoping to get help with is:

  • is there a better way to replace a failed disk in ES than what I have described above?
  • how can I speed up the shard allocation?

More details about my cluster:

  • each data node has 96GB RAM, 24 cores, six data disks (600GB SAS, not SSD)
  • running ES 2.0 with OpenJDK on a CentOS 6.7 system.
  • before you ask: I did not use RAID0 because I didn't want to have to recover 3.6TB if a single disk failed (maybe not a great decision, since it seems I am doing that anyway). I did not use parity-based RAID or RAID10 because of space considerations, and I expected multiple data path entries to give me (roughly) the equivalent of RAID0.
  • I have multiple indexes, each with one shard and one replica. There are about 2bn documents indexed, comprising about 1TB of data.

Thanks for any insight.
John Ouellette

(Vincent Tran) #2

You do not need to set "number_of_replicas" : 0. Assuming your cluster health is green with "number_of_replicas" : 1 currently, you can just:

  1. cluster.routing.allocation.enable => none
  2. shut down the ES node with the failed disk (at this stage your remaining node still holds every shard, as either a primary or a replica)
  3. Replace disk, re-partition and remount
  4. Restart ES.
  5. cluster.routing.allocation.enable => all
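
The allocation toggle in steps 1 and 5, plus the "health is green" precondition, can be sketched as follows. This is an illustration under assumptions: `allocation_body` and `safe_to_stop_node` are hypothetical helper names, the body is meant for `PUT /_cluster/settings`, and the health dict is assumed to be the parsed response of `GET /_cluster/health`. Nothing here contacts a cluster.

```python
import json


def allocation_body(mode):
    """JSON body for PUT /_cluster/settings that toggles shard allocation.

    Only the two values used in this thread are accepted by this sketch.
    """
    if mode not in ("all", "none"):
        raise ValueError("mode must be 'all' or 'none'")
    return json.dumps({"transient": {"cluster.routing.allocation.enable": mode}})


def safe_to_stop_node(health):
    """True only if the parsed cluster health response reports green."""
    return health.get("status") == "green"


# Step 1: disable allocation before stopping the node.
disable = allocation_body("none")
# Step 5: re-enable allocation once the node is back.
enable = allocation_body("all")
```

A transient setting is used here on the assumption that the cluster as a whole stays up while one node restarts; a persistent setting would be the alternative if a full cluster restart were involved.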

When the repaired node comes back online, there might be some minimal shard re-allocation, but it won't be as bad as you imagine. ES might try to rebalance the shards from the other 5 data.paths to the newly repaired 6th. But there won't be node-to-node shard reallocation.

(Jörg Prante) #3

RAID0 cannot be compared to multiple data paths.

I use RAID0 on a 1.8TB array on each node for hardware I/O speed: under RAID0, the speeds of the individual disks add up to the array speed until the controller bandwidth is saturated. This is not the case with multiple data paths.

Here, ES data is on a separate filesystem mount. If a disk fails, the node fails, and the node shutdown triggers a monitoring alert. That's OK because of the replicas: a replica level >= 1 ensures that all data remains available. In the meantime, the broken disk can be replaced and the node rebooted and restarted.

(John Ouellette) #4

My replacement disk is now being populated by ES, so everything is working, at least to that point. I think you're right that I didn't need to change the number of replicas for the indexes on the failed disk.

The one thing that does not seem to be working is that, a week later, the cluster still has not finished reallocating all of the shards! The data volume isn't huge: about 1TB in 2 billion documents, spread across about 20 indexes.

The problem seems to be with the larger indexes. Small indexes (less than about 30GB in size) were reinitialized and brought back to a green state pretty quickly; larger indexes (up to about 120GB) keep failing to reinitialize. The error I get (for one such index with 54GB of data and 140 million documents) is:

Caused by: RemoteTransportException[[muninn][][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [205] files with total size of [54.4gb]]; nested: ReceiveTimeoutTransportException[[huginn][0][internal:index/shard/recovery/prepare_translog] request_id [15086201] timed out after [900000ms]];
Caused by: [tomcat-2015.11][[tomcat-2015.11][0]] RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [205] files with total size of [54.4gb]]; nested: ReceiveTimeoutTransportException[[huginn][][internal:index/shard/recovery/prepare_translog] request_id [15086201] timed out after [900000ms]];
    at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(
    at org.elasticsearch.indices.recovery.RecoverySource.recover(
    at org.elasticsearch.indices.recovery.RecoverySource.access$200(
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(
    at java.util.concurrent.ThreadPoolExecutor.runWorker(
    at java.util.concurrent.ThreadPoolExecutor$
Caused by: [tomcat-2015.11][[tomcat-2015.11][0]] RecoverFilesRecoveryException[Failed to transfer [205] files with total size of [54.4gb]]; nested: ReceiveTimeoutTransportException[[huginn][][internal:index/shard/recovery/prepare_translog] request_id [15086201] timed out after [900000ms]];
    at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(
    at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(
    ... 9 more
Caused by: ReceiveTimeoutTransportException[[huginn][][internal:index/shard/recovery/prepare_translog] request_id [15086201] timed out after [900000ms
    at org.elasticsearch.transport.TransportService$
    ... 3 more

It looks as though the transfer of these larger indexes is failing after 900 seconds. Any idea how I can either increase this timeout or increase the rate at which this recovery proceeds?
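
For reference, recovery throughput in ES is governed by cluster-level `indices.recovery.*` settings. The sketch below builds a body for `PUT /_cluster/settings` that raises `indices.recovery.max_bytes_per_sec`. The timeout key `indices.recovery.internal_action_timeout` is included as an assumption: its default of 15 minutes would match the 900000ms in the trace, but both the name and whether it can be updated dynamically should be checked against the ES 2.0 documentation. `recovery_settings_body` is a hypothetical helper, and the values shown are illustrative, not recommendations.

```python
import json


def recovery_settings_body(max_bytes_per_sec="100mb", action_timeout=None):
    """JSON body for PUT /_cluster/settings adjusting shard recovery.

    max_bytes_per_sec: throttle for recovery traffic (illustrative value).
    action_timeout: optional value such as "30m"; the setting name used for
    it below is an assumption to verify for the ES version in use.
    """
    settings = {"indices.recovery.max_bytes_per_sec": max_bytes_per_sec}
    if action_timeout is not None:
        # Assumed setting name; confirm before applying to a real cluster.
        settings["indices.recovery.internal_action_timeout"] = action_timeout
    return json.dumps({"transient": settings})


# Raise the throttle only, or raise the throttle and the assumed timeout.
throttle_only = recovery_settings_body("100mb")
with_timeout = recovery_settings_body("100mb", "30m")
```

On spinning disks a higher throttle may simply move the bottleneck to disk I/O, so it is worth changing one setting at a time and watching whether recoveries complete.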

John Ouellette
