Hi -- I have a small ES cluster with two data nodes. Each data node is running one ES instance with multiple path.data entries, one for each data disk on that node (six for each node). One disk on one node failed, and I am trying to figure out a procedure for replacing it.
What I did:
- Set cluster.routing.allocation.enable to none.
- Set 'replicas: 0' on each index that had a shard on that disk (this might have been an unnecessary step).
- Shut down ES on the node with the failed disk.
- Removed the disk, replaced it with a new disk, partitioned, remounted, etc, identically to the failed disk.
- Restarted ES on that node.
- Set 'replicas: 1' for the affected indexes.
- Set cluster.routing.allocation.enable to all.
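For reference, the API side of the steps above looks roughly like the sketch below. It only builds the requests in order and does not send anything. It assumes the allocation setting is applied as a transient cluster setting (a persistent one would work the same way) and that 'replicas' means index.number_of_replicas; the function name and the index name in the usage line are made up for illustration.

```python
import json


def disk_replacement_requests(indices):
    """Return the (method, path, body) REST calls for the steps above, in order."""

    def allocation(mode):
        # Assumption: set as a transient cluster setting.
        return ("PUT", "/_cluster/settings",
                {"transient": {"cluster.routing.allocation.enable": mode}})

    def replicas(index, count):
        return ("PUT", "/%s/_settings" % index,
                {"index": {"number_of_replicas": count}})

    steps = [allocation("none")]
    steps += [replicas(name, 0) for name in indices]
    # Manual work happens here: stop ES, swap the disk, remount, restart ES.
    steps += [replicas(name, 1) for name in indices]
    steps.append(allocation("all"))
    return steps


if __name__ == "__main__":
    # "logs-2015.11" is a hypothetical index name.
    for method, path, body in disk_replacement_requests(["logs-2015.11"]):
        print(method, path, json.dumps(body))
```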
I expected that the replicas might be assigned to another disk at first, and that data would eventually go to the new disk. After two days the cluster is still allocating shards, and I am not yet sure whether any data will ever go to the replaced disk.
What I am hoping to get help with is:
- Is there a better way to replace a failed disk in ES than what I have described above?
- How can I speed up the shard allocation?
More details about my cluster:
- each data node has 96GB RAM, 24 cores, six data disks (600GB SAS, not SSD)
- running ES 2.0 on OpenJDK 126.96.36.199, on a CentOS 6.7 system.
- before you ask: I did not use RAID0 because I didn't want to have to recover 3.6TB if a disk failed (maybe that was not a great decision, since it seems that I am doing that anyway). I did not use any parity-based RAID or RAID10 because of space considerations, and because I expected multiple path.data entries to give me (roughly) the equivalent of RAID0.
- I have multiple indexes, each with one shard and one replica. There are about 2bn documents indexed, comprising about 1TB of data.
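To put rough numbers on the details above, here is a small back-of-the-envelope sketch. It assumes the "about 1TB" figure counts primary shards only (if it already includes replicas, halve the stored figures) and that data is spread evenly across nodes and disks, which is an idealization. The function name is made up for illustration.

```python
def sizing(nodes=2, disks_per_node=6, disk_gb=600, primary_gb=1000, replicas=1):
    """Rough capacity and usage figures for the cluster described above."""
    raw_per_node_gb = disks_per_node * disk_gb
    # Assumption: primary_gb excludes replica copies.
    stored_total_gb = primary_gb * (1 + replicas)
    # Assumption: data is spread evenly over nodes and disks.
    stored_per_node_gb = stored_total_gb / nodes
    stored_per_disk_gb = stored_per_node_gb / disks_per_node
    return {
        "raw_per_node_gb": raw_per_node_gb,
        "stored_total_gb": stored_total_gb,
        "stored_per_node_gb": stored_per_node_gb,
        "stored_per_disk_gb": stored_per_disk_gb,
    }


if __name__ == "__main__":
    for name, value in sizing().items():
        print("%s: %.0f" % (name, value))
```

With the defaults this gives 3600GB raw per node, which matches the 3.6TB mentioned above.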
Thanks for any insight.