A little about the setup -
I have a three-node (m4.large) cluster, each node with an attached EBS volume (latency is not a big problem at present). The cluster seems to be pretty stable. A couple of months ago, I needed additional space on the nodes, so I upgraded the fleet with more EBS space, doing a rolling restart. Everything worked out perfectly.
Now, I am trying to benchmark local storage against EBS, among other things. For this I created a replica of the production cluster from AMIs of the individual nodes (I wanted to benchmark against the current volume of data in the production cluster), moved the replica nodes to a new security group, and updated the config file appropriately (the cluster name, etc.). This new cluster comes up easily and works as expected.
Then, I provisioned m3.large instances with enough SSD and added one new node to the cluster. This is where the problem arises - a few shards start relocating, but they never seem to complete the relocation to the new node. Although the new node joins the cluster seamlessly, it just doesn't get any data.
Things I have verified -
- Identical config files on all the nodes.
- New node has permissions to write in the path.data location.
- All nodes are on the same ES version - 2.0.0.
- Neither the master nor the new node throws an exception when the new node joins the cluster.
- I was able to telnet/nmap from one node to another (from the origin-relocation node to the destination-relocation node, and vice-versa), on 9200/9300.
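For anyone who wants to repeat the connectivity check above without telnet/nmap, here is a rough Python sketch of the same idea. It only tests that a TCP connection can be opened on 9200/9300; the helper names `port_open` and `check_nodes` are made up for illustration, and any host addresses passed in are placeholders rather than my actual nodes.

```python
import socket

# HTTP (9200) and transport (9300) ports mentioned in the list above.
ES_PORTS = (9200, 9300)


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_nodes(hosts, ports=ES_PORTS, timeout=3.0):
    """Map each (host, port) pair to whether it accepted a TCP connection."""
    return {(h, p): port_open(h, p, timeout) for h in hosts for p in ports}
```

Running `check_nodes([...])` with the other nodes' addresses from each node in turn covers both directions. A successful connection only shows the port is reachable, not that Elasticsearch itself is healthy on it.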
Random (some seemingly ridiculous) things I have tried, but without success -
- Provisioning a new m4.large instance instead of m3.large (in case the instance type was the problem).
- Manually copying the shards that were trying to relocate onto the new node (in case the network transfer was the problem, maybe it was taking too long).
- Deleting the index in question (maybe it was corrupted?), so that the relocation process jumps to some other random index.
- Restarting the new node (duh!).
- Removing the data and logs directories (in case it was a problem with permissions).
- Bringing down one of the old nodes (maybe the cluster was limited to three nodes only).
- Allowing complete access in that security group.
Maybe this could help with debugging -
- I could see '/nodes/0/node.lock' getting created in the path.data location.
- Log -
[2016-01-06 05:40:21,843][DEBUG][action.admin.indices.recovery] [GBShaw] [indices:monitor/recovery] failed to execute operation for shard [[XXX], node[_rZZiw4JRYefK8ID2JJrZA], relocating [4eCSmBjfSk2WMzUo2yuZhg], [P], v, s[INITIALIZING], a[id=rKyG1GqHRMmgkEpcKIN1TQ, rId=V81PGknmS7iFQGkdADW2nQ], expected_shard_size]
[XXX][[XXX]] BroadcastShardOperationFailedException[operation indices:monitor/recovery failed]; nested: IndexNotFoundException[no such index];
Caused by: [XXX] IndexNotFoundException[no such index]
... 7 more
- It has been in this state for more than 12 hours!
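To narrow down which shards are stuck, the plain-text output of `GET _cat/shards` can be filtered for anything that is not STARTED. Below is a minimal Python sketch of that filtering, assuming the default column order (index, shard, prirep, state, docs, store, ip, node) and no header row. The function names and the `SAMPLE` text are made up for illustration; `SAMPLE` is not output from my cluster.

```python
# Hypothetical sample in the shape of `GET _cat/shards` output (no ?v header).
SAMPLE = """\
logs 0 p STARTED 3014 31.1mb 10.0.0.11 node-a
logs 1 p RELOCATING 2900 29.8mb 10.0.0.12 node-b -> 10.0.0.14 node-d
logs 2 r UNASSIGNED
"""


def parse_cat_shards(text):
    """Parse _cat/shards plain-text output into a list of dicts.

    Only the first four columns are split out; the trailing columns
    (docs, store, ip, node) can be empty or vary, so they are kept raw.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        rows.append({
            "index": parts[0],
            "shard": parts[1],
            "prirep": parts[2],
            "state": parts[3],
            "rest": " ".join(parts[4:]),
        })
    return rows


def not_started(rows):
    """Return shards whose state is anything other than STARTED."""
    return [r for r in rows if r["state"] != "STARTED"]
```

Calling `not_started(parse_cat_shards(text))` on the real output lists the shards that are still INITIALIZING, RELOCATING or UNASSIGNED, which can then be compared against the shard named in the log above.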
I have been stuck on this for a day now, and can't seem to work it out. I would appreciate any pointers to help me understand what could be the issue.
Please let me know if I can provide you with any more information. Again, any help will be appreciated!