Node is in cluster but shards are unassigned

We had a heap memory limit exceeded error today. I will detail what happened in chronological order.

Information:
Elasticsearch version: 5.2
Number of nodes: 5
Total shards: 37
Total primaries: 16
Total replicas: 21
Number of indices: 4

  1. A heap memory exceeded error occurred.
  2. We restarted only the Elasticsearch service on the affected node.
  3. We saw that 3 shards on that node were left unassigned; after some time we restarted the node itself.
  4. All 3 shards on that node are still in unassigned status (one way to check this is shown below).
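
For reference, shard state and the reason a shard is unassigned can be checked with the cat shards API (the column selection here is only a suggestion):

GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason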

Things tried:

  1. PUT _cluster/settings
     {
       "transient": {
         "cluster.routing.allocation.enable": "all"
       }
     }
     I set this to "none", restarted the Elasticsearch service, and changed it back to "all" again. It did not help.

I did not find any log entries for this issue. Please suggest if there is another place to look.
Please help, as this is a production environment.

Hi Kumar,
I would suggest making use of the new allocation explain API: https://www.elastic.co/guide/en/elasticsearch/reference/5.2/cluster-allocation-explain.html
It will give you detailed information about why your shards are unassigned.
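
For example (the index name, shard number, and primary flag below are placeholders for one of your unassigned shards; with an empty body the API explains the first unassigned shard it finds):

GET _cluster/allocation/explain
{
  "index": "my_index",
  "shard": 0,
  "primary": false
}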

If the output does not really point you in a good direction, you can try to set the replicas for this index to 0 and then increase them to the desired amount afterwards. This will trigger the creation of new replicas and resolve any issue present on the existing shards.
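
A minimal sketch of that, assuming a hypothetical index named my_index and a target of 2 replicas:

PUT /my_index/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}

and, once the cluster is green again:

PUT /my_index/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}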

Note that this affects the whole index and not just the problematic shards.

Another thing I just noticed is that your shards are fairly big. This is most likely not related to your issue, but we generally recommend keeping shards between 10 and 80GB (quite a big range, I know - it heavily depends on what you are doing).

Let me know if this works for you,
Luca

Thanks for the prompt reply.

This reply helps. Later in the evening (IST), my team members and I decided to set replicas to 1.
When the reallocation among the nodes had completed, I set replicas to 2 for my major index.

Now the shards have been in "INITIALIZING" status for the last hour and are not moving from there. Can you suggest how much time it will take to recover completely?

Is there anything else I can try?

One of the reasons we generally recommend a maximum shard size of around 50GB is that recovery of very large shards, depending on cluster settings and network performance, can be quite slow. I suspect that is what you are seeing here.

Sorry for taking a while to get back to you - timezones are hard 🙂

Initializing is a good thing - at least if it's not stuck. For shards of your size this can take a while: for each shard, Elasticsearch has to copy >200GB of data across the network and write it to disk on another node. That will take a while to finish.
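
As an aside, the copy speed is capped by the indices.recovery.max_bytes_per_sec setting (40mb per second by default). You can raise it if your network and disks have headroom; the value below is only an example:

PUT /_cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}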

By now I would expect things to be done. If not, here are two more things you can do.

I can highly recommend using X-Pack monitoring to see what is happening in your cluster. It is part of the basic license and therefore free to use: https://www.elastic.co/guide/en/x-pack/current/xpack-monitoring.html
Here are a few screenshots: https://www.elastic.co/guide/en/x-pack/current/monitoring-details.html
Once installed into both Elasticsearch and Kibana, you can view shard activity in the overview tab.
Unfortunately this requires you to install the plugin and restart Elasticsearch.

The plugin is what we would really recommend, but it would make things even worse in this case, as you would have to restart things.
To get the same information you can use the cat recovery API: https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-recovery.html
It will show you how far along Elasticsearch is with copying files.
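
For example:

GET _cat/recovery?v

The files_percent and bytes_percent columns show how far along each copy is, and stage reads done once a shard has finished.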

Yes, the initializing is over, but a weird situation has arisen that has confused us for the last 36 hours.

The shards have kept relocating back and forth between two nodes (Node 3 and Node 4).

A screenshot taken at 1:30 PM (IST) showed shards relocating from node 4 to node 3, and a screenshot taken at 11:30 PM (IST) showed shards relocating from node 3 to node 4.

I have checked the free space on the nodes; each of them has 50 percent free. Can you suggest something? I am searching on Google as well, but hope to get a master stroke from you folks pretty soon.
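
For reference, per-node disk usage can also be checked with the cat allocation API, which lists the shard count and disk.used / disk.avail figures for each node:

GET _cat/allocation?v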

That is a bit odd. I would expect things to get back to normal on their own.

While I don't usually recommend that customers do this, you can take a look at the following settings:
https://www.elastic.co/guide/en/elasticsearch/reference/current/shards-allocation.html#_shard_balancing_heuristics

These influence how aggressively Elasticsearch rebalances shards. Especially with shards as large as yours, these settings can help.
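
For example, raising the rebalancing threshold (1.0 by default) makes Elasticsearch less eager to move shards back and forth; the value below is only an illustration:

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.balance.threshold": 3.0
  }
}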

You have been really helpful. That link did help me. I changed the cluster settings as follows:

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}

So once the in-progress allocation was done, no new relocations started.

After this I forced a reroute of a particular shard from node 4 to node 3:

POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "matpa", "shard": 2,
        "from_node": "node-004", "to_node": "node-003"
      }
    }
  ]
}

Once this process was over, I set allocation back to "all":

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}
