We had a heap memory limit exceeded error today. I will detail it in chronological order.
Information:
Elasticsearch version: 5.2
Number of nodes: 5
Total shards: 37
Total primaries: 16
Total replicas: 21
Number of indices: 4
Heap memory exceeded error.
We restarted only the Elasticsearch service on that particular node.
We saw that 3 shards on that node were left unassigned; after some time we restarted the node itself.
All 3 shards on that node are still in unassigned status.
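In case it is useful, this is roughly how we have been checking the shard states and asking the cluster why a shard is unassigned (a sketch; the index name, shard number, and primary flag below are placeholders for the affected shard):

GET _cat/shards?v

GET _cluster/allocation/explain
{
  "index": "my_index",
  "shard": 0,
  "primary": true
}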
Things tried:
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}
I set this to "none", restarted the Elasticsearch service, and changed it back to "all" again. It did not help.
I did not find any logs related to this issue. Please suggest another place to look.
Please help, as this is a production environment.
If the output does not really point you in a good direction, you can try to set the replicas for this index to 0 and then increase them to the desired amount afterwards. This will trigger the creation of new replicas and solve any issue that is present on the existing shards.
Note that this affects the whole index, not just the problematic shards.
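Roughly like this, assuming the index is called my_index and you eventually want 2 replicas (adjust both to your setup):

PUT my_index/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}

and then, once the cluster is green again:

PUT my_index/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}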
Another thing that I just noticed is that your shards are fairly big. This is most likely not related to your issues, but we generally recommend shards between 10 and 80GB (quite a big range, I know - it heavily depends on what you are doing).
This reply helps. Later in my evening (IST), my team members and I decided to set replicas to 1.
When the reallocation among the nodes completed, I set replicas to 2 for my major index.
Now the shards have been in "INITIALIZING" status for the last hour and are not moving from there. Can you suggest how much time it will take to recover completely?
One of the reasons we generally recommend a maximum shard size of around 50GB is that recovery of very large shards, depending on cluster settings and network performance, can be quite slow. I suspect that is what you are seeing here.
Sorry for taking a while to get back to you - timezones are hard
Initializing is a good thing. At least if it's not stuck. For shards of your size this can take a while. For each shard it has to copy >200GB of data across the network and write it to the disk on another node. That will take a while to finish.
By now I would expect things to be done. If not, here are two more things you can do.
The plugin is what we really recommend, but it would make things even worse in this case, as you would have to restart things.
To get the same information, you can use the cat recovery API https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-recovery.html
It will show you how far along Elasticsearch is with copying the files.
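For example:

GET _cat/recovery?v

The bytes_percent and files_percent columns in the output should show how far each shard copy has progressed.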
I have checked the free space on the nodes; each of them has 50 percent free. Can you suggest something? I am searching on Google as well. Hope to get a master stroke from you folks pretty soon.
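For reference, the cat allocation API is one way to see the disk use and shard count per node:

GET _cat/allocation?v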