I have a cluster with 1 dedicated master node, 1 client node, and 5 data nodes.
While running a particularly heavy aggregation, the cluster turned yellow and a few replica shards got stuck in
the Initializing state. After the aggregation was run again, a lot of replica shards (~1/6th) became unassigned.
Now I see that one of the data nodes has been assigned the bulk of the primary shards, and it is running out of disk space fast.
All data nodes have 500 GB disks; four of them have between 100 and 150 GB of disk space free, but the 5th has less than 20 GB.
How can I remedy this? Will moving shards away from this node help?
EDIT: I'm using Elasticsearch version 1.7.1
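To answer your question about moving shards: yes, you can relocate shards off the full node manually with the cluster reroute API, and it is worth checking whether disk-based allocation watermarks are doing their job. A minimal sketch, assuming the cluster is reachable at `localhost:9200`; the index and node names below are placeholders, not from your cluster:

```shell
# Show disk usage and shard counts per node
curl -s 'localhost:9200/_cat/allocation?v'

# Manually move one shard off the crowded node
# (replace index, shard number, and node names with your own)
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [
    {
      "move": {
        "index": "logstash-2015.10.01",
        "shard": 0,
        "from_node": "data-node-5",
        "to_node": "data-node-2"
      }
    }
  ]
}'
```

Note that disk-based allocation (the `cluster.routing.allocation.disk.watermark.low` / `.high` settings, 85% / 90% by default) should already stop Elasticsearch from assigning new shards to a node above the low watermark and start relocating shards away above the high one, so if the node filled up anyway that is worth investigating on its own.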
I would suggest you first check your _routing field.
There is a chance that your ids (by default the _routing field uses ids) are not well distributed, causing the routing to crowd certain nodes.
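One quick way to check whether routing is skewed is to compare document counts across shards of the same index; roughly even counts suggest routing is fine and the imbalance is in shard *placement* instead. A sketch, assuming a cluster at `localhost:9200`:

```shell
# List every shard with its doc count, store size, and host node;
# large doc-count differences between shards of one index hint at routing skew
curl -s 'localhost:9200/_cat/shards?v'

# Count how many shards landed on each node to spot placement imbalance
curl -s 'localhost:9200/_cat/shards' | awk '{print $NF}' | sort | uniq -c
```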
Elasticsearch auto-generates ids, so I doubt that's the case.
Do you have a parent-child relationship in your index?
No. It's a standard Logstash-populated index.
If so, you probably hit a bug (or were hit by one). See if this helps: https://github.com/elastic/elasticsearch/pull/14494
Thanks @Josh_J_Luo, I'll take a look.
The cluster recovered on its own, though. The problematic node went from almost running out of disk space (<1 GB) to having over 100 GB free.