Hello.
I manage a cluster of 3 data nodes and 1 master.
Today the master crashed with several exceptions:
- a network issue ("fatal error on the network layer").
- a StackOverflow: null
I was bulk inserting a few hundred documents at most.
Upon restarting the master, I saw that all the shards in the cluster had become unassigned!
There are 24k shards in total.
The unassigned.reason reported for them is "CLUSTER_RECOVERED".
I don't see why this would happen.
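For anyone wanting to check the same thing, here is a minimal sketch of how the unassigned reasons could be tallied. It assumes the text was fetched from `_cat/shards?h=index,shard,prirep,state,unassigned.reason` (no header row); the function name and the sample rows are made up for illustration and are not taken from my cluster.

```python
from collections import Counter


def tally_unassigned_reasons(cat_shards_text):
    """Count unassigned shards per reason from _cat/shards text output.

    Assumes the columns index, shard, prirep, state, unassigned.reason,
    in that order, separated by whitespace.
    """
    reasons = Counter()
    for line in cat_shards_text.splitlines():
        fields = line.split()
        # Assigned shards have an empty reason column, so they yield 4 fields.
        if len(fields) >= 5 and fields[3] == "UNASSIGNED":
            reasons[fields[4]] += 1
    return reasons


# Made-up sample rows, for illustration only.
sample = """\
logs-2017.11.01 0 p UNASSIGNED CLUSTER_RECOVERED
logs-2017.11.01 0 r UNASSIGNED CLUSTER_RECOVERED
logs-2017.11.02 0 p STARTED
"""
print(tally_unassigned_reasons(sample))
```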
After the restart there is a ClusterBlockException: "blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized]".
And then an ElasticsearchTimeoutException "Timeout waiting for task".
So I left the cluster to reassign the shards, but the operation is really slow.
And it seems to get slower and slower over time.
The log shows a JvmGcMonitorService message about once a minute, reporting roughly 300 ms spent collecting over the last second.
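To put a number on that, here is a rough sketch that turns such a message into a GC overhead fraction. The message shape ("spent [...] collecting in the last [...]") is an assumption about how the overhead line is worded, the helper name is hypothetical, and the sample line is rebuilt from the figures above rather than copied from my log.

```python
import re

# Assumed wording of the overhead message; adjust if your log differs.
OVERHEAD_RE = re.compile(
    r"spent \[([\d.]+)(ms|s|m)\] collecting in the last \[([\d.]+)(ms|s|m)\]"
)
# Conversion factors from each unit to milliseconds.
UNIT_MS = {"ms": 1.0, "s": 1000.0, "m": 60000.0}


def gc_overhead_fraction(log_line):
    """Return time spent in GC divided by the observation window, or None."""
    match = OVERHEAD_RE.search(log_line)
    if match is None:
        return None
    spent_ms = float(match.group(1)) * UNIT_MS[match.group(2)]
    window_ms = float(match.group(3)) * UNIT_MS[match.group(4)]
    return spent_ms / window_ms


# Illustrative line built from the numbers in this post.
line = "[gc][1] overhead, spent [300ms] collecting in the last [1s]"
print(gc_overhead_fraction(line))
```

With 300 ms out of every second, the master would be spending about 30% of its time in garbage collection.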
I tried restarting the master, hoping that whatever was hogging memory would go away.
But then the number of unassigned shards went back to 24k!
Is this normal? I would have thought that the assignments already made would persist.
Right now there are still 1k shards to assign, but I have no idea when it will finish.
There is less than one shard allocation per second.
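As a back-of-the-envelope estimate, here is a tiny sketch of the remaining time. The 1.0 shards/s figure is the upper bound stated above; the 0.5 shards/s figure is a hypothetical slower rate, included only because the rate keeps dropping.

```python
def eta_seconds(remaining_shards, shards_per_second):
    """Estimate the time left, assuming the allocation rate stays constant."""
    if shards_per_second <= 0:
        raise ValueError("shards_per_second must be positive")
    return remaining_shards / shards_per_second


remaining = 1000  # about 1k shards still unassigned
for rate in (1.0, 0.5):  # 0.5 is a hypothetical slower rate
    minutes = eta_seconds(remaining, rate) / 60
    print(f"{rate} shards/s -> about {minutes:.0f} min")
```

Even at a full shard per second, 1k shards would take about 17 minutes, so that is a lower bound, and a falling rate makes the real figure worse.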
And I am worried that on the next restart the issue will arise again.
I am running ES 5.6.3 on Oracle JDK 8.
The JVM heap (Xmx) is set to 31 GB, which I think is more than enough.
The CPU/memory of the machine is enough too.
Does anyone have a clue what is happening?