Good morning everyone.
Last night, one of the machines hosting a node of my cluster shut down, so this morning I restarted it and waited for shard allocation. After waiting for a while, the restarted node had only about 10 shards allocated, and over 300 were unassigned.
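For context, this is the kind of Dev Tools query that shows the per-shard state and the unassigned reason (the column list is just what I find convenient, nothing specific to this problem):
GET _cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason&s=state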
This was not the first time this had happened, so I used this command in Dev Tools,
POST /_cluster/reroute?retry_failed=true
which has usually fixed this problem in the past. This time, though, the response was the following:
{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[elk_node-2][10.100.100.148:9300][cluster:admin/reroute]"
      }
    ],
    "type": "null_pointer_exception",
    "reason": null
  },
  "status": 500
}
I started looking through the logs of the nodes and found this on node 10.100.100.150:
org.elasticsearch.transport.RemoteTransportException: [ta_elk_node-2][10.195.194.148:9300][cluster:admin/reroute]
Caused by: java.lang.NullPointerException
at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.getExpectedShardSize(DiskThresholdDecider.java:421) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.allocateUnassigned(BalancedShardsAllocator.java:847) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.access$000(BalancedShardsAllocator.java:232) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.allocate(BalancedShardsAllocator.java:123) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:413) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:350) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.action.admin.cluster.reroute.TransportClusterRerouteAction$ClusterRerouteResponseAckedClusterStateUpdateTask.execute(TransportClusterRerouteAction.java:124) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:643) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:272) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:202) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:137) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) ~[elasticsearch-6.6.1.jar:6.6.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_171]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_171]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
I assumed it was a disk space problem, since we are currently a bit tight on space, so I deleted some old indices. This allowed the node that died overnight to allocate another 5 shards, bringing its total to 15, while the other nodes both have over 700, and about 300 shards still remain unassigned.
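In case the disk watermarks are part of the problem, per-node disk usage and the current watermark settings can be checked with something like the following (flat_settings just makes it easier to spot the cluster.routing.allocation.disk.watermark.* keys):
GET _cat/allocation?v
GET _cluster/settings?include_defaults=true&flat_settings=true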
Just to be clear, the node that died overnight and the node where I found the NullPointerException are two different nodes.
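(I suspect this is simply because the reroute itself runs on the elected master, judging from the MasterService frames in the stack trace, so the exception is reported from whichever node currently holds that role.) The current master can be checked with:
GET _cat/master?v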
I am at a loss about what to do; there seems to be no information online about this type of problem, or at least I was unable to find any.
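If it helps with the diagnosis, I can also post the output of an allocation explain for one of the unassigned shards, i.e. something along these lines (the index name and shard number below are just placeholders):
GET _cluster/allocation/explain
{
  "index": "my-unassigned-index",
  "shard": 0,
  "primary": true
}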