Very slow cluster, cannot find "_cluster" index?

Hi, we have a cluster of 34 nodes running Elasticsearch version 2.2. The cluster has started to behave very strangely: shard allocation is very slow and there are "URGENT" pending tasks that have been in the queue for over 8 hours. GET _cluster/allocation/explain reports that the root cause is that the "_cluster" index is not found.
{
  "error": {
    "root_cause": [
      {
        "type": "index_not_found_exception",
        "reason": "no such index",
        "resource.type": "index_expression",
        "resource.id": "_cluster",
        "index": "_cluster"
      }
    ],
    "type": "index_not_found_exception",
    "reason": "no such index",
    "resource.type": "index_expression",
    "resource.id": "_cluster",
    "index": "_cluster"
  },
  "status": 404
}
Could you give some pointers on how to interpret the fact that the "_cluster" index is not found? The logs from the master node say that processing of some events times out from time to time, or that shards cannot be reallocated to some nodes because of the disk threshold. We would prefer to avoid restarting the whole cluster, if possible.
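For anyone triaging a similar backlog, here is a minimal sketch of how the output of GET _cluster/pending_tasks could be filtered for long-waiting tasks. It assumes the response has already been parsed into a dict with a "tasks" list whose entries carry "priority", "source" and "time_in_queue_millis". The helper name `long_waiting_tasks` and the sample entries are hypothetical, made up for illustration, not taken from this cluster.

```python
def long_waiting_tasks(pending, min_wait_ms):
    """Return tasks that have waited at least min_wait_ms, longest wait first."""
    tasks = pending.get("tasks", [])
    waiting = [t for t in tasks if t.get("time_in_queue_millis", 0) >= min_wait_ms]
    return sorted(waiting, key=lambda t: t["time_in_queue_millis"], reverse=True)


# Illustrative sample only; real values come from GET _cluster/pending_tasks.
sample = {
    "tasks": [
        {"priority": "URGENT", "source": "shard-started", "time_in_queue_millis": 30_000_000},
        {"priority": "HIGH", "source": "refresh-mapping", "time_in_queue_millis": 1_200},
        {"priority": "URGENT", "source": "shard-failed", "time_in_queue_millis": 29_500_000},
    ]
}

EIGHT_HOURS_MS = 8 * 60 * 60 * 1000  # 28,800,000 ms

for task in long_waiting_tasks(sample, EIGHT_HOURS_MS):
    hours = task["time_in_queue_millis"] / 3_600_000
    print(f'{task["priority"]:7} {task["source"]:16} {hours:.1f} h')
```

Grouping the result by "source" is one way to see whether a single kind of cluster-state update is responsible for most of the queue.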

The Cluster Allocation Explain API was released with Elasticsearch 5.0 (see this link) so the command you tried running is not working in your 2.2 cluster.

Sorry, I have zero experience with this, and Stack Overflow said that version.number is the version number. Here's more info.
"version": {
  "number": "2.2.2",
  "build_hash": "fcc01dd81f4de6b2852888450ce5a56436fd5852",
  "build_timestamp": "2016-03-29T08:49:35Z",
  "build_snapshot": false,
  "lucene_version": "5.4.1"
}

Strange, then you should be able to run the command.

Testing the same call in my 5.6.4 cluster I get an exception too:

user@node-04:~$ curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"
{
  "error" : {
    "root_cause" : [
      {
        "type" : "remote_transport_exception",
        "reason" : "[master2][10.xx.xx.xx:9xxx][cluster:monitor/allocation/explain]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
  },
  "status" : 400
}

This is because there are no unassigned shards in my cluster, but in your situation it should produce a sensible result, and I don't understand why it tries to interpret "_cluster" as an index name. That is very strange.

Wait a minute. It's Lucene version 5.4.1. So you are running Elasticsearch version 2.2.2.

Then my initial reply still stands, you can't use the Cluster Allocation Explain API prior to Elasticsearch version 5.0.
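To avoid mixing up the two numbers again, the check can be made against "number" rather than "lucene_version" in the version block. A minimal sketch, assuming a plain numeric version string such as "2.2.2" (suffixes like "-alpha1" are not handled); the helper names `parse_version` and `supports_allocation_explain` are hypothetical:

```python
from typing import Tuple


def parse_version(number: str) -> Tuple[int, ...]:
    """Turn a version string such as "2.2.2" into a comparable tuple (2, 2, 2)."""
    return tuple(int(part) for part in number.split("."))


def supports_allocation_explain(number: str) -> bool:
    """True if the Elasticsearch version is 5.0 or later, per the reply above."""
    return parse_version(number) >= (5, 0)


# Use version.number, not lucene_version, from the root endpoint response.
print(supports_allocation_explain("2.2.2"))  # False
print(supports_allocation_explain("5.6.4"))  # True
```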


Yup, posted the answer, read documentation some more (the very basics, honestly). You are right. Thank you. Back to reading and looking through various logs for us then.

How many indices and shards do you have in the cluster? How much data?

_cluster/stats says 5776 indices and 42616 shards, with store.size_in_bytes at 21998772983623 (almost 22 TB).

That is a lot of shards for that data volume. The average shard size is only around 500 MB. Having a lot of small shards can be very inefficient, so I would recommend reading this blog post about shards and sharding practices and then trying to reduce the shard count.
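The arithmetic behind that estimate, as a quick sketch using the numbers reported from _cluster/stats above. The function name is made up for illustration, and the shard count is assumed to include replicas:

```python
def average_shard_size_bytes(store_size_bytes: int, shard_count: int) -> float:
    """Average on-disk size per shard, in bytes."""
    return store_size_bytes / shard_count


# Figures quoted from _cluster/stats earlier in this thread.
STORE_SIZE_BYTES = 21_998_772_983_623
SHARDS = 42_616
INDICES = 5_776

avg = average_shard_size_bytes(STORE_SIZE_BYTES, SHARDS)
print(f"average shard size: {avg / 1e6:.0f} MB ({avg / 2**20:.0f} MiB)")
print(f"shards per index:   {SHARDS / INDICES:.1f}")
```

This works out to roughly 516 MB per shard and a little over 7 shards per index.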

Will do. Thank you!
