Elasticsearch Cluster Status is RED

Hi guys,
I need advice on fixing the following error from _cluster/allocation/explain. I tried to fix it by running _cluster/reroute?retry_failed=true, but no luck. I'd appreciate your advice, thanks so much.

{
  "index" : "shrink-customer-logs",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2021-03-24T04:20:42.665Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      ....
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "replica_after_primary_active",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        },
        {
          "decider" : "throttling",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        }
      ]
    }
  ]
}
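For anyone hitting the same response: the deciders that voted NO can be pulled out with a short script. This is only a sketch; the embedded JSON is an abridged version of the response above, and the node_name value is a made-up placeholder.

```python
import json

# Abridged _cluster/allocation/explain response from the post above.
# "node-1" is a hypothetical node name standing in for the elided part.
explain = json.loads("""
{
  "index": "shrink-customer-logs",
  "shard": 0,
  "primary": false,
  "current_state": "unassigned",
  "can_allocate": "no",
  "node_allocation_decisions": [
    {
      "node_name": "node-1",
      "node_decision": "no",
      "deciders": [
        {"decider": "replica_after_primary_active",
         "decision": "NO",
         "explanation": "primary shard for this replica is not yet active"}
      ]
    }
  ]
}
""")

# Print every decider that blocked allocation, per node.
for node in explain["node_allocation_decisions"]:
    for d in node["deciders"]:
        if d["decision"] == "NO":
            print(f'{node["node_name"]}: {d["decider"]} -> {d["explanation"]}')
```

The key signal here is replica_after_primary_active: the replica cannot go anywhere until its primary is active, so the primary is the copy to investigate.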

What is the full output of the cluster stats API? Have you by any chance reached the shard limit per node? Are you perhaps running out of disk space?


Thanks Christian for your reply; I will send the output over soon. In the meantime, how would I check whether the ES cluster has reached the shard limit per node? Initially we were running out of disk space, but that is no longer the case.

Strangely, I can't see the respective index in our cluster.

Please advise, thanks.

The default limit is 1000 shards per data node, but you should ideally aim to be far below this.
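As a rough sanity check, shards per data node is just total shards divided by the number of data nodes; in 7.x the soft limit is the cluster.max_shards_per_node setting, which defaults to 1000. A quick sketch using the numbers posted later in this thread:

```python
# Numbers from the cluster stats output posted in this thread.
total_shards = 382
data_nodes = 6
max_shards_per_node = 1000  # default for cluster.max_shards_per_node in 7.x

shards_per_node = total_shards / data_nodes
print(f"~{shards_per_node:.0f} shards per data node "
      f"(limit {max_shards_per_node})")
```

At roughly 64 shards per data node, this cluster is nowhere near the limit, so the shard count is not what is blocking allocation here.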


Hi Christian, here is the cluster stats API output:

{
  "_nodes" : {
    "total" : 6,
    "successful" : 6,
    "failed" : 0
  },
  "cluster_name" : "cluster_name",
  "cluster_uuid" : "cluster_uid",
  "timestamp" : 1622094324687,
  "status" : "red",
  "indices" : {
    "count" : 61,
    "shards" : {
      "total" : 382,
      "primaries" : 191,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 12,
          "avg" : 6.262295081967213
        },
        "primaries" : {
          "min" : 1,
          "max" : 6,
          "avg" : 3.1311475409836067
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : xxx,
      "deleted" : xxx
    },
    "store" : {
      "size_in_bytes" : xx
    },
    "fielddata" : {
      "memory_size_in_bytes" : 76432,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 170881133,
      "total_count" : 21561575,
      "hit_count" : 1571964,
      "miss_count" : 19989611,
      "cache_size" : 27519,
      "cache_count" : 177753,
      "evictions" : 150234
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 6626,
      "memory_in_bytes" : xx,
      "terms_memory_in_bytes" : 2445657104,
      "stored_fields_memory_in_bytes" : 752527712,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 7775296,
      "points_memory_in_bytes" : 61055859,
      "doc_values_memory_in_bytes" : 12895000,
      "index_writer_memory_in_bytes" : 1158230110,
      "version_map_memory_in_bytes" : 95079249,
      "fixed_bit_set_memory_in_bytes" : 1393952,
      "max_unsafe_auto_id_timestamp" : 1622073607345,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 6,
      "data" : 6,
      "coordinating_only" : 0,
      "master" : 6,
      "ingest" : 6
    },
    "versions" : [
      "7.2.0"
    ],
    "os" : {
      "available_processors" : xx,
      "allocated_processors" : xx,
      "names" : [
        {
          "name" : "Linux",
          "count" : 6
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "xx",
          "count" : 6
        }
      ],
      "mem" : {
        "total_in_bytes" : 201401745408,
        "free_in_bytes" : 6521278464,
        "used_in_bytes" : 194880466944,
        "free_percent" : 3,
        "used_percent" : 97
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 7
      },
      "open_file_descriptors" : {
        "min" : 1624,
        "max" : 1710,
        "avg" : 1646
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 1905769850,
      "versions" : [
        {
          "version" : "12.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "12.0.1+12",
          "vm_vendor" : "Oracle Corporation",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 6
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 51646454896,
        "heap_max_in_bytes" : 100491853824
      },
      "threads" : 780
    },
    "fs" : {
      "total_in_bytes" : 3219627466752,
      "free_in_bytes" : 2102545743872,
      "available_in_bytes" : 2102545743872
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "netty4" : 1
      },
      "http_types" : {
        "netty4" : 1
      }
    },
    "discovery_types" : {
      "zen" : 6
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "tar",
        "count" : 6
      }
    ]
  }
}

That looks fine. Are there any errors in the Elasticsearch logs?

Sorry for my late reply, Christian. There are plenty of errors; are there any specific errors that you are after?

If the log does not fit here, please store it as a gist or use pastebin. I am not sure what type of error I am expecting, which is why I asked for the logs.

Sigh, I'm having second thoughts about sending you the logs; for security reasons I won't be able to share them :frowning:

If you cannot share the full logs then please share some of the errors. As the stats look fine, it is hard to guess what could be wrong.

Ok Christian. I'm off work now and will send them through tomorrow.

Thank you so much for following up on my issue, really appreciate it.

Hello Christian, I can see many of the following errors, but I don't think they relate to the issue. I also ran _cat/shards/shrink*?v; see the output below.

"current step [{"phase":"cold","action":"complete","name":"complete"}] for index-xxx with policy xxx is not recognised"

"failed to execute [search] input for watch xxx, reasons [all shards failed]"

_cat/shards/shrink*?v

index    shard prirep state
shrink-1 0     p      started
shrink-1 0     r      started
shrink-2 0     r      unassigned
shrink-2 0     r      unassigned
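For listing every unassigned copy (and _cat/shards can also emit an unassigned.reason column via the h= parameter, which tells you why), the output can be filtered with a small script. A sketch using the rows above:

```python
# Sample _cat/shards rows as pasted above: index, shard, prirep, state.
cat_shards = """\
shrink-1 0 p started
shrink-1 0 r started
shrink-2 0 r unassigned
shrink-2 0 r unassigned
"""

unassigned = []
for line in cat_shards.splitlines():
    index, shard, prirep, state = line.split()
    if state == "unassigned":
        kind = "primary" if prirep == "p" else "replica"
        unassigned.append((index, int(shard), kind))

for index, shard, kind in unassigned:
    print(f"{index} shard {shard}: {kind} unassigned")
# Replicas stay unassigned until their primary is active, so if a
# primary copy is also missing, that is the one to investigate first.
```

In this case both unassigned copies of shrink-2 are replicas, which matches the replica_after_primary_active explanation: the question becomes where the primary of shrink-2 went.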

I think the "unassigned" state is what caused the error "primary shard for this replica is not yet active" when I used /allocation/explain.

Please advise, Christian. Thanks again.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.