Cannot recover index - store.found: false

After a full cluster restart, some of my indices remained red. Among them a few had replicas...

The cluster allocation explain API tells me that a pri 3, rep 1 index is missing its primary shard 0; on every node I get:

      "node_decision": "no",
      "store": {
        "found": false
      }

Well OK, I have a replica! But alas, it cannot be allocated:

          "decision": "NO",
          "explanation": "primary shard for this replica is not yet active"
        }

So... what is the proper way to handle this? I guess it has something to do with the cluster reroute API?
I just want to promote replica 0 to primary and forget about the loss of the original primary 0.
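For reference, I found that the reroute API has an allocate_stale_primary command, though as far as I understand it this only works if some node still holds an on-disk copy of the shard, and accept_data_loss acknowledges that a stale copy may be missing recent writes. (Index and node names below are placeholders:)

POST _cluster/reroute
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "FILL_IN_INDEX_NAME_HERE",
        "shard": 0,
        "node": "FILL_IN_NODE_NAME_HERE",
        "accept_data_loss": true
      }
    }
  ]
}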

Ok, it seems like those indices with replicas recover after some time, despite the initial error message.

However, I still have indices with 0 replicas where the cluster allocation explain API reports one of the primary shards as:

      "node_decision": "no",
      "store": {
        "found": false
      }

It looks like it's missing shard data, but it was nothing more than a simple node restart... How can I restore this shard?

Also, is there a recommended way to restart nodes other than via systemctl restart?

Maybe stop, wait for the java process to disappear, then start? Is systemctl restart ill-advised?

systemctl restart is reasonable, and shouldn't cause this situation. I think something else is wrong with your setup. Can you share the full output of this command?

GET _cluster/allocation/explain
{"index":"FILL_IN_INDEX_NAME_HERE","shard":0,"primary":true}
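As an aside, the usual full-cluster restart routine for 6.x is to disable shard allocation and do a synced flush before stopping the nodes, then re-enable allocation once they are back. Roughly:

PUT _cluster/settings
{"persistent": {"cluster.routing.allocation.enable": "none"}}

POST _flush/synced

(restart the nodes, e.g. via systemctl)

PUT _cluster/settings
{"persistent": {"cluster.routing.allocation.enable": null}}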

Here's the missing shard of 'myindex-2019-05-16'. (I tried to anonymize the data, but kept all the relevant information...)

  "index": "myindex-2019-05-16",
  "shard": 1,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "NODE_LEFT",
    "at": "2020-03-16T20:20:07.621Z",
    "details": "node_left [wvzK.....nodeid]",
    "last_allocation_status": "no_valid_shard_copy"
  },
  "can_allocate": "no_valid_shard_copy",
  "allocate_explanation": "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
  "node_allocation_decisions": [
    {
      "node_id": "1JRBD......",
      "node_name": "elasticsearch-siteone-07.mydomain.example.com",
      "transport_address": "172.16.141.56:9300",
      "node_attributes": {
        "xpack.installed": "true",
        "box_type": "hot"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "4v4Z...",
      "node_name": "elasticsearch-sitetwo-08.mydomain.example.com",
      "transport_address": "172.16.161.57:9300",
      "node_attributes": {
        "xpack.installed": "true",
        "box_type": "hot"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
#THE ABOVE REPEATS FOR ALL DATA NODES

Ok, if every data node reports store.found: false then this shard is gone. Are you sure you're using storage that persists across restarts?
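One quick sanity check: confirm that every node's path.data actually points at the persistent storage you expect, e.g.:

GET _nodes/settings?filter_path=nodes.*.settings.path.data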

Yes data is stored on raid 10/60 depending on how often it is accessed. I'm not aware of any disk failures. (And this is the second time a cluster restart resulted in a couple of red indices... :frowning: )

Is it possible that the data files are on the node, but are not being read/found by Elasticsearch?

Also, this is version 6.8. And the cluster is overloaded, we are in the process of drastically reducing shard count and data stored in the cluster.

Hang on for a sec, I have an idea...

It seems that only those indices are really affected that were 0 or 1 days old. (The one in the example may be the result of an aborted shrink job(?)).

Also we have ZFS under the cluster, maybe that has something to do with it... (Yeah, I was wrong about RAID10/60, that was the earlier cluster; this one has mirrors.)

So... if I have an index with a shard missing (store.found:false), but I do have the replica of the missing shard... How do I promote that replica to primary?

The cluster allocation explain API says:

          "decision": "NO",
          "explanation": "primary shard for this replica is not yet active"

store.found: false means there is no copy of the shard at all.
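If there really is no surviving copy anywhere and you have no snapshot to restore from, the only way to get the index back to green is to allocate a brand-new empty primary, which permanently discards whatever that shard held. (Index and node names below are placeholders; only do this once you've accepted the data is gone:)

POST _cluster/reroute
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "FILL_IN_INDEX_NAME_HERE",
        "shard": 1,
        "node": "FILL_IN_NODE_NAME_HERE",
        "accept_data_loss": true
      }
    }
  ]
}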

Ok, but what about indices with replica shards? It seems that the replica is available but won't be promoted to primary or assigned at all.

{
  "index": "xx-2020-03-15",
  "shard": 1,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "REPLICA_ADDED",
    "at": "2020-03-17T00:02:54.598Z",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions": [
    {
      "node_id": "1JR...",
      "node_name": "xxx",
      "transport_address": "ip:9300",
      "node_attributes": {
        "xpack.installed": "true",
        "box_type": "hot"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "replica_after_primary_active",
          "decision": "NO",
          "explanation": "primary shard for this replica is not yet active"
        },
        {
          "decider": "throttling",
          "decision": "NO",
          "explanation": "primary shard for this replica is not yet active"
        }
      ]
    },

No, that's not what this means. Until the shards are allocated there's not really any difference between primaries and replicas, and we don't even bother looking for possible replicas until we've assigned a primary. So store.found: false means that there's no copy of this shard, neither primary nor replica.

Ok, thank you for your help. I guess we need to investigate further how this could have happened.

Okay, this was very stupid on my part: one of the 20+ nodes was actually down. I trusted our automation system too much and didn't look close enough.
