Data node lost, all shards go to RED - Data node returns but shards lost forever

This happened to me on a test bed running the 6.7 ELK stack on Kubernetes (3 ingest nodes, 3 master nodes, and 3 data nodes with 32TB each), with the index configuration set to 5 shards and 1 replica.

** OBSERVATION
Kibana renders a JSON error: "message":"all shards failed: [search_phase_execution_exception] all shards failed"
** API CALLS AND RESULTS:
_cat/indices: indeed showed all indices were red.
_cluster/allocation/explain: listed one index shard as "unassigned" with reason "NODE_LEFT", "no_valid_shard_copy"
_cat/shards: showed that every index had at least one shard whose primary and replica were both "UNASSIGNED" (roughly what I ran is sketched below)
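For reference, the requests looked roughly like this (shown here as curl for clarity; host and port are placeholders, and I was actually using Postman):

curl -s 'localhost:9200/_cat/indices?v&health=red'
curl -s 'localhost:9200/_cluster/allocation/explain?pretty'
curl -s 'localhost:9200/_cat/shards?v' | grep UNASSIGNED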

** POD INVESTIGATION:
By the time I discovered that the data node had gone down, it was already back up. So k8s showed the data node back in the cluster, and the logs on that node showed no indication of anything bad. The other two data nodes had logged a 'cannot reach data node 2' exception at the same time. Opening a terminal into the datanode2 container, I looked through its data folder: /usr/share/elasticsearch/data/nodes/0/indices had a large number of indices on disk.

** THE "FIX"
I tried closing and reopening indices, hoping Elasticsearch would find the copies that appeared to be on datanode2. This did nothing: shards still missing, indices still red.
A lot of reading online didn't turn up anything obvious that would "rediscover" those lost shards on datanode2, so I turned my attention to just getting the cluster back to green. After much reading I found the reroute API and used "allocate_empty_primary" on each and every red index (which was nearly all of them).
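For the record, each reroute call looked roughly like this (index, shard and node values are illustrative; accept_data_loss must be set to true precisely because this discards whatever data the shard held):

curl -s -X POST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "index1",
        "shard": 1,
        "node": "datanode2",
        "accept_data_loss": true
      }
    }
  ]
}'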

** THE REMAINING QUESTIONS
So a few questions came out of this for me:

  1. Why, when I only lost one data node, did my indices go red, and why did I lose entire shard sets (primary and replica)? I can understand losing one or the other, but not both, since ES never allocates a primary and its replica to the same node.
  2. Why, when the node came back online, did ES not eventually rediscover the unassigned shards?
  3. Rerouting the shards got me back to green, at the cost of data loss, but how could I have forced ES to search the rejoined data node for the missing primaries and replicas?

Any help/insight would be greatly appreciated. Thank you in advance!

Unfortunately, I think the answers to these questions were lost when you used the destructive allocate_empty_primary command. Elasticsearch would normally not report red health after losing a single node, assuming the cluster started out green and every index had at least one replica, and it would normally rediscover shard copies when a node rejoins the cluster.
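(That assumption is easy to check with the cat indices API and explicit columns, for instance; host and port are placeholders:)

curl -s 'localhost:9200/_cat/indices?v&h=index,health,pri,rep'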

Here's an article on diagnosing unassigned shards that would have helped to answer your questions:


Hi David. Thank you for the link and immediate answer, but that is exactly the article that I was using for guidance. The real question that I couldn't get an answer for is: why didn't those shards come back online after the data node rejoined the cluster?
I know for sure it had come back up: the logs were showing communication to node 3 and the exceptions had ceased. Looking at node 3's file system I could see all the folders and files were still there. After waiting and waiting I saw no indication that those shards would be discovered and the "unassigned" label removed. How can I:
A) poll the system for current activity on bringing an old node back in?
B) force it to actively search for, and acknowledge, found nodes/shards?

Ok, can you share the full outputs from the allocation explain API that you were receiving? I should also point out that the article directs you towards the allocate_empty_primary only in the event that...

all nodes holding copies of this particular shard are all permanently dead

... which wasn't the case here.

Many APIs, e.g. GET _cluster/health or GET _cat/nodes, will tell you whether the new node has rejoined the cluster.
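For example (host and port are placeholders):

curl -s 'localhost:9200/_cat/nodes?v'            # the rejoined node should appear in this list
curl -s 'localhost:9200/_cluster/health?pretty'  # number_of_data_nodes should return to 3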

This is automatic. There is no need to force this.
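(One related thing worth ruling out is that shard allocation has not been disabled cluster-wide, which is a common reason for shards staying unassigned; it would show up in the cluster settings, e.g.:)

curl -s 'localhost:9200/_cluster/settings?pretty'   # look for cluster.routing.allocation.enable under persistent or transient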

Given that this is a test-bed installation, can you reproduce the situation so we can investigate it further?

One other thing that sometimes happens in containerised environments is to accidentally configure Elasticsearch to write its data somewhere ephemeral inside the container, rather than to the mounted filesystem that persists across restarts. Can you confirm that this isn't the case here? Do you see any files in the persistent filesystem that have been recently modified, for instance? When the node starts up it writes a log line like this:

[2019-06-21T06:52:14,630][INFO ][o.e.n.Node               ] [node-0] node name [node-0], node ID [Tb0eYtmcS1mjwHAqhV_L-Q]

Does the node ID (the random string at the end) remain the same across restarts?
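In a Kubernetes setup, one way to compare this before and after a restart is to grep the pod logs for that startup line (pod name and namespace here are just examples):

kubectl logs elasticsearch-data-2 | grep 'node ID'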

You mentioned that you have 32TB storage per node (which is quite a lot for a single Elasticsearch node) and a lot of indices. How many indices and shards do you have in the cluster? How much data do these hold? Also, what type of storage are you using?

If you have a very large number of indices and data, recovery and required cluster state updates can be slow. Are you using the default settings for recoveries?
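(The effective values, including defaults such as indices.recovery.max_bytes_per_sec and cluster.routing.allocation.node_concurrent_recoveries, can be listed with the cluster settings API; host and port are placeholders:)

curl -s 'localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep -E 'recovery|concurrent'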

Hi Christian, it's actually not much data: around 32GB across only around 300 indices. The documents are pretty small (1KB-2MB), so one of my to-dos is to lower the default of 5 shards down to 1 or 2; I don't see any reason for 5 with documents this small.
Storage is network-attached SCSI on a VM.
You're asking about "recoveries", which doesn't sound too familiar. I was using the unassigned count as an indicator of recovery: despite the data node being back online for hours, the unassigned count did not change, so I assumed that the shards and replicas weren't being recovered. Is there something else I should be looking at to monitor recovery?
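(For reference, in-flight shard recoveries can also be watched directly, rather than inferred from the unassigned count; host and port are placeholders:)

curl -s 'localhost:9200/_cat/recovery?v&active_only=true'    # shards currently being recovered
curl -s 'localhost:9200/_cat/shards?v' | grep -c UNASSIGNED  # how many are still waiting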

Re Output From the Explain:
I love that Postman saves the sessions.
Yeah, here's the explain output, though it's quite a bit. NOTE: I noticed that despite nearly all 300 or so indices being red, explain picks one shard and reports on it. If I 'clean' one up, then the next one is reported (where "clean up" means allocating an empty primary to get it green).

{
    "index": "index1",
    "shard": 1,
    "primary": true,
    "current_state": "unassigned",
    "unassigned_info": {
        "reason": "NODE_LEFT",
        "at": "2019-06-19T05:47:51.571Z",
        "details": "node_left [QaYHdv4CTAWKrkOINIj8-A]",
        "last_allocation_status": "no_valid_shard_copy"
    },
    "can_allocate": "no_valid_shard_copy",
    "allocate_explanation": "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
    "node_allocation_decisions": [
        {
            "node_id": "OZv9BMGiQ3a-VDmyyTKPVA",
            "node_name": "datanode2",
            "transport_address": "10.244.64.6:9300",
            "node_attributes": {
                "ml.machine_memory": "63039193088",
                "ml.max_open_jobs": "20",
                "xpack.installed": "true",
                "ml.enabled": "true"
            },
            "node_decision": "no",
            "store": {
                "found": false
            }
        },
        {
            "node_id": "vrbCqsLBSB6JQonDeXuYvw",
            "node_name": "datanode1",
            "transport_address": "10.244.58.27:9300",
            "node_attributes": {
                "ml.machine_memory": "63039193088",
                "ml.max_open_jobs": "20",
                "xpack.installed": "true",
                "ml.enabled": "true"
            },
            "node_decision": "no",
            "store": {
                "found": false
            }
        },
        {
            "node_id": "x-ryrs9fQv2Xnnwgfjrj6A",
            "node_name": "datanode0",
            "transport_address": "10.244.43.6:9300",
            "node_attributes": {
                "ml.machine_memory": "63039184896",
                "ml.max_open_jobs": "20",
                "xpack.installed": "true",
                "ml.enabled": "true"
            },
            "node_decision": "no",
            "store": {
                "found": false
            }
        }
    ]
}

Re _cluster/health and _cat/nodes:
Yeah, neither indicated that shards were being brought in, though I DID see the third data node listed. Unfortunately it doesn't look like Postman saved the _cluster/health and _cat/nodes responses.

Re: does the node ID remain the same across restarts?
I hadn't tried. Since this is a k8s cluster, I have to delete the pod, which starts up a new one...
Node Name Before "Restart"

node name [elasticsearch-data-2], node ID [O_5rqmiQTTqtTbzUkr220w]

Node Name After "Restart"

node name [elasticsearch-data-2], node ID [vijOqGRVShycwYSn5jFdRg]

So the answer is no: the node ID changes when a pod goes down and comes back up. Is this, maybe, the root of the problem? That these shards are expected to be on "O_5rqmiQTTqtTbzUkr220w", a node which is now "missing"?

NOTE: with the above action, my cluster health is now YELLOW and the unassigned shard count is 686 but is slowly dropping (as expected).

{
    "cluster_name": "elasticsearch",
    "status": "yellow",
    "timed_out": false,
    "number_of_nodes": 16,
    "number_of_data_nodes": 3,
    "active_primary_shards": 1076,
    "active_shards": 1462,
    "relocating_shards": 0,
    "initializing_shards": 4,
    "unassigned_shards": 686,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 67.93680297397769
}

Quick update: dropping the pod and forcing k8s to bring up a new one behaved exactly as I would expect:

  • ES Node falls out of cluster
  • Cluster health goes to Yellow, 686 unassigned shards, Data_Nodes: 2
  • K8S starts up new POD
  • Upon re-entry, the "new" ES Node is detected by the cluster
  • Cluster health still Yellow, 500 unassigned shards, Data_Nodes: 3
  • 7 mins later Cluster health Green, 0 unassigned shards, Data_Nodes: 3

Yes, that indicates that each time the pod starts up its data path is empty, and therefore that the data is not persisting across restarts. It's very important that both data and master-eligible nodes are using persistent storage.
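A couple of quick ways to verify this in a Kubernetes setup (pod name and data path are taken from this thread; adjust names and namespace as needed):

kubectl get pvc                                                            # the volume claims should exist and be Bound
kubectl exec elasticsearch-data-2 -- df -h /usr/share/elasticsearch/data   # the data path should be on the mounted volume, not the container's overlay filesystem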

The allocation explain output is consistent with this: it lists all three data nodes but indicates that none of them has a copy of the shard in question. Not even a stale or corrupt copy; they're simply not there.

I suspect one of the nodes restarted while the cluster health was already yellow, indicating missing replicas, causing you to lose both copies of some of the shards.

1 Like

Thank you Christian and David, we finally got to the bottom of this. It was, in fact, that the data was not being written to persistent storage. For us the problem was that the VolumeClaimTemplate was allocating the space, and it was getting mounted, but the data-folder configuration was not being accepted, so ES was falling back to the default location of /usr/share/elasticsearch/data.
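(For anyone hitting the same thing: the data path each node is actually writing to can be confirmed from the node stats API, with something like the following; host and port are placeholders:)

curl -s 'localhost:9200/_nodes/stats/fs?pretty' | grep -E '"path"|"mount"'   # shows each node's data path and the mount it sits on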

Thank you for all of your help!

