All Shards Unassigned due to Data Node Restarts

We are using Elasticsearch version 6.2.4 deployed on Kubernetes, with 2 data nodes, 3 master nodes, and 1 replica per shard.

All of our shards are currently unassigned and unable to be reassigned.

curl -X GET "http://$ES_HOST/_cluster/health?pretty"
{
  "cluster_name" : "es-cluster-dogfood",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 7,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 20,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 0.0
}

It seems that both of our data nodes restarted within a few minutes of each other.

Logs from the Master:

2019-04-29 11:13:35.474 PDT[2019-04-29T18:13:35,474][INFO ][o.e.c.r.a.AllocationService] [deployment-es-master-dogfood-f657f77d5-ldbbb] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{deployment-es-master-dogfood-f657f77d5-8msdl}{hmLZaF8BSl-wadr7XCGhyw}{uvBJAIuZT2-rlY4TxgvLHA}{10.20.153.8}{10.20.153.8:9300} transport disconnected, {statefulset-es-data-1}{-anHNjfhSHCn_m3mF731uQ}{Wo4RU_3vQyy0dVaN916M2Q}{10.20.153.9}{10.20.153.9:9300} transport disconnected]).

2019-04-29 11:16:08.067 PDT[2019-04-29T18:16:08,066][INFO ][o.e.c.r.a.AllocationService] [deployment-es-master-dogfood-f657f77d5-ldbbb] Cluster health status changed from [YELLOW] to [RED] (reason: [{statefulset-es-data-0}{VDpIAAVwQ-yvXjBw4Kvulg}{DbCjdJOvSqqmaWvYdAs3Kg}{10.20.164.3}{10.20.164.3:9300} left]).

This corresponds to logs from each data node showing it re-initializing at those same times. Now none of our shards will reassign to nodes. The allocation explain API for one of our indices shows this:

{
  "index" : "auditlog",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2019-04-29T18:16:08.061Z",
    "details" : "node_left[VDpIAAVwQ-yvXjBw4Kvulg]",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
  "node_allocation_decisions" : [
    {
      "node_id" : "Vb4Su_pxTa2IaZ4F4FPDxw",
      "node_name" : "statefulset-es-data-1",
      "transport_address" : "10.20.153.9:9300",
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "pIYGxPfuTOqHxMxPDyiccg",
      "node_name" : "statefulset-es-data-0",
      "transport_address" : "10.20.164.6:9300",
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    }
  ]
}
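For reference, output like the above comes from the cluster allocation explain API; a request along these lines (index and shard taken from the response above) should reproduce it:

curl -X GET "http://$ES_HOST/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "auditlog",
  "shard": 0,
  "primary": true
}
'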

Even using allocate_stale_primary does not seem to work. This request:

curl -X POST "http://$ES_HOST/_cluster/reroute?pretty"  -H 'Content-Type: application/json' -d'
{
"commands": [{
        "allocate_stale_primary": {
            "index": "counter",
            "shard": 0,
            "node": "statefulset-es-data-0",
            "accept_data_loss": true
        }
    }]
}
'

results in the following failure when the shard tries to recover:

"unassigned_info" : {
             "reason" : "ALLOCATION_FAILED",
             "at" : "2019-04-30T01:05:32.469Z",
             "failed_attempts" : 3,
             "delayed" : false,
             "details" : "failed shard on node [pIYGxPfuTOqHxMxPDyiccg]: failed recovery, failure RecoveryFailedException[[counter][0]: Recovery failed on {statefulset-es-data-0}{pIYGxPfuTOqHxMxPDyiccg}{bTGu_2LQRZebahRunoedcw}{10.20.164.6}{10.20.164.6:9300}]; nested: IndexShardRecoveryException[failed to fetch index version after copying it over]; nested: IndexShardRecoveryException[shard allocated for local recovery (post api), should exist, but doesn't, current files: []]; nested: FileNotFoundException[no segments* file found in store(MMapDirectory@/data/data/nodes/0/indices/vNEkqOlKQPKFZ1kkIEAohw/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@530e9c6e): files: []]; ",
             "allocation_status" : "no_valid_shard_copy"
           }
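In case it helps with debugging, the shard stores API can also be used to see which on-disk copies of that index the cluster knows about (same index name as in the reroute above; status=all includes healthy shards as well):

curl -X GET "http://$ES_HOST/counter/_shard_stores?status=all&pretty"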

It seems like we incurred some sort of data corruption by having both of our data nodes go down at the same time. Is there anything else I can do to debug this, or to try to get these indices back up? We are OK with some data loss, but would prefer not to lose our entire index. We are also quite concerned about the stability and resiliency of the system if there is corruption like this every time a shard has all of its copies go down concurrently.

Thanks

I suspect that you have configured your data nodes to use only ephemeral in-container storage, meaning that each time they restart their data directories are emptied. For instance, the ID of the node that last held a copy of the [auditlog][0] shard is different from the ID of either node that is currently in your cluster:

    "details" : "node_left[VDpIAAVwQ-yvXjBw4Kvulg]",
vs
      "node_id" : "Vb4Su_pxTa2IaZ4F4FPDxw",
      "node_id" : "pIYGxPfuTOqHxMxPDyiccg",

The only time the node ID changes is if the data directory is emptied.
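If you want to check this yourself, you can list the current node IDs and compare them across restarts; something like the following should work (full_id=true prints the complete IDs rather than the abbreviated ones, and $ES_HOST is the same host variable used in your commands above):

curl -X GET "http://$ES_HOST/_cat/nodes?v&h=id,name,ip&full_id=true"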

If you want your data to persist across restarts then you should store it outside of the containers. The master-eligible nodes also require persistent storage.


Hi David,

Thanks so much for the quick reply! I believe that has helped us identify the problem.

We are using ephemeral nodes for our dogfood instance, but we have persistent volumes attached:

volumeMounts:
- mountPath: /usr/share/elasticsearch/data
  name: es-storage
...
...
volumeClaimTemplates:
  - metadata:
      creationTimestamp: null
      name: es-storage
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 256Gi
      storageClassName: elasticsearch-storageclass
    status:
      phase: Pending

As you can see, these persistent volumes are mounted at /usr/share/elasticsearch/data. I poked around on our prod instance, where we have the same setup but are not using ephemeral nodes, and noticed that the data is actually being stored in /data instead of on that persistent volume. So I think you are correct: the data is getting wiped every time the node restarts. We are going to change our mount location to /data and see if this fixes our problem.
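For the record, the planned change is roughly the following; this assumes path.data in our Elasticsearch configuration really does point at /data, which is what we saw on disk:

volumeMounts:
- mountPath: /data
  name: es-storage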

I have one question, just to confirm my understanding is correct. We have had this setup for a while now without issue, and we have the same setup in prod without issue either. Does this complete data loss only occur when both of our nodes go down at the same time, because then both data directories get wiped at once? If they were to go down at separate times, presumably the copy on the node that stayed up would be promoted to primary, and the data would be copied back over to the restarted node as a new replica?

Thanks again for all the help. I am going to switch the configuration tomorrow and confirm that things are fixed. I can try manually taking down both of the es-data nodes at the same time to confirm there is no data loss.
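One possible way to run that test, assuming the pod names follow the StatefulSet naming from the logs above, is to delete both data pods at once and let the StatefulSet controller recreate them:

kubectl delete pod statefulset-es-data-0 statefulset-es-data-1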
