Index with multiple replicas turned red when node with primary went down

Hi all,

I saw an unusual issue in our cluster: one of the indices configured with 1p:2r turned red when the node holding the primary shard went down. By the time I checked, the node was already back in the cluster and the state had returned to green. The cluster stayed red for at least 6-8 minutes and, as far as I understand, only recovered because the failed node came back.

I have added the logs related to the index (abc__events-2023.10.18) below. The problem appears to start around "2023-10-18T08:41:30,766".

Does anyone have thoughts on why the index went red, when it should only have gone yellow given that multiple replicas were available on other nodes?

[2023-10-18T00:00:32,140][INFO ][o.e.c.m.MetadataCreateIndexService] [elastic-node-eastus2-3-vm-0] [abc__events-2023.10.18] creating index, cause [auto(bulk api)], templates [events-template], shards [1]/[2]
[2023-10-18T00:00:33,364][INFO ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[abc__events-2023.10.18][0]]]).
[2023-10-18T08:41:30,766][WARN ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] failing shard [failed shard, shard [abc__events-2023.10.18][0], node[Jk9Y9EtDSmS_XXXXXX], [R], s[STARTED], a[id=gfggfYtfR7ajKXd4kj73Gg], message [failed to perform indices:data/write/bulk[s] on replica [abc__events-2023.10.18][0], node[Jk9Y9EtDSmS_XXXXXX], [R], s[STARTED], a[id=gfggfYtfR7ajKXd4kj73Gg]], failure [IndexShardClosedException[CurrentState[CLOSED] Primary closed.]], markAsStale [true]]
[2023-10-18T08:41:31,023][WARN ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] failing shard [failed shard, shard [abc__events-2023.10.18][0], node[pA_ggBT3Q9mmUdXXXX], [R], s[STARTED], a[id=tH7xi2bAQAi1ce2PwJ5ylQ], message [failed to perform indices:data/write/bulk[s] on replica [abc__events-2023.10.18][0], node[pA_ggBT3Q9mmUdXXXX], [R], s[STARTED], a[id=tH7xi2bAQAi1ce2PwJ5ylQ]], failure [IndexShardClosedException[CurrentState[CLOSED] Primary closed.]], markAsStale [true]]
[2023-10-18T08:41:31,024][WARN ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] failing shard [failed shard, shard [abc__events-2023.10.18][0], node[TN3qE5YWRMaji3pWXXXX], [P], s[STARTED], a[id=lCe4tac4Q-q8pl0gRMTETw], message [shard failure, reason [already closed by tragic event on the translog]], failure [IOException[Read-only file system]], markAsStale [true]]
[2023-10-18T08:41:31,032][INFO ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] Cluster health status changed from [YELLOW] to [RED] (reason: [shards failed [[abc__events-2023.10.18][0], [abc__events-2023.10.18][0]]]).
[2023-10-18T08:42:00,633][WARN ][r.suppressed             ] [elastic-node-eastus2-3-vm-0] path: /abc__events-2023.10.18/_search, params: {typed_keys=true, max_concurrent_shard_requests=5, ignore_unavailable=true, expand_wildcards=open,closed, allow_no_indices=true, index=abc__events-2023.10.18, search_type=query_then_fetch, batched_reduce_size=512}
[2023-10-18T08:43:01,869][WARN ][r.suppressed             ] [elastic-node-eastus2-3-vm-0] path: /abc__events-2023.10.18/_search, params: {typed_keys=true, max_concurrent_shard_requests=5, ignore_unavailable=true, expand_wildcards=open,closed, allow_no_indices=true, index=abc__events-2023.10.18, search_type=query_then_fetch, batched_reduce_size=512}
[2023-10-18T08:43:41,014][WARN ][r.suppressed             ] [elastic-node-eastus2-3-vm-0] path: /abc__events-2023.10.18/_search, params: {typed_keys=true, max_concurrent_shard_requests=5, ignore_unavailable=true, expand_wildcards=open,closed, allow_no_indices=true, index=abc__events-2023.10.18, search_type=query_then_fetch, batched_reduce_size=512}
[2023-10-18T08:47:01,243][WARN ][r.suppressed             ] [elastic-node-eastus2-3-vm-0] path: /abc__events-2023.10.18/_search, params: {typed_keys=true, max_concurrent_shard_requests=5, ignore_unavailable=true, expand_wildcards=open,closed, allow_no_indices=true, index=abc__events-2023.10.18, search_type=query_then_fetch, batched_reduce_size=512}
[2023-10-18T08:47:14,527][WARN ][r.suppressed             ] [elastic-node-eastus2-3-vm-0] path: /abc__events-2023.10.18/_search, params: {typed_keys=true, max_concurrent_shard_requests=5, ignore_unavailable=true, expand_wildcards=open,closed, allow_no_indices=true, index=abc__events-2023.10.18, search_type=query_then_fetch, batched_reduce_size=512}
[2023-10-18T08:48:01,088][WARN ][r.suppressed             ] [elastic-node-eastus2-3-vm-0] path: /abc__events-2023.10.18/_search, params: {typed_keys=true, max_concurrent_shard_requests=5, ignore_unavailable=true, expand_wildcards=open,closed, allow_no_indices=true, index=abc__events-2023.10.18, search_type=query_then_fetch, batched_reduce_size=512}
[2023-10-18T08:48:27,848][INFO ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[abc__events-2023.10.18][0]]]).
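
For reference, the red window can be measured directly from the two health-transition lines above. This is a minimal sketch that assumes the log format shown in this post; the function name `red_windows` and the regular expression are illustrative helpers, not part of Elasticsearch.

```python
import re
from datetime import datetime

# The two health-transition lines, copied verbatim from the log excerpt above.
LOG_LINES = [
    "[2023-10-18T08:41:31,032][INFO ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] Cluster health status changed from [YELLOW] to [RED] (reason: [shards failed [[abc__events-2023.10.18][0], [abc__events-2023.10.18][0]]]).",
    "[2023-10-18T08:48:27,848][INFO ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[abc__events-2023.10.18][0]]]).",
]

# Hypothetical pattern: leading timestamp plus the old and new health status.
TRANSITION = re.compile(
    r"^\[(?P<ts>[^\]]+)\].*Cluster health status changed "
    r"from \[(?P<old>\w+)\] to \[(?P<new>\w+)\]"
)


def red_windows(lines):
    """Return (start, end) datetime pairs for each period spent in RED."""
    windows = []
    red_since = None
    for line in lines:
        match = TRANSITION.match(line)
        if match is None:
            continue  # not a health-transition line
        ts = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S,%f")
        if match["new"] == "RED":
            red_since = ts
        elif match["old"] == "RED" and red_since is not None:
            windows.append((red_since, ts))
            red_since = None
    return windows


windows = red_windows(LOG_LINES)
for start, end in windows:
    print(f"RED from {start} to {end}: {(end - start).total_seconds():.3f} s")
```

For these two lines the window is 416.816 seconds, a little under 7 minutes, which is consistent with the 6-8 minutes estimated above.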

Shards for the index when I checked later:

index                  shard prirep state      docs   store ip            node
abc__events-2023.10.18 0     r      STARTED 2236914   694mb 192.168.XX.XX elastic-node-eastus2-3-vm-0
abc__events-2023.10.18 0     p      STARTED 2236914 691.6mb 192.168.XX.XX elastic-node-central-1-vm-0
abc__events-2023.10.18 0     r      STARTED 2236914 691.6mb 192.168.XX.XX elastic-node-central-3-vm-0

Which version of Elasticsearch are you using?

7.17.8

What is the full output of the cluster stats API?

{
  "_nodes" : {
    "total" : 8,
    "successful" : 8,
    "failed" : 0
  },
  "cluster_name" : "env-elastic",
  "cluster_uuid" : "VODX2vTFRQKKqgXXXXX",
  "timestamp" : 1697800268559,
  "status" : "green",
  "indices" : {
    "count" : 251,
    "shards" : {
      "total" : 592,
      "primaries" : 251,
      "replication" : 1.3585657370517927,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 3,
          "avg" : 2.358565737051793
        },
        "primaries" : {
          "min" : 1,
          "max" : 1,
          "avg" : 1.0
        },
        "replication" : {
          "min" : 1.0,
          "max" : 2.0,
          "avg" : 1.3585657370517927
        }
      }
    },
    "docs" : {
      "count" : 229304320,
      "deleted" : 162439
    },
    "store" : {
      "size_in_bytes" : 228179253656,
      "total_data_set_size_in_bytes" : 228179253656,
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size_in_bytes" : 234912,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 540185560,
      "total_count" : 245686371,
      "hit_count" : 40125053,
      "miss_count" : 205561318,
      "cache_size" : 463451,
      "cache_count" : 1215293,
      "evictions" : 751842
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 6805,
      "memory_in_bytes" : 21493414,
      "terms_memory_in_bytes" : 10709648,
      "stored_fields_memory_in_bytes" : 4493400,
      "term_vectors_memory_in_bytes" : 141152,
      "norms_memory_in_bytes" : 510144,
      "points_memory_in_bytes" : 0,
      "doc_values_memory_in_bytes" : 5639070,
      "index_writer_memory_in_bytes" : 852264708,
      "version_map_memory_in_bytes" : 85683332,
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : 1697618907933,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "binary",
          "count" : 30,
          "index_count" : 30,
          "script_count" : 0
        },
        {
          "name" : "boolean",
          "count" : 128,
          "index_count" : 80,
          "script_count" : 0
        },
        {
          "name" : "constant_keyword",
          "count" : 6,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "date",
          "count" : 236,
          "index_count" : 176,
          "script_count" : 0
        },
        {
          "name" : "integer",
          "count" : 34,
          "index_count" : 34,
          "script_count" : 0
        },
        {
          "name" : "ip",
          "count" : 14,
          "index_count" : 14,
          "script_count" : 0
        },
        {
          "name" : "keyword",
          "count" : 1350,
          "index_count" : 249,
          "script_count" : 0
        },
        {
          "name" : "long",
          "count" : 212,
          "index_count" : 107,
          "script_count" : 0
        },
        {
          "name" : "object",
          "count" : 71,
          "index_count" : 51,
          "script_count" : 0
        },
        {
          "name" : "text",
          "count" : 943,
          "index_count" : 138,
          "script_count" : 0
        }
      ],
      "runtime_field_types" : [
        {
          "name" : "date",
          "count" : 94,
          "index_count" : 10,
          "scriptless_count" : 94,
          "shadowed_count" : 0,
          "lang" : [ ],
          "lines_max" : 0,
          "lines_total" : 0,
          "chars_max" : 0,
          "chars_total" : 0,
          "source_max" : 0,
          "source_total" : 0,
          "doc_max" : 0,
          "doc_total" : 0
        },
        {
          "name" : "keyword",
          "count" : 20701,
          "index_count" : 12,
          "scriptless_count" : 20701,
          "shadowed_count" : 0,
          "lang" : [ ],
          "lines_max" : 0,
          "lines_total" : 0,
          "chars_max" : 0,
          "chars_total" : 0,
          "source_max" : 0,
          "source_total" : 0,
          "doc_max" : 0,
          "doc_total" : 0
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [
        {
          "name" : "edgeNGram",
          "count" : 59,
          "index_count" : 59
        },
        {
          "name" : "edge_ngram",
          "count" : 18,
          "index_count" : 18
        },
        {
          "name" : "nGram",
          "count" : 11,
          "index_count" : 11
        },
        {
          "name" : "ngram",
          "count" : 12,
          "index_count" : 12
        },
        {
          "name" : "pattern_capture",
          "count" : 25,
          "index_count" : 25
        },
        {
          "name" : "pattern_replace",
          "count" : 33,
          "index_count" : 25
        },
        {
          "name" : "stemmer",
          "count" : 44,
          "index_count" : 44
        },
        {
          "name" : "stop",
          "count" : 224,
          "index_count" : 127
        },
        {
          "name" : "word_delimiter",
          "count" : 53,
          "index_count" : 53
        },
        {
          "name" : "word_delimiter_graph",
          "count" : 21,
          "index_count" : 21
        }
      ],
      "analyzer_types" : [
        {
          "name" : "custom",
          "count" : 393,
          "index_count" : 127
        },
        {
          "name" : "pattern",
          "count" : 24,
          "index_count" : 24
        }
      ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [
        {
          "name" : "classic",
          "count" : 12,
          "index_count" : 12
        },
        {
          "name" : "standard",
          "count" : 151,
          "index_count" : 79
        },
        {
          "name" : "whitespace",
          "count" : 230,
          "index_count" : 75
        }
      ],
      "built_in_filters" : [
        {
          "name" : "cjk_bigram",
          "count" : 8,
          "index_count" : 8
        },
        {
          "name" : "cjk_width",
          "count" : 8,
          "index_count" : 8
        },
        {
          "name" : "lowercase",
          "count" : 393,
          "index_count" : 127
        }
      ],
      "built_in_analyzers" : [ ]
    },
    "versions" : [
      {
        "version" : "7.5.2",
        "index_count" : 97,
        "primary_shard_count" : 97,
        "total_primary_bytes" : 3133320734
      },
      {
        "version" : "7.17.0",
        "index_count" : 154,
        "primary_shard_count" : 154,
        "total_primary_bytes" : 79815253712
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 8,
      "coordinating_only" : 1,
      "data" : 6,
      "data_cold" : 6,
      "data_content" : 6,
      "data_frozen" : 6,
      "data_hot" : 6,
      "data_warm" : 6,
      "ingest" : 6,
      "master" : 7,
      "ml" : 6,
      "remote_cluster_client" : 6,
      "transform" : 6,
      "voting_only" : 1
    },
    "versions" : [
      "7.17.0"
    ],
    "os" : {
      "available_processors" : 52,
      "allocated_processors" : 52,
      "names" : [
        {
          "name" : "Linux",
          "count" : 8
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Ubuntu 20.04.5 LTS",
          "count" : 8
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 8
        }
      ],
      "mem" : {
        "total_in_bytes" : 218623082496,
        "free_in_bytes" : 4690100224,
        "used_in_bytes" : 213932982272,
        "free_percent" : 2,
        "used_percent" : 98
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 2
      },
      "open_file_descriptors" : {
        "min" : 504,
        "max" : 1543,
        "avg" : 1184
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 6310232868,
      "versions" : [
        {
          "version" : "17.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "17.0.1+12",
          "vm_vendor" : "Eclipse Adoptium",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 8
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 46620832296,
        "heap_max_in_bytes" : 111669149696
      },
      "threads" : 685
    },
    "fs" : {
      "total_in_bytes" : 6401678557184,
      "free_in_bytes" : 6171263709184,
      "available_in_bytes" : 5845571399680
    },
    "plugins" : [
      {
        "name" : "repository-azure",
        "version" : "7.17.0",
        "elasticsearch_version" : "7.17.0",
        "java_version" : "1.8",
        "description" : "The Azure Repository plugin adds support for Azure storage repositories.",
        "classname" : "org.elasticsearch.repositories.azure.AzureRepositoryPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false,
        "licensed" : false,
        "type" : "isolated"
      },
      {
        "name" : "ingest-attachment",
        "version" : "7.17.0",
        "elasticsearch_version" : "7.17.0",
        "java_version" : "1.8",
        "description" : "Ingest processor that uses Apache Tika to extract contents",
        "classname" : "org.elasticsearch.ingest.attachment.IngestAttachmentPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false,
        "licensed" : false,
        "type" : "isolated"
      }
    ],
    "network_types" : {
      "transport_types" : {
        "security4" : 8
      },
      "http_types" : {
        "security4" : 8
      }
    },
    "discovery_types" : {
      "zen" : 8
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "deb",
        "count" : 8
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 3,
      "processor_stats" : {
        "attachment" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        }
      }
    }
  }
}

I think this is a bug, and I have opened an issue for it: "Tragic failure of primary marks replicas as stale" (issue #101180 in elastic/elasticsearch on GitHub).
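
One rough way to picture what that issue title describes is the toy model below. It is a simplified sketch, not Elasticsearch's allocation code: the `ShardGroup` class and its methods are hypothetical, and the only assumptions are that a replica failed with `markAsStale [true]` drops out of the in-sync set, that only an in-sync replica can be promoted, and that health is red when no primary is assigned. The allocation IDs are the ones from the log lines in the original post.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ShardGroup:
    """Toy model of one shard's copies, identified by allocation ID."""
    primary: Optional[str]
    replicas: Set[str] = field(default_factory=set)
    in_sync: Set[str] = field(default_factory=set)
    wanted_replicas: int = 2

    def fail_replica(self, alloc_id: str, mark_as_stale: bool) -> None:
        # A failed replica is unassigned; if marked stale it also
        # loses its place in the in-sync set.
        self.replicas.discard(alloc_id)
        if mark_as_stale:
            self.in_sync.discard(alloc_id)

    def fail_primary(self) -> None:
        failed = self.primary
        self.primary = None
        # Only an assigned, in-sync replica may take over.
        candidates = sorted(self.replicas & self.in_sync)
        if candidates:
            self.primary = candidates[0]
            self.replicas.discard(candidates[0])
            self.in_sync.discard(failed)
        # Otherwise the primary stays unassigned until an in-sync
        # copy (here, the failed node's own copy) comes back.

    def health(self) -> str:
        if self.primary is None:
            return "red"
        if len(self.replicas) < self.wanted_replicas:
            return "yellow"
        return "green"


def new_group() -> ShardGroup:
    ids = {"lCe4tac4Q-q8pl0gRMTETw", "gfggfYtfR7ajKXd4kj73Gg", "tH7xi2bAQAi1ce2PwJ5ylQ"}
    return ShardGroup(
        primary="lCe4tac4Q-q8pl0gRMTETw",
        replicas=ids - {"lCe4tac4Q-q8pl0gRMTETw"},
        in_sync=set(ids),
    )


# Expected: only the primary fails, so a replica is promoted.
expected = new_group()
expected.fail_primary()

# Observed in the logs: both replicas are failed and marked stale
# first, then the primary fails with nothing left to promote.
observed = new_group()
observed.fail_replica("gfggfYtfR7ajKXd4kj73Gg", mark_as_stale=True)
observed.fail_replica("tH7xi2bAQAi1ce2PwJ5ylQ", mark_as_stale=True)
observed.fail_primary()

print("expected:", expected.health())
print("observed:", observed.health())
```

In this model the first scenario ends yellow and the second ends red, which matches the order of the three "failing shard" messages and the YELLOW to RED transition in the logs.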

Could you share a complete set of logs, including stack traces, for about 5 minutes either side of those failed shard log messages at 2023-10-18T08:41:30? That'd help us pin down the details more easily.

The logs are too long to post. Is there a way to attach a file here?

Can you share them using https://gist.github.com/?

Please find it here

@DavidTurner, it looks like the file still got truncated.

Thanks, that covers it. The stack trace is not that useful, unfortunately, but it is better than nothing. I've added the relevant info to the GitHub issue.
