Index with multiple replicas turned red when node with primary went down

jaykb77 · October 20, 2023, 6:39am

Hi all,

I saw an unusual issue in our cluster: an index configured with one primary and two replicas (1p:2r) turned red when the node holding the primary shard went down. By the time I checked, the node had already rejoined the cluster and the state was back to green. The cluster stayed red for at least 6-8 minutes and, as far as I understand, only recovered because the failed node came back.

I have added the logs related to the index (abc__events-2023.10.18) below. The problem appears to start around "2023-10-18T08:41:30,766".

Does anyone have an idea why the index went red, when it should only have gone yellow given that replicas were available on other nodes?

[2023-10-18T00:00:32,140][INFO ][o.e.c.m.MetadataCreateIndexService] [elastic-node-eastus2-3-vm-0] [abc__events-2023.10.18] creating index, cause [auto(bulk api)], templates [events-template], shards [1]/[2]
[2023-10-18T00:00:33,364][INFO ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[abc__events-2023.10.18][0]]]).
[2023-10-18T08:41:30,766][WARN ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] failing shard [failed shard, shard [abc__events-2023.10.18][0], node[Jk9Y9EtDSmS_XXXXXX], [R], s[STARTED], a[id=gfggfYtfR7ajKXd4kj73Gg], message [failed to perform indices:data/write/bulk[s] on replica [abc__events-2023.10.18][0], node[Jk9Y9EtDSmS_XXXXXX], [R], s[STARTED], a[id=gfggfYtfR7ajKXd4kj73Gg]], failure [IndexShardClosedException[CurrentState[CLOSED] Primary closed.]], markAsStale [true]]
[2023-10-18T08:41:31,023][WARN ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] failing shard [failed shard, shard [abc__events-2023.10.18][0], node[pA_ggBT3Q9mmUdXXXX], [R], s[STARTED], a[id=tH7xi2bAQAi1ce2PwJ5ylQ], message [failed to perform indices:data/write/bulk[s] on replica [abc__events-2023.10.18][0], node[pA_ggBT3Q9mmUdXXXX], [R], s[STARTED], a[id=tH7xi2bAQAi1ce2PwJ5ylQ]], failure [IndexShardClosedException[CurrentState[CLOSED] Primary closed.]], markAsStale [true]]
[2023-10-18T08:41:31,024][WARN ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] failing shard [failed shard, shard [abc__events-2023.10.18][0], node[TN3qE5YWRMaji3pWXXXX], [P], s[STARTED], a[id=lCe4tac4Q-q8pl0gRMTETw], message [shard failure, reason [already closed by tragic event on the translog]], failure [IOException[Read-only file system]], markAsStale [true]]
[2023-10-18T08:41:31,032][INFO ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] Cluster health status changed from [YELLOW] to [RED] (reason: [shards failed [[abc__events-2023.10.18][0], [abc__events-2023.10.18][0]]]).
[2023-10-18T08:42:00,633][WARN ][r.suppressed             ] [elastic-node-eastus2-3-vm-0] path: /abc__events-2023.10.18/_search, params: {typed_keys=true, max_concurrent_shard_requests=5, ignore_unavailable=true, expand_wildcards=open,closed, allow_no_indices=true, index=abc__events-2023.10.18, search_type=query_then_fetch, batched_reduce_size=512}
[2023-10-18T08:43:01,869][WARN ][r.suppressed             ] [elastic-node-eastus2-3-vm-0] path: /abc__events-2023.10.18/_search, params: {typed_keys=true, max_concurrent_shard_requests=5, ignore_unavailable=true, expand_wildcards=open,closed, allow_no_indices=true, index=abc__events-2023.10.18, search_type=query_then_fetch, batched_reduce_size=512}
[2023-10-18T08:43:41,014][WARN ][r.suppressed             ] [elastic-node-eastus2-3-vm-0] path: /abc__events-2023.10.18/_search, params: {typed_keys=true, max_concurrent_shard_requests=5, ignore_unavailable=true, expand_wildcards=open,closed, allow_no_indices=true, index=abc__events-2023.10.18, search_type=query_then_fetch, batched_reduce_size=512}
[2023-10-18T08:47:01,243][WARN ][r.suppressed             ] [elastic-node-eastus2-3-vm-0] path: /abc__events-2023.10.18/_search, params: {typed_keys=true, max_concurrent_shard_requests=5, ignore_unavailable=true, expand_wildcards=open,closed, allow_no_indices=true, index=abc__events-2023.10.18, search_type=query_then_fetch, batched_reduce_size=512}
[2023-10-18T08:47:14,527][WARN ][r.suppressed             ] [elastic-node-eastus2-3-vm-0] path: /abc__events-2023.10.18/_search, params: {typed_keys=true, max_concurrent_shard_requests=5, ignore_unavailable=true, expand_wildcards=open,closed, allow_no_indices=true, index=abc__events-2023.10.18, search_type=query_then_fetch, batched_reduce_size=512}
[2023-10-18T08:48:01,088][WARN ][r.suppressed             ] [elastic-node-eastus2-3-vm-0] path: /abc__events-2023.10.18/_search, params: {typed_keys=true, max_concurrent_shard_requests=5, ignore_unavailable=true, expand_wildcards=open,closed, allow_no_indices=true, index=abc__events-2023.10.18, search_type=query_then_fetch, batched_reduce_size=512}
[2023-10-18T08:48:27,848][INFO ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[abc__events-2023.10.18][0]]]).

Shards for the index when I checked later:

index                  shard prirep state      docs   store ip            node
abc__events-2023.10.18 0     r      STARTED 2236914   694mb 192.168.XX.XX elastic-node-eastus2-3-vm-0
abc__events-2023.10.18 0     p      STARTED 2236914 691.6mb 192.168.XX.XX elastic-node-central-1-vm-0
abc__events-2023.10.18 0     r      STARTED 2236914 691.6mb 192.168.XX.XX elastic-node-central-3-vm-0
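
For anyone who wants to check the "6-8 minutes" figure against the log, here is a minimal Python sketch that measures the time spent in RED from the two health-change lines in the excerpt above. The function name `red_intervals` and the regular expression are illustrative assumptions, not anything Elasticsearch provides.

```python
import re
from datetime import datetime

# Health-change lines copied verbatim from the log excerpt above.
LOG_LINES = [
    "[2023-10-18T08:41:31,032][INFO ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] Cluster health status changed from [YELLOW] to [RED] (reason: [shards failed [[abc__events-2023.10.18][0], [abc__events-2023.10.18][0]]]).",
    "[2023-10-18T08:48:27,848][INFO ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[abc__events-2023.10.18][0]]]).",
]

# Hypothetical pattern: leading timestamp plus the old and new health status.
HEALTH_CHANGE = re.compile(
    r"^\[(?P<ts>[^\]]+)\].*Cluster health status changed "
    r"from \[(?P<old>\w+)\] to \[(?P<new>\w+)\]"
)


def red_intervals(lines):
    """Return (start, end) datetime pairs for each period spent in RED."""
    intervals = []
    red_since = None
    for line in lines:
        match = HEALTH_CHANGE.search(line)
        if not match:
            continue  # not a health-change line
        ts = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S,%f")
        if match["new"] == "RED":
            red_since = ts
        elif match["old"] == "RED" and red_since is not None:
            intervals.append((red_since, ts))
            red_since = None
    return intervals


for start, end in red_intervals(LOG_LINES):
    seconds = (end - start).total_seconds()
    print(f"RED from {start} to {end}: {seconds:.3f} s")
```

For these two lines the interval is 416.816 s, just under 7 minutes, which is consistent with the 6-8 minutes reported.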

Christian_Dahlqvist · October 20, 2023, 7:15am

Which version of Elasticsearch are you using?

jaykb77 · October 20, 2023, 9:35am

7.17.8

Christian_Dahlqvist · October 20, 2023, 9:37am

What is the full output of the cluster stats API?

jaykb77 · October 20, 2023, 11:14am

{
  "_nodes" : {
    "total" : 8,
    "successful" : 8,
    "failed" : 0
  },
  "cluster_name" : "env-elastic",
  "cluster_uuid" : "VODX2vTFRQKKqgXXXXX",
  "timestamp" : 1697800268559,
  "status" : "green",
  "indices" : {
    "count" : 251,
    "shards" : {
      "total" : 592,
      "primaries" : 251,
      "replication" : 1.3585657370517927,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 3,
          "avg" : 2.358565737051793
        },
        "primaries" : {
          "min" : 1,
          "max" : 1,
          "avg" : 1.0
        },
        "replication" : {
          "min" : 1.0,
          "max" : 2.0,
          "avg" : 1.3585657370517927
        }
      }
    },
    "docs" : {
      "count" : 229304320,
      "deleted" : 162439
    },
    "store" : {
      "size_in_bytes" : 228179253656,
      "total_data_set_size_in_bytes" : 228179253656,
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size_in_bytes" : 234912,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 540185560,
      "total_count" : 245686371,
      "hit_count" : 40125053,
      "miss_count" : 205561318,
      "cache_size" : 463451,
      "cache_count" : 1215293,
      "evictions" : 751842
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 6805,
      "memory_in_bytes" : 21493414,
      "terms_memory_in_bytes" : 10709648,
      "stored_fields_memory_in_bytes" : 4493400,
      "term_vectors_memory_in_bytes" : 141152,
      "norms_memory_in_bytes" : 510144,
      "points_memory_in_bytes" : 0,
      "doc_values_memory_in_bytes" : 5639070,
      "index_writer_memory_in_bytes" : 852264708,
      "version_map_memory_in_bytes" : 85683332,
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : 1697618907933,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "binary",
          "count" : 30,
          "index_count" : 30,
          "script_count" : 0
        },
        {
          "name" : "boolean",
          "count" : 128,
          "index_count" : 80,
          "script_count" : 0
        },
        {
          "name" : "constant_keyword",
          "count" : 6,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "date",
          "count" : 236,
          "index_count" : 176,
          "script_count" : 0
        },
        {
          "name" : "integer",
          "count" : 34,
          "index_count" : 34,
          "script_count" : 0
        },
        {
          "name" : "ip",
          "count" : 14,
          "index_count" : 14,
          "script_count" : 0
        },
        {
          "name" : "keyword",
          "count" : 1350,
          "index_count" : 249,
          "script_count" : 0
        },
        {
          "name" : "long",
          "count" : 212,
          "index_count" : 107,
          "script_count" : 0
        },
        {
          "name" : "object",
          "count" : 71,
          "index_count" : 51,
          "script_count" : 0
        },
        {
          "name" : "text",
          "count" : 943,
          "index_count" : 138,
          "script_count" : 0
        }
      ],
      "runtime_field_types" : [
        {
          "name" : "date",
          "count" : 94,
          "index_count" : 10,
          "scriptless_count" : 94,
          "shadowed_count" : 0,
          "lang" : [ ],
          "lines_max" : 0,
          "lines_total" : 0,
          "chars_max" : 0,
          "chars_total" : 0,
          "source_max" : 0,
          "source_total" : 0,
          "doc_max" : 0,
          "doc_total" : 0
        },
        {
          "name" : "keyword",
          "count" : 20701,
          "index_count" : 12,
          "scriptless_count" : 20701,
          "shadowed_count" : 0,
          "lang" : [ ],
          "lines_max" : 0,
          "lines_total" : 0,
          "chars_max" : 0,
          "chars_total" : 0,
          "source_max" : 0,
          "source_total" : 0,
          "doc_max" : 0,
          "doc_total" : 0
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [
        {
          "name" : "edgeNGram",
          "count" : 59,
          "index_count" : 59
        },
        {
          "name" : "edge_ngram",
          "count" : 18,
          "index_count" : 18
        },
        {
          "name" : "nGram",
          "count" : 11,
          "index_count" : 11
        },
        {
          "name" : "ngram",
          "count" : 12,
          "index_count" : 12
        },
        {
          "name" : "pattern_capture",
          "count" : 25,
          "index_count" : 25
        },
        {
          "name" : "pattern_replace",
          "count" : 33,
          "index_count" : 25
        },
        {
          "name" : "stemmer",
          "count" : 44,
          "index_count" : 44
        },
        {
          "name" : "stop",
          "count" : 224,
          "index_count" : 127
        },
        {
          "name" : "word_delimiter",
          "count" : 53,
          "index_count" : 53
        },
        {
          "name" : "word_delimiter_graph",
          "count" : 21,
          "index_count" : 21
        }
      ],
      "analyzer_types" : [
        {
          "name" : "custom",
          "count" : 393,
          "index_count" : 127
        },
        {
          "name" : "pattern",
          "count" : 24,
          "index_count" : 24
        }
      ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [
        {
          "name" : "classic",
          "count" : 12,
          "index_count" : 12
        },
        {
          "name" : "standard",
          "count" : 151,
          "index_count" : 79
        },
        {
          "name" : "whitespace",
          "count" : 230,
          "index_count" : 75
        }
      ],
      "built_in_filters" : [
        {
          "name" : "cjk_bigram",
          "count" : 8,
          "index_count" : 8
        },
        {
          "name" : "cjk_width",
          "count" : 8,
          "index_count" : 8
        },
        {
          "name" : "lowercase",
          "count" : 393,
          "index_count" : 127
        }
      ],
      "built_in_analyzers" : [ ]
    },
    "versions" : [
      {
        "version" : "7.5.2",
        "index_count" : 97,
        "primary_shard_count" : 97,
        "total_primary_bytes" : 3133320734
      },
      {
        "version" : "7.17.0",
        "index_count" : 154,
        "primary_shard_count" : 154,
        "total_primary_bytes" : 79815253712
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 8,
      "coordinating_only" : 1,
      "data" : 6,
      "data_cold" : 6,
      "data_content" : 6,
      "data_frozen" : 6,
      "data_hot" : 6,
      "data_warm" : 6,
      "ingest" : 6,
      "master" : 7,
      "ml" : 6,
      "remote_cluster_client" : 6,
      "transform" : 6,
      "voting_only" : 1
    },
    "versions" : [
      "7.17.0"
    ],
    "os" : {
      "available_processors" : 52,
      "allocated_processors" : 52,
      "names" : [
        {
          "name" : "Linux",
          "count" : 8
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Ubuntu 20.04.5 LTS",
          "count" : 8
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 8
        }
      ],
      "mem" : {
        "total_in_bytes" : 218623082496,
        "free_in_bytes" : 4690100224,
        "used_in_bytes" : 213932982272,
        "free_percent" : 2,
        "used_percent" : 98
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 2
      },
      "open_file_descriptors" : {
        "min" : 504,
        "max" : 1543,
        "avg" : 1184
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 6310232868,
      "versions" : [
        {
          "version" : "17.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "17.0.1+12",
          "vm_vendor" : "Eclipse Adoptium",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 8
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 46620832296,
        "heap_max_in_bytes" : 111669149696
      },
      "threads" : 685
    },
    "fs" : {
      "total_in_bytes" : 6401678557184,
      "free_in_bytes" : 6171263709184,
      "available_in_bytes" : 5845571399680
    },
    "plugins" : [
      {
        "name" : "repository-azure",
        "version" : "7.17.0",
        "elasticsearch_version" : "7.17.0",
        "java_version" : "1.8",
        "description" : "The Azure Repository plugin adds support for Azure storage repositories.",
        "classname" : "org.elasticsearch.repositories.azure.AzureRepositoryPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false,
        "licensed" : false,
        "type" : "isolated"
      },
      {
        "name" : "ingest-attachment",
        "version" : "7.17.0",
        "elasticsearch_version" : "7.17.0",
        "java_version" : "1.8",
        "description" : "Ingest processor that uses Apache Tika to extract contents",
        "classname" : "org.elasticsearch.ingest.attachment.IngestAttachmentPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false,
        "licensed" : false,
        "type" : "isolated"
      }
    ],
    "network_types" : {
      "transport_types" : {
        "security4" : 8
      },
      "http_types" : {
        "security4" : 8
      }
    },
    "discovery_types" : {
      "zen" : 8
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "deb",
        "count" : 8
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 3,
      "processor_stats" : {
        "attachment" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        }
      }
    }
  }
}
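
As a quick sanity check on the shard counts in the output above, the `replication` and `avg` values can be reproduced from `total` and `primaries`. This is a small Python sketch; the split into one-replica and two-replica indices assumes every index has exactly one or two replicas, which is what the reported min of 1.0 and max of 2.0 suggest.

```python
# Values copied from the "indices" section of the cluster stats output above.
index_count = 251
total_shards = 592   # primaries + replicas
primaries = 251

replicas = total_shards - primaries
replication = replicas / primaries               # replica shards per primary
avg_shards_per_index = total_shards / index_count

# Assumption: each index has either 1 or 2 replicas (min 1.0, max 2.0 above).
# With x two-replica indices: 2*x + (index_count - x) = replicas.
two_replica_indices = replicas - index_count
one_replica_indices = index_count - two_replica_indices

print(f"replica shards:           {replicas}")
print(f"replication:              {replication:.4f}")
print(f"average shards per index: {avg_shards_per_index:.4f}")
print(f"indices with 2 replicas:  {two_replica_indices}")
print(f"indices with 1 replica:   {one_replica_indices}")
```

Under that assumption, 90 of the 251 indices have two replicas and 161 have one.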

DavidTurner · October 20, 2023, 1:36pm

I think this is a bug and opened elastic/elasticsearch issue #101180, "Tragic failure of primary marks replicas as stale".

Could you share a complete set of logs, including stack traces, for about 5 minutes either side of those failed shard log messages at 2023-10-18T08:41:30? That'd help us pin down the details more easily.
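
One way to cut such a window out of a large log file is sketched below in Python. It assumes the log format shown earlier in this thread, where each entry starts with a bracketed timestamp and stack-trace lines carry no timestamp; the helper name `lines_in_window` is made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Leading timestamp as it appears in the logs in this thread.
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})\]")


def lines_in_window(lines, center, minutes=5):
    """Yield log lines within +/- `minutes` of `center`.

    Lines without a leading timestamp (stack traces) follow the decision
    made for the most recent timestamped line, so traces stay attached
    to the entry they belong to.
    """
    window = timedelta(minutes=minutes)
    keep = False
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S,%f")
            keep = abs(ts - center) <= window
        if keep:
            yield line


# Three lines copied from the first post; only the middle one is in range.
SAMPLE = [
    "[2023-10-18T00:00:33,364][INFO ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[abc__events-2023.10.18][0]]]).",
    "[2023-10-18T08:41:31,032][INFO ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] Cluster health status changed from [YELLOW] to [RED] (reason: [shards failed [[abc__events-2023.10.18][0], [abc__events-2023.10.18][0]]]).",
    "[2023-10-18T08:48:27,848][INFO ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[abc__events-2023.10.18][0]]]).",
]

center = datetime(2023, 10, 18, 8, 41, 30)
for line in lines_in_window(SAMPLE, center):
    print(line)
```

To run it over a real file, pass the open file object as `lines` (stripping the trailing newline from each line if needed) and write the yielded lines to a new file.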

jaykb77 · October 21, 2023, 2:33pm

The logs are too long to post. Is there a way to attach a file here?

DavidTurner · October 21, 2023, 2:45pm

Can you share them using https://gist.github.com/?

jaykb77 · October 21, 2023, 4:19pm

Please find them here:

https://gist.github.com/Jaykb77/978e4c95090249cf27cb37d0b9e23e83

env-elastic-2023-10-18.log

[2023-10-18T08:38:41,904][INFO ][o.e.c.s.MasterService    ] [env-eastus2-elastic-masters-eastus2-3-vm-0] node-join[{env-centralus-elastic-voting-central-1-vm-0}{mrLmlqrgSPmBACpmc7PjxQ}{OX0f-wWCQiq3ahcm-mAYsg}{192.168.XX.XX}{192.168.XX.XX:9300} join existing leader], term: 14, version: 142420, delta: added {{env-centralus-elastic-voting-central-1-vm-0}{mrLmlqrgSPmBACpmc7PjxQ}{OX0f-wWCQiq3ahcm-mAYsg}{192.168.XX.XX}{192.168.XX.XX:9300}}
[2023-10-18T08:38:44,207][INFO ][o.e.c.s.ClusterApplierService] [env-eastus2-elastic-masters-eastus2-3-vm-0] added {{env-centralus-elastic-voting-central-1-vm-0}{mrLmlqrgSPmBACpmc7PjxQ}{OX0f-wWCQiq3ahcm-mAYsg}{192.168.XX.XX}{192.168.XX.XX:9300}}, term: 14, version: 142420, reason: Publication{term=14, version=142420}
[2023-10-18T08:40:23,916][INFO ][o.e.x.s.a.TransportPutSnapshotLifecycleAction] [env-eastus2-elastic-masters-eastus2-3-vm-0] updating existing snapshot lifecycle [env-elastic-slm1]
[2023-10-18T08:40:24,097][INFO ][o.e.x.s.SnapshotLifecycleService] [env-eastus2-elastic-masters-eastus2-3-vm-0] rescheduling updated snapshot lifecycle job [env-elastic-slm1-2]
[2023-10-18T08:40:51,293][INFO ][o.e.c.m.MetadataMappingService] [env-eastus2-elastic-masters-eastus2-3-vm-0] [index_1_16.10.2023/VS9pb6J3SjqjvHduH3ENxQ] update_mapping [_doc]
[2023-10-18T08:41:14,746][INFO ][o.e.c.m.MetadataMappingService] [env-eastus2-elastic-masters-eastus2-3-vm-0] [index_1_16.10.2023/VS9pb6J3SjqjvHduH3ENxQ] update_mapping [_doc]
[2023-10-18T08:41:29,567][WARN ][o.e.c.r.a.AllocationService] [env-eastus2-elastic-masters-eastus2-3-vm-0] failing shard [failed shard, shard [index_1_16.10.2023][0], node[TN3qE5YWRMaji3pWgdVq4Q], [R], s[STARTED], a[id=WPifAxMlR5Sa5dmFimypVA], message [failed to perform indices:data/write/bulk[s] on replica [index_1_16.10.2023][0], node[TN3qE5YWRMaji3pWgdVq4Q], [R], s[STARTED], a[id=WPifAxMlR5Sa5dmFimypVA]], failure [RemoteTransportException[[env-centralus-elastic-masters-central-1-vm-0][192.168.XX.XX:9300][indices:data/write/bulk[s][r]]]; nested: IOException[Read-only file system]; ], markAsStale [true]]
org.elasticsearch.transport.RemoteTransportException: [env-centralus-elastic-masters-central-1-vm-0][192.168.XX.XX:9300][indices:data/write/bulk[s][r]]
Caused by: java.io.IOException: Read-only file system
	at sun.nio.ch.FileDispatcherImpl.force0(Native Method) ~[?:?]

This file has been truncated.

jaykb77 · October 21, 2023, 4:23pm

@DavidTurner It looks like the file was still truncated.

DavidTurner · October 21, 2023, 6:46pm

Thanks, that covers it. Unfortunately the stack trace is not that useful, but it's better than nothing. I've added the relevant info to the GitHub issue.

system · November 18, 2023, 6:47pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
