Hi all,
I ran into an unusual issue in our cluster: one of the indices, configured with 1 primary and 2 replicas (1p:2r), turned red when the node holding the primary shard went down. By the time I checked, the node was already back in the cluster and the status had returned to green. The cluster stayed in a red state for at least 6-8 minutes and, as far as I understand, only went back to green because the downed node rejoined.
I have added the logs related to the index (abc__events-2023.10.18) below. The problem appears to start around "2023-10-18T08:41:30,766".
Can anyone suggest why the index went red when it should only have gone yellow, given that replica copies were available on other nodes?
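For context, 1p:2r means the index is created from the events-template with one primary and two replica shards (matching shards [1]/[2] in the create-index log line below). A rough sketch of the relevant settings, assuming a composable index template (the exact template format depends on the Elasticsearch version, and the index pattern shown here is my guess based on the index name):

PUT _index_template/events-template
{
  "index_patterns": ["abc__events-*"],
  "template": {
    "settings": {
      "index.number_of_shards": 1,
      "index.number_of_replicas": 2
    }
  }
}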
[2023-10-18T00:00:32,140][INFO ][o.e.c.m.MetadataCreateIndexService] [elastic-node-eastus2-3-vm-0] [abc__events-2023.10.18] creating index, cause [auto(bulk api)], templates [events-template], shards [1]/[2]
[2023-10-18T00:00:33,364][INFO ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[abc__events-2023.10.18][0]]]).
[2023-10-18T08:41:30,766][WARN ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] failing shard [failed shard, shard [abc__events-2023.10.18][0], node[Jk9Y9EtDSmS_XXXXXX], [R], s[STARTED], a[id=gfggfYtfR7ajKXd4kj73Gg], message [failed to perform indices:data/write/bulk[s] on replica [abc__events-2023.10.18][0], node[Jk9Y9EtDSmS_XXXXXX], [R], s[STARTED], a[id=gfggfYtfR7ajKXd4kj73Gg]], failure [IndexShardClosedException[CurrentState[CLOSED] Primary closed.]], markAsStale [true]]
[2023-10-18T08:41:31,023][WARN ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] failing shard [failed shard, shard [abc__events-2023.10.18][0], node[pA_ggBT3Q9mmUdXXXX], [R], s[STARTED], a[id=tH7xi2bAQAi1ce2PwJ5ylQ], message [failed to perform indices:data/write/bulk[s] on replica [abc__events-2023.10.18][0], node[pA_ggBT3Q9mmUdXXXX], [R], s[STARTED], a[id=tH7xi2bAQAi1ce2PwJ5ylQ]], failure [IndexShardClosedException[CurrentState[CLOSED] Primary closed.]], markAsStale [true]]
[2023-10-18T08:41:31,024][WARN ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] failing shard [failed shard, shard [abc__events-2023.10.18][0], node[TN3qE5YWRMaji3pWXXXX], [P], s[STARTED], a[id=lCe4tac4Q-q8pl0gRMTETw], message [shard failure, reason [already closed by tragic event on the translog]], failure [IOException[Read-only file system]], markAsStale [true]]
[2023-10-18T08:41:31,032][INFO ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] Cluster health status changed from [YELLOW] to [RED] (reason: [shards failed [[abc__events-2023.10.18][0], [abc__events-2023.10.18][0]]]).
[2023-10-18T08:42:00,633][WARN ][r.suppressed ] [elastic-node-eastus2-3-vm-0] path: /abc__events-2023.10.18/_search, params: {typed_keys=true, max_concurrent_shard_requests=5, ignore_unavailable=true, expand_wildcards=open,closed, allow_no_indices=true, index=abc__events-2023.10.18, search_type=query_then_fetch, batched_reduce_size=512}
[2023-10-18T08:43:01,869][WARN ][r.suppressed ] [elastic-node-eastus2-3-vm-0] path: /abc__events-2023.10.18/_search, params: {typed_keys=true, max_concurrent_shard_requests=5, ignore_unavailable=true, expand_wildcards=open,closed, allow_no_indices=true, index=abc__events-2023.10.18, search_type=query_then_fetch, batched_reduce_size=512}
[2023-10-18T08:43:41,014][WARN ][r.suppressed ] [elastic-node-eastus2-3-vm-0] path: /abc__events-2023.10.18/_search, params: {typed_keys=true, max_concurrent_shard_requests=5, ignore_unavailable=true, expand_wildcards=open,closed, allow_no_indices=true, index=abc__events-2023.10.18, search_type=query_then_fetch, batched_reduce_size=512}
[2023-10-18T08:47:01,243][WARN ][r.suppressed ] [elastic-node-eastus2-3-vm-0] path: /abc__events-2023.10.18/_search, params: {typed_keys=true, max_concurrent_shard_requests=5, ignore_unavailable=true, expand_wildcards=open,closed, allow_no_indices=true, index=abc__events-2023.10.18, search_type=query_then_fetch, batched_reduce_size=512}
[2023-10-18T08:47:14,527][WARN ][r.suppressed ] [elastic-node-eastus2-3-vm-0] path: /abc__events-2023.10.18/_search, params: {typed_keys=true, max_concurrent_shard_requests=5, ignore_unavailable=true, expand_wildcards=open,closed, allow_no_indices=true, index=abc__events-2023.10.18, search_type=query_then_fetch, batched_reduce_size=512}
[2023-10-18T08:48:01,088][WARN ][r.suppressed ] [elastic-node-eastus2-3-vm-0] path: /abc__events-2023.10.18/_search, params: {typed_keys=true, max_concurrent_shard_requests=5, ignore_unavailable=true, expand_wildcards=open,closed, allow_no_indices=true, index=abc__events-2023.10.18, search_type=query_then_fetch, batched_reduce_size=512}
[2023-10-18T08:48:27,848][INFO ][o.e.c.r.a.AllocationService] [elastic-node-eastus2-3-vm-0] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[abc__events-2023.10.18][0]]]).
# Shards for the index when I checked later
index                  shard prirep state   docs    store   ip            node
abc__events-2023.10.18 0     r      STARTED 2236914 694mb   192.168.XX.XX elastic-node-eastus2-3-vm-0
abc__events-2023.10.18 0     p      STARTED 2236914 691.6mb 192.168.XX.XX elastic-node-central-1-vm-0
abc__events-2023.10.18 0     r      STARTED 2236914 691.6mb 192.168.XX.XX elastic-node-central-3-vm-0
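For reference, the shard listing above is from the cat shards API. Next time this happens I plan to also capture the allocation explanation for the primary while the cluster is still red; roughly something like the following (both are standard APIs, nothing here is specific to our setup beyond the index name):

# current shard copies for the index
GET _cat/shards/abc__events-2023.10.18?v&h=index,shard,prirep,state,docs,store,ip,node

# why the primary of shard 0 is unassigned (run while the cluster is red)
GET _cluster/allocation/explain
{
  "index": "abc__events-2023.10.18",
  "shard": 0,
  "primary": true
}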