Readiness check for Elasticsearch coordinator nodes behind a load balancer?

Hello,

I am trying to choose the right readiness check for Elasticsearch coordinator nodes behind a load balancer / proxy.

We are on Elasticsearch 8.8.2. The coordinator nodes serve search traffic behind a load balancer. The current health check is shallow: TCP on port 9200 and/or GET /.

I found older guidance suggesting that GET / is enough for a load balancer health check. In my case, that seems true for liveness, but not necessarily for readiness. What I observed during node restart / bootstrap is:

  • the node starts accepting TCP on :9200
  • GET / returns 200
  • but for a short window, real search requests can still fail with:
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];

So my question is: what is the recommended coordinator node-local readiness check if I want to avoid routing live search traffic to a node until it has rejoined / recovered enough to serve /_search successfully?
I do not want a cluster-wide check that could mark all nodes unhealthy just because the cluster is yellow/red elsewhere.
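
One node-local option, offered as a sketch under assumptions rather than an official recommendation: ask the node for its own copy of the cluster blocks with GET /_cluster/state/blocks?local=true, and treat the node as not ready while global block 1 (the "state not recovered / initialized" block named in the exception above) is present. The function below only interprets a response body that has already been fetched. The response shape and the helper name is_ready_from_blocks are assumptions to verify against a real 8.8.2 node, and with security enabled the probe would need credentials.

```python
import json

# Global block id taken from the exception text:
# SERVICE_UNAVAILABLE/1/state not recovered / initialized
STATE_NOT_RECOVERED_BLOCK_ID = "1"


def is_ready_from_blocks(body: str) -> bool:
    """Return True if the node-local cluster state has no 'state not recovered' block.

    `body` is assumed to be the JSON text returned by
    GET /_cluster/state/blocks?local=true, with the assumed shape
    {"blocks": {"global": {"1": {...}}}} while the block is in place.
    """
    try:
        state = json.loads(body)
    except json.JSONDecodeError:
        # An unparseable body is treated as "not ready" rather than guessed at.
        return False
    if not isinstance(state, dict):
        return False
    blocks = state.get("blocks") or {}
    global_blocks = blocks.get("global") or {}
    return STATE_NOT_RECOVERED_BLOCK_ID not in global_blocks
```

A small sidecar or script could expose this result over HTTP for the load balancer to poll. Whether other global blocks should also count as "not ready" is a judgment call this sketch does not make.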

Welcome to the forum @sagar_cenation

I don't know the best "readiness" check logic/settings for your LB. Hopefully someone else can advise there.

yellow and red are completely different animals. red is ... bad.

yellow has a specific meaning: at least one index has a shard that is currently missing a replica. But all indices should be writeable/searchable if the cluster state is yellow, as (e.g.) happens naturally while a rolling restart is ongoing.

org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];

As far as I knew, that exception should not appear when the cluster state is green or yellow. Am I just wrong on this? Could node N generate this exception even after node M has returned "cluster is yellow/green"? I am happy to be corrected if so, every day is a school day!

I note in passing that the /_cluster/health endpoint has a ?wait_for_status=X argument, where X can be green or yellow, and there is also a per-index call, /_cluster/health/index-name. Either should only return a 200 once the desired state or better has been reached.
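
For reference, a minimal sketch of how that call could be built. The base URL, the 5s timeout, and the helper name health_url are placeholders, not anything from this thread. Note that this is a cluster-wide signal, which is the kind of check the original question wants to avoid.

```python
from urllib.parse import urlencode


def health_url(base: str, wait_for_status: str, timeout: str = "5s", index: str = "") -> str:
    """Build a /_cluster/health URL that waits for the given status.

    `index` is optional; when set, the per-index form of the endpoint is used.
    """
    if wait_for_status not in ("green", "yellow", "red"):
        raise ValueError("wait_for_status must be green, yellow or red")
    path = "/_cluster/health" + (f"/{index}" if index else "")
    query = urlencode({"wait_for_status": wait_for_status, "timeout": timeout})
    return f"{base.rstrip('/')}{path}?{query}"
```

For example, health_url("http://localhost:9200", "yellow") yields a URL that can be handed to any HTTP client; the caller then decides what a non-200 response means for routing.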

Thanks for getting back to me.
That exception occurred not during a cluster status change but during the sudden addition of multiple router nodes.

This is the current scenario:
The clients connect to the coordinator nodes via Envoy, which uses an HTTP health check (GET / on port 9200) to decide whether the nodes are healthy enough to serve traffic. What can we update here?

Thanks for clarifying.

As I said, I am not able to confirm your best options with elastic behind your (envoy) load balancer, and even if it were haproxy or another LB I still would not know, especially given sudden and significant changes in cluster topology. You are certainly right that a GET to / returning 200 does not mean "ready for indexing/querying/....".

I hope someone else can assist, and I wish you luck. Grateful for a well written problem description too.

I came across an interesting change: "Introduce an unauthenticated endpoint for readiness checks" by grcevski (Pull Request #84375, elastic/elasticsearch).

I wonder if we could use this as a readiness check.

Also Kevin, how do you guys manage the readiness check in your clusters?

At startup this doesn't really do much more than checking it's listening over HTTP - the two things start at pretty much the same time:

This is after waiting for discovery.initial_state_timeout (defaults 30s) to find the elected master and join the cluster. Are your nodes taking longer than that?
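
One way to answer that from the node's own log is to look for the warning Elasticsearch writes when it gives up waiting for the initial state. The sketch below is an illustration only: the marker string is the warning text that appears in the startup log posted further down in this thread, and the function name and timestamp pattern are assumptions based on the log format shown there.

```python
import re
from datetime import datetime

# Leading timestamp as it appears in the logs in this thread, e.g. [2026-05-29T12:08:28,523]
TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}),\d{3}\]")
TIMEOUT_MARKER = "timed out while waiting for initial discovery state"


def discovery_timeouts(lines):
    """Return the timestamps (to the second) of lines reporting a discovery timeout."""
    hits = []
    for line in lines:
        if TIMEOUT_MARKER not in line:
            continue
        match = TS.match(line)
        if match:
            hits.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S"))
    return hits
```

If this returns anything for a startup, the node opened HTTP after the timeout expired rather than after it had joined the cluster.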

The nodes find the master quickly (well under 30s). The gap is roughly 10–13s between HTTP becoming available and the STATE_NOT_RECOVERED/1 block being cleared — so by the time GET / returns 200, the node is still mid-recovery.

What health check endpoint would you recommend to gate traffic until that block is cleared?

That's very surprising. Are you sure you're just adding a node to an existing running cluster?

After digging deeper into the logs, I should refine my earlier description.

This was not a simple case of adding a coordinator to an otherwise healthy running cluster. It happened during a broader bootstrap/restart event where multiple router nodes were joining/rejoining around the same time.

What the logs do show clearly is this sequence on the clearest node:

  1. Our load balancer health signal flipped healthy at about 12:08:29
  2. first live /_search 503 at about 12:08:32
    1. matching ES error: ClusterBlockException: blocked by [SERVICE_UNAVAILABLE/1/state not recovered / initialized]
  3. the master logs showed the node rejoining at about 12:08:42
  4. first successful /_search at about 12:08:45
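
To put numbers on that sequence, here is a small worked calculation using the approximate times listed above. The event names are labels invented for this sketch.

```python
from datetime import datetime

FMT = "%H:%M:%S"

# Approximate times taken from the sequence above.
EVENTS = {
    "lb_healthy": "12:08:29",
    "first_503": "12:08:32",
    "master_logged_rejoin": "12:08:42",
    "first_200": "12:08:45",
}


def seconds_between(start: str, end: str) -> int:
    """Whole seconds from `start` to `end`, both given as HH:MM:SS on the same day."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


def windows(events: dict) -> dict:
    """Seconds from the LB marking the node healthy to each later event."""
    start = events["lb_healthy"]
    return {
        "healthy_to_first_503": seconds_between(start, events["first_503"]),
        "healthy_to_rejoin": seconds_between(start, events["master_logged_rejoin"]),
        "healthy_to_first_200": seconds_between(start, events["first_200"]),
    }


if __name__ == "__main__":
    for name, secs in windows(EVENTS).items():
        print(f"{name}: {secs}s")
```

With these times the node was marked healthy about 3s before the first 503, about 13s before the master logged the rejoin, and about 16s before the first successful search.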

So traffic was admitted to newly provisioned coordinator nodes before their local ES process had fully rejoined cluster state. These nodes had also rebooted during bootstrap. This was not isolated to one node; we saw the same general sequence on 5 affected router nodes during the same event.

Given that, for coordinator nodes, what would be the best practical LB readiness signal here? Would readiness.port still be better than GET / in this kind of case?
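
For anyone who wants to experiment with it: as far as I understand the linked PR, readiness.port opens a plain TCP listener rather than an HTTP endpoint, so the probe would be a TCP connect. The sketch below assumes that behaviour, and the port number in the usage note is hypothetical. Whether the listener opens only after the "state not recovered" block clears on 8.8.2 is not established in this thread, and an earlier reply suggests it tracks HTTP startup closely, so this needs testing before relying on it.

```python
import socket


def tcp_port_open(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False
```

A check such as tcp_port_open("127.0.0.1", 9399) could then be compared against real /_search results during a test restart, to see whether the two signals flip at the same moment.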

You're describing the situation in a very confusing way. I think it is much deeper than just some nodes "joining/rejoining" (i.e. to an existing and healthy cluster). I'm pretty sure you're bringing the whole cluster up from a cold start. If so, you will experience failures until GET _cluster/health reports green.

Thanks. One correction: this was not a cold start of the whole cluster. The production cluster was already live and serving traffic.

What happened was a scale-out/bootstrap of multiple new coordinator/router nodes, and some of those new nodes were admitted by the LB before their local ES process had recovered enough cluster state to serve `/_search`. So the failure mode here was node-local bootstrap/readiness on new coordinators, not cluster-wide recovery to green.

OK, if so, this is surprising. When an Elasticsearch process starts, it waits to receive and apply the latest cluster state before it even starts to listen for requests over HTTP.

What version are you running? (edit: nvm I saw "We are on Elasticsearch 8.8.2" in the OP)

Can you share the logs from startup until the first successful search at 12:08:45?

Thanks. We are running Elasticsearch 8.8.2.
I cannot post the raw export publicly because it contains internal details.
Below is a sanitized excerpt from one affected coordinating node. Please note that the final 200 line comes from a separate access-log stream for the same node.

# Startup on this coordinating node (boot relevant to LB timing)
[2026-05-29T12:07:48,750][INFO ][o.e.n.Node               ] [coord-node] version[8.8.2], pid[3259], build[tar/98e1271edf932a480e4262a471281f1ee295ce6b/2023-06-26T05:16:16.196344851Z], OS[Linux/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/20.0.1/20.0.1+9-29]
[2026-05-29T12:07:58,423][INFO ][o.e.t.TransportService   ] [coord-node] publish_address {<PRIVATE_IP>:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}, {<PRIVATE_IP>:9300}
[2026-05-29T12:08:28,534][INFO ][o.e.h.AbstractHttpServerTransport] [coord-node] publish_address {<PRIVATE_IP>:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}, {<PRIVATE_IP>:9200}
[2026-05-29T12:08:28,535][INFO ][o.e.n.Node               ] [coord-node] started {[coord-node]}{<NODE_ID>}{<EPHEMERAL_ID>}{coord-node}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{r}{8.8.2}{<NODE_ATTRS>}

# Gap: HTTP is already bound and Node logged "started", but client searches still hit cluster read block
[2026-05-29T12:08:32,104][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: {allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];

[2026-05-29T12:08:41,139][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];

[2026-05-29T12:08:42,132][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];

[2026-05-29T12:08:43,764][INFO ][o.e.c.s.ClusterApplierService] [coord-node] added {{other-router}{<NODE_ID>}{<EPHEMERAL_ID>}{other-router}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{r}{8.8.2}}, term: <TERM>, version: <VERSION>, reason: ApplyCommitRequest{term=<TERM>, version=<VERSION>, sourceNode={master-node}{<NODE_ID>}{<EPHEMERAL_ID>}{master-node}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{m}{8.8.2}{<NODE_ATTRS>}}

# From our edge/access path for the same node class (separate pipeline; same incident window):
[2026-05-29T12:08:45,184Z][INFO ][sanitized-access-log    ] [coord-node] POST /_search -> 200, upstream_host=127.0.0.1:9200

Sorry you've elided too many messages, this isn't helpful. Can you share all the messages, redacting as little as possible?

Please find the logs attached:

[2026-05-29T12:06:57,164][INFO ][o.e.n.Node               ] [coord-node] version[8.8.2], pid[16278], build[tar/98e1271edf932a480e4262a471281f1ee295ce6b/2023-06-26T05:16:16.196344851Z], OS[Linux/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/20.0.1/20.0.1+9-29]
[2026-05-29T12:07:01,652][WARN ][stderr                   ] [coord-node] May 29, 2026 12:07:01 PM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
[2026-05-29T12:07:01,659][INFO ][o.e.e.NodeEnvironment    ] [coord-node] using [1] data paths, mounts [[/esdata (/dev/mapper/es-data-lv)]], net usable_space [185.7gb], net total_space [191.8gb], types [ext4]
[2026-05-29T12:07:01,660][INFO ][o.e.e.NodeEnvironment    ] [coord-node] heap size [31gb], compressed ordinary object pointers [true]
[2026-05-29T12:07:01,670][INFO ][o.e.n.Node               ] [coord-node] node name [coord-node], node ID [<ID>], cluster name [<CLUSTER_NAME>], roles [remote_cluster_client]
[2026-05-29T12:07:03,247][INFO ][o.e.x.s.Security         ] [coord-node] Security is enabled
[2026-05-29T12:07:03,817][INFO ][o.e.x.p.ProfilingPlugin  ] [coord-node] Profiling is enabled
[2026-05-29T12:07:04,186][INFO ][o.e.t.n.NettyAllocator   ] [coord-node] creating NettyAllocator with the following configs: [ name=elasticsearch_configured, chunk_size=1mb, factors={"false"=>true}, suggested_max_allocation_size=1mb
[2026-05-29T12:07:04,203][INFO ][o.e.i.r.RecoverySettings ] [coord-node] using rate limit [100mb] with [ default=100mb, max=0b, read=0b, write=0b
[2026-05-29T12:07:04,232][INFO ][o.e.d.DiscoveryModule    ] [coord-node] using discovery type [multi-node] and seed hosts providers [settings, ec2]
[2026-05-29T12:07:04,870][INFO ][o.e.n.Node               ] [coord-node] initialized
[2026-05-29T12:07:04,871][INFO ][o.e.n.Node               ] [coord-node] starting ...
[2026-05-29T12:07:04,877][INFO ][o.e.x.d.l.DeprecationIndexingComponent] [coord-node] deprecation component started
[2026-05-29T12:07:04,948][INFO ][o.e.t.TransportService   ] [coord-node] publish_address {<PRIVATE_IP>:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}, {<PRIVATE_IP>:9300}
[2026-05-29T12:07:05,065][INFO ][o.e.b.BootstrapChecks    ] [coord-node] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2026-05-29T12:07:05,067][INFO ][o.e.c.c.ClusterBootstrapService] [coord-node] this node has not joined a bootstrapped cluster yet; [cluster.initial_master_nodes] is set to []
[2026-05-29T12:07:15,164][INFO ][o.e.t.ClusterConnectionManager] [coord-node] transport connection to [{[other-router]}{<ID>}{<ID>}{[other-router]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{r}{8.8.2}] closed by remote
[2026-05-29T12:07:15,180][INFO ][o.e.t.ClusterConnectionManager] [coord-node] transport connection to [{[other-router]}{<ID>}{<ID>}{[other-router]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{r}{8.8.2}] closed by remote
[2026-05-29T12:07:15,999][INFO ][o.e.t.ClusterConnectionManager] [coord-node] transport connection to [{[other-router]}{<ID>}{<ID>}{[other-router]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{r}{8.8.2}] closed by remote
[2026-05-29T12:07:16,060][INFO ][o.e.t.ClusterConnectionManager] [coord-node] transport connection to [{[other-router]}{<ID>}{<ID>}{[other-router]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{r}{8.8.2}] closed by remote
[2026-05-29T12:07:16,287][INFO ][o.e.c.s.ClusterSettings  ] [coord-node] updating [cluster.routing.allocation.cluster_concurrent_rebalance] from [2] to [32]
[2026-05-29T12:07:16,288][INFO ][o.e.c.s.ClusterSettings  ] [coord-node] updating [cluster.routing.allocation.balance.index] from [0.55] to [1.5]
[2026-05-29T12:07:16,288][INFO ][o.e.c.s.ClusterSettings  ] [coord-node] updating [cluster.routing.allocation.balance.shard] from [0.45] to [0.25]
[2026-05-29T12:07:16,288][INFO ][o.e.c.s.ClusterSettings  ] [coord-node] updating [cluster.routing.allocation.node_concurrent_incoming_recoveries] from [30] to [10]
[2026-05-29T12:07:16,288][INFO ][o.e.c.s.ClusterSettings  ] [coord-node] updating [cluster.routing.allocation.node_concurrent_outgoing_recoveries] from [30] to [16]
[2026-05-29T12:07:16,288][INFO ][o.e.c.s.ClusterSettings  ] [coord-node] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [16]
[2026-05-29T12:07:16,288][INFO ][o.e.c.s.ClusterSettings  ] [coord-node] updating [indices.id_field_data.enabled] from [false] to [true]
[2026-05-29T12:07:16,288][INFO ][o.e.c.s.ClusterSettings  ] [coord-node] updating [indices.recovery.max_concurrent_file_chunks] from [2] to [8]
[2026-05-29T12:07:16,289][INFO ][o.e.c.s.ClusterSettings  ] [coord-node] updating [indices.recovery.max_concurrent_operations] from [1] to [4]
[2026-05-29T12:07:16,289][INFO ][o.e.c.s.ClusterSettings  ] [coord-node] updating [search.default_search_timeout] from [-1] to [60s]
[2026-05-29T12:07:16,293][INFO ][o.e.x.s.a.TokenService   ] [coord-node] refresh keys
[2026-05-29T12:07:16,404][INFO ][o.e.x.s.a.TokenService   ] [coord-node] refreshed keys
[2026-05-29T12:07:16,437][INFO ][o.e.h.AbstractHttpServerTransport] [coord-node] publish_address {<PRIVATE_IP>:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}, {<PRIVATE_IP>:9200}
[2026-05-29T12:07:16,438][INFO ][o.e.n.Node               ] [coord-node] started {[coord-node]}{<ID>}{<ID>}{[coord-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{r}{8.8.2}{zone=<ZONE>, xpack.installed=true}
[2026-05-29T12:07:16,440][INFO ][o.e.n.Node               ] [coord-node] stopping ...
[2026-05-29T12:07:16,449][ERROR][o.e.i.g.GeoIpDownloader  ] [coord-node] failed to create geoip downloader task org.elasticsearch.node.NodeClosedException: node closed {[coord-node]}{<ID>}{<ID>}{[coord-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{r}{8.8.2}{zone=<ZONE>, xpack.installed=true}
[2026-05-29T12:07:16,461][INFO ][o.e.c.c.JoinHelper       ] [coord-node] failed to join {[master-node]}{<ID>}{<ID>}{[master-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{m}{8.8.2}; org.elasticsearch.transport.NodeDisconnectedException: [master-node] <PRIVATE_IP>:9300 internal:cluster/coordination/join disconnected
[2026-05-29T12:07:16,463][INFO ][o.e.c.c.Coordinator      ] [coord-node] master node [{[master-node]}{<ID>}{<ID>}{[master-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{m}{8.8.2}] disconnected, restarting discovery
[2026-05-29T12:07:16,551][INFO ][o.e.n.Node               ] [coord-node] stopped
[2026-05-29T12:07:16,552][INFO ][o.e.n.Node               ] [coord-node] closing ...
[2026-05-29T12:07:16,562][INFO ][o.e.n.Node               ] [coord-node] closed
[2026-05-29T12:07:48,750][INFO ][o.e.n.Node               ] [coord-node] version[8.8.2], pid[3259], build[tar/98e1271edf932a480e4262a471281f1ee295ce6b/2023-06-26T05:16:16.196344851Z], OS[Linux/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/20.0.1/20.0.1+9-29]
[2026-05-29T12:07:54,851][WARN ][stderr                   ] [coord-node] May 29, 2026 12:07:54 PM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
[2026-05-29T12:07:54,856][INFO ][o.e.e.NodeEnvironment    ] [coord-node] heap size [31gb], compressed ordinary object pointers [true]
[2026-05-29T12:07:54,856][INFO ][o.e.e.NodeEnvironment    ] [coord-node] using [1] data paths, mounts [[/esdata (/dev/mapper/es-data-lv)]], net usable_space [185.3gb], net total_space [191.8gb], types [ext4]
[2026-05-29T12:07:54,896][INFO ][o.e.n.Node               ] [coord-node] node name [coord-node], node ID [<ID>], cluster name [<CLUSTER_NAME>], roles [remote_cluster_client]
[2026-05-29T12:07:56,600][INFO ][o.e.x.s.Security         ] [coord-node] Security is enabled
[2026-05-29T12:07:57,167][INFO ][o.e.x.p.ProfilingPlugin  ] [coord-node] Profiling is enabled
[2026-05-29T12:07:57,621][INFO ][o.e.t.n.NettyAllocator   ] [coord-node] creating NettyAllocator with the following configs: [ name=elasticsearch_configured, chunk_size=1mb, factors={"false"=>true}, suggested_max_allocation_size=1mb
[2026-05-29T12:07:57,643][INFO ][o.e.i.r.RecoverySettings ] [coord-node] using rate limit [100mb] with [ default=100mb, max=0b, read=0b, write=0b
[2026-05-29T12:07:57,671][INFO ][o.e.d.DiscoveryModule    ] [coord-node] using discovery type [multi-node] and seed hosts providers [settings, ec2]
[2026-05-29T12:07:58,337][INFO ][o.e.n.Node               ] [coord-node] initialized
[2026-05-29T12:07:58,338][INFO ][o.e.n.Node               ] [coord-node] starting ...
[2026-05-29T12:07:58,343][INFO ][o.e.x.d.l.DeprecationIndexingComponent] [coord-node] deprecation component started
[2026-05-29T12:07:58,423][INFO ][o.e.t.TransportService   ] [coord-node] publish_address {<PRIVATE_IP>:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}, {<PRIVATE_IP>:9300}
[2026-05-29T12:07:58,512][INFO ][o.e.b.BootstrapChecks    ] [coord-node] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2026-05-29T12:07:58,515][INFO ][o.e.c.c.ClusterBootstrapService] [coord-node] this node has not joined a bootstrapped cluster yet; [cluster.initial_master_nodes] is set to []
[2026-05-29T12:08:08,524][WARN ][o.e.c.c.ClusterFormationFailureHelper] [coord-node] master not discovered yet: have discovered [{[coord-node]}{<ID>}{<ID>}{[coord-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{r}{8.8.2}, {[master-node]}{<ID>}{<ID>}{[master-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{m}{8.8.2}, {[master-node]}{<ID>}{<ID>}{[master-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{m}{8.8.2}, {[master-node]}{<ID>}{<ID>}{[master-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{m}{8.8.2}]; discovery will continue using [<HOSTS_PROVIDER_IPS_OMITTED>] from hosts providers and [] from last-known cluster state; node term 126, last-accepted version 0 in term 0; joining [{[master-node]}{<ID>}{<ID>}{[master-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{m}{8.8.2}] in term [126] has status [waiting for response] after [5.8s/5809ms]; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.8/discovery-troubleshooting.html
[2026-05-29T12:08:18,526][WARN ][o.e.c.c.ClusterFormationFailureHelper] [coord-node] master not discovered yet: have discovered [{[coord-node]}{<ID>}{<ID>}{[coord-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{r}{8.8.2}, {[master-node]}{<ID>}{<ID>}{[master-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{m}{8.8.2}, {[master-node]}{<ID>}{<ID>}{[master-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{m}{8.8.2}, {[master-node]}{<ID>}{<ID>}{[master-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{m}{8.8.2}]; discovery will continue using [<HOSTS_PROVIDER_IPS_OMITTED>] from hosts providers and [] from last-known cluster state; node term 126, last-accepted version 0 in term 0; joining [{[master-node]}{<ID>}{<ID>}{[master-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{m}{8.8.2}] in term [126] has status [waiting for response] after [15.8s/15822ms]; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.8/discovery-troubleshooting.html
[2026-05-29T12:08:28,523][WARN ][o.e.n.Node               ] [coord-node] timed out while waiting for initial discovery state - timeout: 30s
[2026-05-29T12:08:28,527][WARN ][o.e.c.c.ClusterFormationFailureHelper] [coord-node] master not discovered yet: have discovered [{[coord-node]}{<ID>}{<ID>}{[coord-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{r}{8.8.2}, {[master-node]}{<ID>}{<ID>}{[master-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{m}{8.8.2}, {[master-node]}{<ID>}{<ID>}{[master-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{m}{8.8.2}, {[master-node]}{<ID>}{<ID>}{[master-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{m}{8.8.2}]; discovery will continue using [<HOSTS_PROVIDER_IPS_OMITTED>] from hosts providers and [] from last-known cluster state; node term 126, last-accepted version 0 in term 0; joining [{[master-node]}{<ID>}{<ID>}{[master-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{m}{8.8.2}] in term [126] has status [waiting for response] after [25.8s/25830ms]; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.8/discovery-troubleshooting.html
[2026-05-29T12:08:28,534][INFO ][o.e.h.AbstractHttpServerTransport] [coord-node] publish_address {<PRIVATE_IP>:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}, {<PRIVATE_IP>:9200}
[2026-05-29T12:08:28,535][INFO ][o.e.n.Node               ] [coord-node] started {[coord-node]}{<ID>}{<ID>}{[coord-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{r}{8.8.2}{zone=<ZONE>, xpack.installed=true}
[2026-05-29T12:08:29,700][WARN ][r.suppressed             ] [coord-node] path: /_stats/store,indexing,search, params: { format=json, metric=store} indexing search org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:32,239][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:32,543][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:32,664][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:32,889][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:32,997][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
# omitted 127 additional /_search params:{} block lines in second 2026-05-29T12:08:32
# omitted 1 additional /[redacted-index]/_search block lines in second 2026-05-29T12:08:32
[2026-05-29T12:08:33,045][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:33,054][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:33,142][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:33,209][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=10000} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:33,957][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:33,995][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
# omitted 10 additional /[redacted-index]/_search block lines in second 2026-05-29T12:08:33
# omitted 241 additional /_search params:{} block lines in second 2026-05-29T12:08:33
[2026-05-29T12:08:34,018][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:34,025][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:34,267][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { ignore_unavailable=true, index=[redacted-index], routing=<REDACTED>, timeout=30s} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:34,362][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:34,977][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:34,983][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
# omitted 277 additional /_search params:{} block lines in second 2026-05-29T12:08:34
# omitted 11 additional /[redacted-index]/_search block lines in second 2026-05-29T12:08:34
[2026-05-29T12:08:35,004][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:35,007][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:35,043][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:35,068][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { ignore_unavailable=true, index=[redacted-index], routing=<REDACTED>, timeout=30s} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:35,963][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:35,978][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
# omitted 318 additional /_search params:{} block lines in second 2026-05-29T12:08:35
# omitted 11 additional /[redacted-index]/_search block lines in second 2026-05-29T12:08:35
[2026-05-29T12:08:36,047][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:36,166][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=10000} [redacted-name] org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:36,232][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { ignore_unavailable=true, index=[redacted-index], routing=<REDACTED>, timeout=30s} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:36,890][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:36,990][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
# omitted 12 additional /[redacted-index]/_search block lines in second 2026-05-29T12:08:36
# omitted 154 additional /_search params:{} block lines in second 2026-05-29T12:08:36
[2026-05-29T12:08:37,006][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:37,010][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:37,014][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:37,512][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { ignore_unavailable=true, index=[redacted-index], routing=<REDACTED>, timeout=30s} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:37,997][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: {index=[redacted-index]} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:37,998][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
# omitted 88 additional /_search params:{} block lines in second 2026-05-29T12:08:37
# omitted 11 additional /[redacted-index]/_search block lines in second 2026-05-29T12:08:37
[2026-05-29T12:08:38,007][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:38,008][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:38,015][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:38,017][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:38,898][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:38,999][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
# omitted 74 additional /_search params:{} block lines in second 2026-05-29T12:08:38
# omitted 16 additional /[redacted-index]/_search block lines in second 2026-05-29T12:08:38
[2026-05-29T12:08:39,011][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:39,053][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:39,096][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:39,260][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:39,568][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:39,611][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
# omitted 9 additional /_search params:{} block lines in second 2026-05-29T12:08:39
[2026-05-29T12:08:40,045][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} [redacted-name] org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:40,378][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:40,401][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} [redacted-name] org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:40,475][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:40,991][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:40,997][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
# omitted 4 additional /[redacted-index]/_search block lines in second 2026-05-29T12:08:40
# omitted 11 additional /_search params:{} block lines in second 2026-05-29T12:08:40
[2026-05-29T12:08:41,005][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:41,006][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:41,008][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:41,010][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:41,973][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:41,999][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
# omitted 32 additional /[redacted-index]/_search block lines in second 2026-05-29T12:08:41
# omitted 76 additional /_search params:{} block lines in second 2026-05-29T12:08:41
[2026-05-29T12:08:42,073][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:42,104][WARN ][r.suppressed             ] [coord-node] path: /[redacted-index]/_search, params: { allow_partial_search_results=false, index=[redacted-index], routing=<REDACTED>, timeout=10s, track_total_hits=false} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: SERVICE_UNAVAILABLE/1/state not recovered / initialized ;
[2026-05-29T12:08:42,131][INFO ][o.e.c.s.ClusterSettings  ] [coord-node] updating [cluster.routing.allocation.cluster_concurrent_rebalance] from [2] to [32]
[2026-05-29T12:08:42,132][INFO ][o.e.c.s.ClusterSettings  ] [coord-node] updating [cluster.routing.allocation.balance.index] from [0.55] to [1.5]
[2026-05-29T12:08:42,132][INFO ][o.e.c.s.ClusterSettings  ] [coord-node] updating [cluster.routing.allocation.balance.shard] from [0.45] to [0.25]
[2026-05-29T12:08:42,132][INFO ][o.e.c.s.ClusterSettings  ] [coord-node] updating [cluster.routing.allocation.node_concurrent_incoming_recoveries] from [30] to [10]
[2026-05-29T12:08:42,132][INFO ][o.e.c.s.ClusterSettings  ] [coord-node] updating [cluster.routing.allocation.node_concurrent_outgoing_recoveries] from [30] to [16]
[2026-05-29T12:08:42,132][INFO ][o.e.c.s.ClusterSettings  ] [coord-node] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [16]
[2026-05-29T12:08:42,132][INFO ][o.e.c.s.ClusterSettings  ] [coord-node] updating [indices.id_field_data.enabled] from [false] to [true]
[2026-05-29T12:08:42,132][INFO ][o.e.c.s.ClusterSettings  ] [coord-node] updating [indices.recovery.max_concurrent_file_chunks] from [2] to [8]
[2026-05-29T12:08:42,132][INFO ][o.e.c.s.ClusterSettings  ] [coord-node] updating [indices.recovery.max_concurrent_operations] from [1] to [4]
[2026-05-29T12:08:42,132][INFO ][o.e.c.s.ClusterSettings  ] [coord-node] updating [search.default_search_timeout] from [-1] to [60s]
[2026-05-29T12:08:42,132][WARN ][r.suppressed             ] [coord-node] path: /_search, params: {} org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
[2026-05-29T12:08:42,137][INFO ][o.e.x.s.a.TokenService   ] [coord-node] refresh keys
[2026-05-29T12:08:42,322][INFO ][o.e.x.s.a.TokenService   ] [coord-node] refreshed keys
[2026-05-29T12:08:43,764][INFO ][o.e.c.s.ClusterApplierService] [coord-node] added {{[other-router]}{<ID>}{<ID>}{[other-router]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{r}{8.8.2}}, term: 126, version: 7701255, reason: ApplyCommitRequest{term=126, version=7701255, sourceNode={[master-node]}{<ID>}{<ID>}{[master-node]}{<PRIVATE_IP>}{<PRIVATE_IP>:9300}{m}{8.8.2}{zone=<ZONE>, xpack.installed=true}}
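
For anyone wanting to put a number on the blocked window in a log like this, here is a minimal sketch in Python. It only assumes the timestamp format and the block message shown in the log lines above; the function name `blocked_window` and the abbreviated sample lines are illustrative and not part of any Elasticsearch tooling.

```python
import re
from datetime import datetime

# Timestamp prefix of an Elasticsearch server log line, e.g. [2026-05-29T12:08:36,232]
TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}),(\d{3})\]")
BLOCK = "state not recovered / initialized"


def blocked_window(lines):
    """Return (first, last, count) for timestamped lines mentioning the block, or None."""
    stamps = []
    for line in lines:
        m = TS.match(line)
        if m and BLOCK in line:
            ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
            stamps.append(ts.replace(microsecond=int(m.group(2)) * 1000))
    if not stamps:
        return None
    return min(stamps), max(stamps), len(stamps)


# Abbreviated sample lines (hypothetical shortening of the real log format).
sample = [
    "[2026-05-29T12:08:36,232][WARN ][r.suppressed] [coord-node] path: /_search blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];",
    "[2026-05-29T12:08:42,132][WARN ][r.suppressed] [coord-node] path: /_search blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];",
    "[2026-05-29T12:08:43,764][INFO ][o.e.c.s.ClusterApplierService] [coord-node] added node",
]
first, last, count = blocked_window(sample)
print(f"{count} blocked requests over {(last - first).total_seconds():.3f}s")
```

Lines without a leading timestamp (such as the "# omitted" summaries) are skipped, so the count only reflects lines that are actually present in the input.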

Thanks, that helps a lot. The problem is here:

This node is taking well over the usual 30s default timeout to join the cluster. As a temporary workaround you can try setting discovery.initial_state_timeout: 10m (or maybe even longer), but really this indicates a more fundamental problem on your master node, as it shouldn't take so long to join a node into the cluster.

Thanks, this is super helpful. A few follow-ups:

  1. We could not find discovery.initial_state_timeout in the public 8.8 ES docs, although we can see it in the 8.8 source. Is it a supported static node setting that should be set in elasticsearch.yml and applied via node restart?
  2. As a temporary mitigation, is the right mental model that this only keeps the node waiting longer before starting HTTP on 9200, rather than actually fixing readiness?
    1. Earlier you mentioned readiness.port. From the 8.8 source it looks like, on a non-master node, that port only opens once the local cluster state sees an elected master. Would that be the better fix to steer towards here? The reason I ask is that increasing discovery.initial_state_timeout still leaves the same fallback if a future join takes longer than whatever timeout we pick.
  3. Our current admission path uses HTTP/L7 health checks. readiness.port appears to be TCP-only. Is there any recommended node-local HTTP/L7 check with similar semantics, or is the intended pattern to switch the health check to a TCP probe on readiness.port?
  4. More broadly, this cluster is fairly large (hundreds of nodes overall). Is a >30s join delay ever expected at that scale, or should we treat it as a sign of master-side pressure? If so, what are the first things you would inspect, and are there any docs on diagnosing slow joins / slow cluster-state publication?
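
To make the TCP-probe idea in point 3 concrete, this is roughly what such a check amounts to, sketched in Python. It relies only on the behaviour described above (the readiness port accepts connections once the node considers itself ready). The function name `readiness_port_open` and port 9399 are placeholders I made up; use whatever value readiness.port is set to on the node.

```python
import socket


def readiness_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable: treat as not ready.
        return False


# 9399 is a placeholder, not a default: substitute the configured readiness.port.
print("ready" if readiness_port_open("127.0.0.1", 9399) else "not ready")
```

Most load balancers can do this natively as a TCP health check, so the code is only meant to show the semantics: a successful connect means ready, anything else means keep the node out of rotation.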

Huh that's a good point. It's very rare to need to change this setting, I guess nobody has noticed this before now.

Yes.

It was in fact you who first mentioned this here. I didn't think this was a good idea because it was only intended for internal use on Elastic Cloud, but it's documented here as a tech preview feature (and here in 8.19 as a GA feature) so I guess it's fine to use.

That's how we do it within Elastic Cloud.

Not really, although 8.8 is very old and I don't keep track of bugs we've fixed since then. The troubleshooting in this area is likely easier in 8.19 (and easier still in 9.x) so if it's not obvious from the logs what's going on with the master then your best bet will be to upgrade.

Thanks for answering all the queries patiently.

One clarification on our environment: we self-host Elasticsearch on dedicated host sets and are not using Elastic Cloud, so I want to make sure the same guidance applies there as well.

For 8.8.2 specifically, if we increase discovery.initial_state_timeout, is that the right stopgap for now, or would you also recommend changing the external LB health probe on that version? Since we need to stay on 8.8.2 for the time being, what would you consider a suitable lightweight HTTP/L7 probe for coordinator-only nodes behind an external LB? GET / seems too shallow, but a real /_search seems heavier/riskier than ideal for a health probe.

Longer term, once we can upgrade, we can evaluate readiness.port. Since that is TCP-based, is the long-term answer simply to use that dedicated TCP readiness check for coordinator nodes behind the LB?

Separately, I will investigate why cluster join / local cluster-state application is delayed in our environment.

Yeah I think it's ok to move to readiness.port even on 8.8. I advised against it earlier because I thought it was an internal-only feature but the docs suggest otherwise. I don't believe this feature changed significantly before going GA. You do need to move off 8.8 ASAP too, this version is no longer maintained.
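
If the load balancer can only do HTTP/L7 checks, one possible bridge is a tiny node-local HTTP endpoint that answers 200 when the readiness port accepts a TCP connection and 503 otherwise. To be clear, this is not something recommended anywhere in this thread, just a sketch under stated assumptions: the names `tcp_ready` and `make_server` and the ports in the usage note are hypothetical, and only the Python standard library is used.

```python
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer


def tcp_ready(addr, timeout=1.0):
    """Return True if a TCP connection to addr (host, port) succeeds."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False


def make_server(bind_host, listen_port, readiness_addr):
    """Build an HTTP server that mirrors the TCP readiness state as 200/503."""

    class ReadyHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Probe the readiness port on every request; no caching.
            status = 200 if tcp_ready(readiness_addr) else 503
            self.send_response(status)
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, fmt, *args):
            pass  # keep health-check traffic out of stderr

    return HTTPServer((bind_host, listen_port), ReadyHandler)
```

Usage would be something like `make_server("0.0.0.0", 9280, ("127.0.0.1", 9399)).serve_forever()`, with 9280 and 9399 replaced by your own probe port and readiness.port value. The trade-off is an extra process to run and monitor on each coordinator node, so a native TCP health check is simpler wherever the LB supports it.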