Readiness check for Elasticsearch coordinator nodes behind a load balancer?

Hello,

I am trying to choose the right readiness check for Elasticsearch coordinator nodes behind a load balancer / proxy.

We are on Elasticsearch 8.8.2. The coordinator nodes serve search traffic behind a load balancer. The current health check is shallow: TCP on port 9200 and/or GET /.

I found older guidance suggesting that GET / is enough for a load balancer health check. In my case, that seems true for liveness, but not necessarily for readiness. What I observed during node restart / bootstrap is:

  • the node starts accepting TCP on :9200
  • GET / returns 200
  • but for a short window, real search requests can still fail with:
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];

So my question is: what is the recommended node-local readiness check for a coordinator node if I want to avoid routing live search traffic to it until it has rejoined / recovered enough to serve /_search successfully?
I do not want a cluster-wide check that could mark all nodes unhealthy just because the cluster is yellow/red elsewhere.

Welcome to the forum @sagar_cenation

I don't know the best "readiness" check logic/settings for your LB. Hopefully someone else can advise there.

yellow and red are completely different animals. red is ... bad.

yellow has a specific meaning: one or more shards of some index are currently missing a replica. But all indices should still be writeable/searchable while the cluster state is yellow, as happens naturally during a rolling restart, for example.

org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];

As far as I knew, that exception should not appear when the cluster state is green or yellow. Am I just wrong on this? Could node N generate this exception even after node M has returned "cluster is yellow/green"? I am happy to be corrected if so, every day is a school day!

I note in passing that the /_cluster/health endpoint takes a ?wait_for_status=X argument, where X can be green or yellow. There is also a per-index form, /_cluster/health/index-name. Either call should only return a 200 if the desired state or better has been reached.
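For illustration, here is a minimal Python sketch of building that health URL and interpreting the response. The helper names health_url and health_ok, the base URL, and the 2s timeout are my own assumptions, not anything from your setup. I am not certain every version signals a wait timeout through the HTTP status alone, so the sketch also checks the timed_out field in the response body.

```python
import json
from urllib.parse import urlencode


def health_url(base, wait_for_status="yellow", timeout="2s", index=None):
    """Build a /_cluster/health URL (optionally per-index) with wait_for_status."""
    path = "/_cluster/health" + (f"/{index}" if index else "")
    query = urlencode({"wait_for_status": wait_for_status, "timeout": timeout})
    return f"{base.rstrip('/')}{path}?{query}"


def health_ok(http_status, body):
    """True only for a 200 response whose JSON body does not report a timeout."""
    if http_status != 200:
        return False
    try:
        doc = json.loads(body)
    except ValueError:
        # Non-JSON body: treat as not healthy rather than guessing.
        return False
    return isinstance(doc, dict) and not doc.get("timed_out", False)
```

Note that this still reflects cluster-level status, so on its own it may not be the node-local check you asked for.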

Thanks for getting back to me.
That exception occurred not while the cluster status was changing, but after the sudden addition of multiple router nodes.

This is the current scenario:
The clients connect to the coordinator nodes via Envoy, which uses an HTTP health check (GET / on port 9200) to decide whether the nodes are healthy enough to serve traffic. What can we update here?

Thanks for clarifying.

As I said, I am not able to confirm your best options with Elasticsearch behind your (Envoy) load balancer, and even if it were HAProxy or another LB I still would not know, especially given sudden and significant changes in cluster topology. You are certainly right that a GET to / returning 200 does not mean "ready for indexing/querying/...".
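To make that distinction concrete, here is a rough Python sketch of how a deeper probe could tell "answers HTTP" apart from "blocked by a cluster block". It is not a recommended check. The function names fetch and not_ready_reason are hypothetical, and which URL to probe on the node is deliberately left to the caller. The sketch assumes the failure arrives as a non-200 response with the usual JSON error document, whose type is cluster_block_exception.

```python
import json
import urllib.error
import urllib.request


def fetch(url, timeout=2.0):
    """GET url and return (status, body_text); status 0 means no HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a body, e.g. the JSON error document.
        return err.code, err.read().decode("utf-8", "replace")
    except OSError:
        # Connection refused, timeout, DNS failure (URLError subclasses OSError).
        return 0, ""


def not_ready_reason(status, body):
    """Return None when the response looks ready, else a short reason string."""
    if status == 200:
        return None
    if status == 0:
        return "no HTTP response"
    try:
        error = json.loads(body).get("error", {})
    except (ValueError, AttributeError):
        # Body was not JSON, or not a JSON object.
        error = {}
    if isinstance(error, dict) and error.get("type") == "cluster_block_exception":
        return "cluster block: " + str(error.get("reason", ""))
    return f"HTTP {status}"
```

A caller would pass the result of fetch(...) for its chosen probe URL into not_ready_reason(...) and keep the node out of rotation whenever a reason comes back.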

I hope someone else can assist, and I wish you luck. Grateful for a well written problem description too.