I am trying to choose the right readiness check for Elasticsearch coordinator nodes behind a load balancer / proxy.
We are on Elasticsearch 8.8.2. The coordinator nodes serve search traffic behind a load balancer. The current health check is shallow: TCP on port 9200 and/or GET /.
I found older guidance suggesting that GET / is enough for a load balancer health check. In my case, that seems true for liveness, but not necessarily for readiness. What I observed during node restart / bootstrap is:
the node starts accepting TCP on :9200
GET / returns 200
but for a short window, real search requests can still fail with:
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
So my question is: what is the recommended node-local readiness check for a coordinator node if I want to avoid routing live search traffic to it until it has rejoined / recovered enough to serve /_search successfully?
I do not want a cluster-wide check that could mark all nodes unhealthy just because the cluster is yellow/red elsewhere.
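To make the requirement concrete, here is a minimal sketch of a node-local probe, not a confirmed recommendation. It assumes the probe can send one cheap search to a single node and that a blocked node answers with a non-200 status and an "error" object in the JSON body (the ClusterBlockException above is of that kind). The index name my-index, the size=0 query, and the function names are hypothetical, and authentication/TLS are left out.

```python
import json
import urllib.error
import urllib.request


def is_ready(status_code: int, body: str) -> bool:
    """Ready only if the node answered 200 with a JSON body that has no error."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all: treat as not ready.
        return False
    # A successful search response has no top-level "error" key.
    return isinstance(payload, dict) and "error" not in payload


def probe(base_url: str,
          path: str = "/my-index/_search?size=0",  # hypothetical cheap search
          timeout: float = 2.0) -> bool:
    """Send one search to a single node and classify the answer."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return is_ready(resp.status, resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Non-2xx answers (for example 503 while the state is not recovered).
        return is_ready(err.code, err.read().decode("utf-8", "replace"))
    except OSError:
        # Connection refused, timeout, DNS failure: not ready.
        return False
```

A check like this only talks to the node being probed, so a yellow or red index elsewhere would not by itself fail it, as long as the probed index is searchable.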
I don't know the best "readiness" check logic/settings for your LB. Hopefully someone else can advise there.
yellow and red are completely different animals. red is ... bad.
yellow has a specific meaning: at least one shard of one or more indices is currently missing a replica. But all indices should be writeable/searchable if cluster state is yellow, as (eg) happens naturally when a rolling restart is ongoing.
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
As far as I knew, that exception should not appear when the cluster state is green or yellow. Am I just wrong on this? Could node N generate this exception even after node M has returned "cluster is yellow/green"? I am happy to be corrected if so, every day is a school day!
I note in passing that the /_cluster/health endpoint has a ?wait_for_status=X arg, where X can be green or yellow. There is also a per-index variant, /_cluster/health/index-name. Either call should only return a 200 if the desired state or better has been reached.
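A small sketch of how those calls could be built and interpreted, in case it helps. The helper names and the localhost URL are made up for illustration. wait_for_status, timeout and local are documented query parameters of the cluster health API. Rather than relying on the HTTP status code alone, the sketch reads the status and timed_out fields from the JSON body.

```python
from typing import Optional
from urllib.parse import urlencode

# Health colours ordered from worst to best.
_RANK = {"red": 0, "yellow": 1, "green": 2}


def health_url(base_url: str,
               index: Optional[str] = None,
               wait_for_status: str = "yellow",
               timeout: str = "5s",
               local: bool = False) -> str:
    """Build a cluster-wide or per-index health URL."""
    path = "/_cluster/health" + (f"/{index}" if index else "")
    params = {"wait_for_status": wait_for_status, "timeout": timeout}
    if local:
        # Ask the receiving node for its own view instead of the master's.
        params["local"] = "true"
    return f"{base_url}{path}?{urlencode(params)}"


def meets_status(body: dict, wanted: str) -> bool:
    """True if the response reached `wanted` or better without timing out."""
    if body.get("timed_out", True):
        return False
    return _RANK.get(body.get("status"), -1) >= _RANK[wanted]


print(health_url("http://localhost:9200", index="index-name"))
```

Note that a cluster-wide call reflects every index, so the per-index form is the closer fit if only some indices matter for search traffic.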
Thanks for getting back to me.
That exception did not occur while the cluster status was changing, but after a sudden addition of multiple router nodes.
This is the current scenario:
The clients connect to the coordinator nodes via Envoy, which uses an HTTP health check (GET / on port 9200) to decide whether the nodes are healthy enough to serve traffic. What can we update here?
As I said, I am not able to confirm your best options with Elasticsearch behind your (Envoy) load balancer, and even if it were HAProxy or another LB I still would not know, especially given sudden and significant changes in cluster topology. You are certainly right that a GET to / returning 200 does not mean "ready for indexing/querying/....".
I hope someone else can assist, and I wish you luck. Grateful for a well written problem description too.