Readiness check for Elasticsearch coordinator nodes behind a load balancer?

Thanks, this is very helpful.
We have one practical constraint on our side: our LB health checking supports HTTP or gRPC health checks, but not raw TCP. We agree that plain GET / on 9200 is too shallow here, since it can go healthy before a coordinator-only node is actually ready to serve search traffic after restart.
Does Elasticsearch support any HTTP health check with semantics similar to readiness.port for this case? If not, what would you recommend as the safest approach, given that the load balancer can only do HTTP or gRPC health checks?

One fallback we are considering is:

  • use readiness.port directly wherever TCP health checks are available
  • where only HTTP checks are possible, use a small localhost-only health wrapper on each coordinator node that reports healthy only if it can connect to readiness.port

Please let me know your thoughts about this approach. We want to make sure we do not build infrastructure glue around readiness.port if there is a better Elasticsearch-supported approach.
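
For concreteness, the wrapper idea in the second bullet could be sketched roughly as below. This is only a minimal illustration, not an Elasticsearch-supported component: the port numbers (9399 for readiness.port, 9400 for the wrapper), the bind address, and the names readiness_port_open and HealthHandler are all assumptions made for the example. It returns 200 only when a TCP connection to the local readiness port succeeds, and 503 otherwise.

```python
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed values for illustration only; adjust to your environment.
READINESS_HOST = "127.0.0.1"
READINESS_PORT = 9399      # hypothetical; must match readiness.port in elasticsearch.yml
WRAPPER_BIND = ""          # all interfaces; restrict to the address the LB probes
WRAPPER_PORT = 9400        # hypothetical port the LB health check targets
CONNECT_TIMEOUT_S = 1.0


def readiness_port_open(host=READINESS_HOST, port=READINESS_PORT,
                        timeout=CONNECT_TIMEOUT_S):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or otherwise unreachable.
        return False


class HealthHandler(BaseHTTPRequestHandler):
    """Translate the TCP readiness signal into an HTTP status code."""

    def do_GET(self):
        ready = readiness_port_open()
        body = b"ready\n" if ready else b"not ready\n"
        self.send_response(200 if ready else 503)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        # Suppress per-request logging; LB probes are frequent.
        pass


if __name__ == "__main__":
    HTTPServer((WRAPPER_BIND, WRAPPER_PORT), HealthHandler).serve_forever()
```

The probe is evaluated on every request rather than cached, so the HTTP result tracks the readiness port as closely as the LB probe interval allows. If the wrapper process itself dies, the LB check fails and the node is taken out of rotation, which is the conservative outcome.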

Been reading this thread with interest.

I find it slightly curious you don’t seem overly concerned that your node, maybe nodes, are taking so long to join the cluster. For me, I’d be focusing on that with priority as it might indicate a deeper issue. It’s unexpected and currently unexplained. Without that issue, would this thread even exist? Also, very possibly aspects of this are already addressed in more up-to-date releases.

It’s obviously up to you, you can curate things as you wish. But complexity is very often the enemy of maintainability, and therefore reliability.

Good luck, was an interesting thread.

Thanks for the suggestion. I agree we need to fix the deeper node issues.

Since it can take some time to deep dive into them, I was looking for an immediate solution to avoid any likely production impact in the near future.

No. The closest alternative I can think of is GET / with a longer timeout.

How many times have you seen the issue? How often are you doing scale-outs that add multiple new coordinator nodes simultaneously? Do you see it in other scenarios, outside of the nodes being slow to join the cluster?

If you do deep dive and, e.g., think you hit an Elasticsearch bug, there seems a good chance you’d be asked to upgrade anyway to get any potential fix, or it may already have been fixed in interim releases. Or you can try to adapt procedures so that the triggering scenarios are less likely.

Again, not saying any specific way is right or wrong. I’m not in your shoes. Your boss is not my boss :rofl:

@DavidTurner Thanks. When you say GET / with a longer timeout, do you mean a longer HTTP health-check timeout for the GET / probe itself, or a longer Elasticsearch startup wait for initial cluster state?

I want to make sure I understand the failure mode correctly before relying on this as a production readiness signal.

@RainTown We have seen this issue only when we had to scale up multiple router nodes at once (roughly once a month to once a quarter). This is mainly for reindexing purposes. Apart from this, we have never faced it.
A cluster upgrade is in our pipeline, but towards the end of the year. The clusters are pretty huge and serve a variety of use cases, so it needs to be done carefully.

Thanks for the clarity.

My view is probably clear by now and I don’t want to labour it (probably too late!) - you have processes that potentially cause you issues, so adapt your processes. There’s no law saying you have to add multiple nodes all at once.

Everything here seems in “workaround” space, rather than “solution” space. So views on what’s best will naturally vary.

Reminds me of an old joke:

The patient: "Doctor, Doctor - it hurts when I do this."
The doctor: "Don't do it then!!"

I mean setting discovery.initial_state_timeout: 10m (or maybe even longer) so that Elasticsearch more reliably waits for the initial cluster state before opening the HTTP port.

Thanks David.
One alternate solution we are considering is this: for load-balancing layers that only support HTTP health checks, run a small node-local HTTP health endpoint on the coordinator node that reports healthy only if it can connect to readiness.port on localhost.

I realize this would be infrastructure glue on our side rather than an Elasticsearch feature. From Elasticsearch’s point of view, would this be a reasonable way to preserve the intended readiness semantics of readiness.port, or would you see any problem with that approach?

I don't see how Elasticsearch could even tell this is what you're doing :slight_smile: Seems like a bit of a hack but if there's really no alternative then it sounds like it'd work.

Thanks, this is helpful. One more piece of context from our investigation, plus a follow-up question.

In our setup, bringing up a newly provisioned node is a two-step process. The host first goes through the normal bootstrap/provisioning path, then performs a planned reboot before the node is fully settled. On the clearest failing coordinating node, Elasticsearch first started during that initial bootstrap window, then shut down, and then started again. That means Elasticsearch was allowed to start before the host had fully completed its bring-up sequence. During the incident we were also scaling out 24 coordinating-only nodes in a cluster with several hundred nodes, so multiple fresh nodes were still in reboot and restart churn while trying to join.

At the same time, the elected master was delayed publishing cluster state, and its pending_tasks_total rose materially in the same window, from 22 to 64 to 77. The same follower data nodes the master was waiting on were also logging same-window transport failures to coordinating nodes, so our current hypothesis is that the reboot/restart churn from that fresh-node wave amplified cluster-state coordination backlog.

Five affected coordinating nodes then did not receive a usable cluster state before their local discovery.initial_state_timeout of 30s elapsed, so they exposed 9200 and were admitted by the load balancer before they were actually ready. We have addressed the startup-ordering issue on our side by gating Elasticsearch startup on bootstrap completion.

I wanted to confirm one point about readiness.port. For a coordinator node, is it effectively tied to whether the node currently sees an elected master in its local cluster state? If so, during a brief no-master or master-election window, is it expected that coordinating-only nodes become unready until a master is visible again?

If that understanding is correct, what would you recommend as the right LB admission signal for coordinating-only nodes in a read-heavy deployment? Is readiness.port still the preferred signal, or is there a better readiness pattern for that case?