Stats endpoint for checking a single node's health


#1

I have an Elasticsearch cluster with 6 nodes and a load balancer that sends requests to those nodes. The load balancer needs to query the nodes periodically to check their health (whether they are alive and ready to serve requests). For that purpose, I was using the _nodes/stats endpoint with a timeout of 2 seconds.

With this configuration, when one node went down, the load balancer marked all the nodes as unhealthy instead of only the one that was failing. This made me think that the _nodes/stats endpoint internally queries all the nodes in the cluster, and since one of them was not responding, every health check was failing.

Looking at the docs (https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html) I saw the _nodes/_local/stats endpoint, whose requests should be resolved on the node that receives them, without asking the other nodes for anything.

My question is: is my assumption correct that the _nodes/stats endpoint queries all the nodes internally? If so, changing the endpoint to the _local one should solve the issue.

Also, is this the proper way to get the health of a single Elasticsearch node? I only want to know whether this node is able to reply to requests.


(Mark Walkom) #2

Yes it does. Try _nodes/_local as you mention - https://www.elastic.co/guide/en/elasticsearch/reference/5.4/cluster.html#cluster-nodes

Otherwise use the top-level endpoint at IP:9200. If the node replies then it's OK; if not, then something is up.
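
As a rough illustration of that second suggestion, here is a minimal Python sketch of a check that sends a GET to a node's HTTP port with a 2 second timeout and treats anything other than a 200 reply as unhealthy. The function name `node_is_alive` and its parameters are made up for this example; it assumes plain HTTP with no authentication, and it is not how any particular load balancer implements its checks.

```python
import http.client
import urllib.error
import urllib.request


def node_is_alive(host: str, port: int = 9200, path: str = "/",
                  timeout: float = 2.0) -> bool:
    """Return True if the node answers GET <path> with HTTP 200 within `timeout` seconds.

    Hypothetical helper: assumes plain HTTP and no authentication.
    """
    url = f"http://{host}:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure, malformed reply or an
        # HTTP error status: treat all of them as "unhealthy".
        return False
```

Calling `node_is_alive("10.0.0.1")` (the address is a placeholder) checks the root endpoint; passing `path="/_nodes/_local/stats"` would check the local stats endpoint instead.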


(pau freixes) #3

Yep, we changed the strategy to the URL proposed by @warkolm, _nodes/_local/stats/http, and we had the same issue: one node went down for an unknown reason and none of the other nodes could reply to the load balancer for a short period of time, between 10 and 20 seconds.

However, other regular requests, such as searches, continued working like a charm.

Could the URL _nodes/_local/stats/http involve some internal communication with all of the nodes in the cluster?

What do you recommend? Should we check at another network level, such as TCP? Is that trustworthy enough to infer that a node is up and healthy?
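
For reference, the TCP-level check I have in mind would look roughly like the Python sketch below. The name `port_is_open` and the defaults are just for illustration. Note that a successful connect only shows that something is accepting connections on the port; on its own it does not show that the node can actually serve requests.

```python
import socket


def port_is_open(host: str, port: int = 9200, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper: it only tests that the port accepts connections.
    """
    try:
        # create_connection raises OSError on refusal, timeout or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```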

