I noticed some strange behavior in the Elasticsearch (v1.5.2) cluster that we run at my company. For some of the nodes, the value of http.current_open (and process.open_file_descriptors) reported by Node Stats goes up over time until some limit (over 1k for http.current_open) is hit. When that happens, http.current_open drops to a more normal, single-digit value. Right after that, the problem reappears on different node(s).
It's best illustrated with a graph (we use the Elasticsearch StatsD plugin, pushing to Graphite):
First of all, I haven't found what http.current_open really stands for. I assume it is the number of open incoming TCP connections for the node's HTTP transport (listening on port 9200 by default). The problem is that these numbers don't match:
First I identify the node with the highest http.current_open:
```
... | jq -r '.host + " " + (.http.current_open|tostring)' | sort -rnk 2 | head -n 1
some.host.name 1254
```
Then I log in to that host and count these connections myself:
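One way to do such a count, as a minimal sketch only: it assumes `ss` (from iproute2) is available on the node and that the HTTP transport listens on the default port 9200. The helper name `count_http_conns` is hypothetical. It reads `ss -tn` output on stdin and counts only connections in the ESTAB state whose local address ends in the given port.

```shell
# Hypothetical helper: count ESTABLISHED TCP connections on a local port.
# Expects `ss -tn` output on stdin, where column 1 is the state and
# column 4 is the local address:port.
count_http_conns() {
  port="${1:-9200}"   # assumption: default Elasticsearch HTTP port
  awk -v port="$port" '
    # Match IPv4 (10.0.0.5:9200) and IPv6-mapped ([::ffff:10.0.0.5]:9200) forms.
    $1 == "ESTAB" && $4 ~ (":" port "$") { n++ }
    END { print n + 0 }
  '
}

# Usage on the node itself:
#   ss -tn | count_http_conns 9200
```

Sockets in other states (TIME-WAIT, CLOSE-WAIT and so on) are excluded here, so whether they are counted or not is one thing to check when comparing the result against http.current_open.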
Can someone please explain to me why these numbers don't match?
I'd also like to understand why it happens. Our apps access the cluster via HAProxy (using round-robin), if that changes anything. It's probably worth mentioning that the number of TCP connections reported by HAProxy doesn't match