(Tomasz Elendt) #1


I noticed strange behavior in the Elasticsearch (v1.5.2) cluster that we run at my company. On some nodes, the values of http.current_open (and process.open_file_descriptors) reported by Node Stats climb over time until they hit some limit (over 1k for http.current_open). When that happens, http.current_open drops to a more normal, single-digit value. Right after that, the problem reappears on a different node (or nodes).

It's best to illustrate it with a graph (we use Elasticsearch StatsD plugin with push to Graphite):

First of all, I haven't found out what http.current_open really stands for. I assume it is the number of open incoming TCP connections for the node's HTTP transport (listening on port 9200 by default). The problem is that these numbers don't match:

  1. First, I identify the node with the highest http.current_open:

    jq -r '.host + " " + (.http.current_open|tostring)' | sort -rnk 2 | head -n 1
    1254
  2. Then I log in to that host and count these connections myself:


Can someone please explain why these numbers don't match?
I'd also like to understand why this happens. Our apps access the cluster via HAProxy (using round-robin), if that changes anything. It's probably worth mentioning that the number of TCP connections reported by HAProxy doesn't match http.current_open either.
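
For anyone who wants to repeat the comparison from the two steps above, here is a rough Python sketch. It is only an illustration under assumptions: the node IDs, host names, and socket rows are made-up sample data (only the 1254 figure comes from this post), the stats payload is a trimmed guess at what Node Stats returns under `nodes.<id>.http`, and the socket table follows the Linux `/proc/net/tcp` layout. On a real host you would probably also need `/proc/net/tcp6`, since the JVM often listens on IPv6 sockets.

```python
import json

# Hypothetical, trimmed Node Stats payload; node IDs, hosts, and most counts
# are invented for illustration.
SAMPLE_STATS = json.loads("""
{
  "nodes": {
    "id-a": {"host": "es-node-1", "http": {"current_open": 1254, "total_opened": 90211}},
    "id-b": {"host": "es-node-2", "http": {"current_open": 7, "total_opened": 88102}}
  }
}
""")

# Hypothetical excerpt in the Linux /proc/net/tcp layout. Ports are hex:
# 23F0 is 9200 (HTTP) and 2454 is 9300. State 01 is ESTABLISHED, 0A is LISTEN.
SAMPLE_PROC_NET_TCP = """\
  sl  local_address rem_address   st tx_queue rx_queue
   0: 00000000:23F0 00000000:0000 0A 00000000:00000000
   1: 0A00000A:23F0 0A000014:C350 01 00000000:00000000
   2: 0A00000A:23F0 0A000014:C351 01 00000000:00000000
   3: 0A00000A:2454 0A000015:D2F0 01 00000000:00000000
"""


def busiest_node(stats):
    """Return (host, current_open) for the node with the most open HTTP connections."""
    pairs = [(n["host"], n["http"]["current_open"]) for n in stats["nodes"].values()]
    return max(pairs, key=lambda p: p[1])


def count_established(proc_net_tcp, port):
    """Count ESTABLISHED sockets whose local port equals `port`."""
    count = 0
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address is "<hex ip>:<hex port>"; rsplit also copes with tcp6 rows
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port == port and fields[3] == "01":
            count += 1
    return count


host, reported = busiest_node(SAMPLE_STATS)
observed = count_established(SAMPLE_PROC_NET_TCP, 9200)
print(host, reported, observed)
```

With real data, `reported` would come from the node's stats and `observed` from the socket table read on that same host at roughly the same time; the gap between the two is exactly what I'm asking about.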


(Tomasz Elendt) #2


I just wanted to follow up: we found out why http.current_open was piling up. It was caused by our custom ES plugin, which was blocking the HTTP server thread. As for http.current_open not matching the TCP connections on port 9200, my guess is that it's because of HTTP pipelining.
