Hi all,
We have been experiencing a slow degradation in performance in our cluster (ES v2.3.3). Looking at metrics and stats, one of the things I noticed is a constantly increasing number of total_opened HTTP connections. I know this symptom usually points to a client not using persistent/keep-alive connections. We use the official Ruby client with Puma and the Patron Faraday adapter, so on that end we should be covered for keep-alive (there is a simplified sketch of our client setup after the list below).

We are on AWS (using the cloud plugin, our own ES EC2 hosts) and we have an ELB in front of 2 client nodes as the entry point for our app. We suspected the ELB might not be honoring the keep-alive connections from the service, so we removed the ELB from the picture entirely and connected directly to one of the hosts, letting ES do the load balancing. To our surprise, total_opened kept increasing. We even shut down the Ruby service and the behavior continued, which points to some other variable opening HTTP connections improperly.

In case anybody has seen this behavior, here are a few details (maybe somebody went through this and can spot an obvious suspect):
- Monitoring: ping/deep-ping health checks sent by the ELB (although we also tested with direct access to the nodes and no ELB; I'm wondering whether AWS ELBs are known for any bad behavior here).
- Inter-node communication (we have 3 dedicated masters, 2 client nodes behind the ELB, and 5 bulkier data nodes). I believe this goes over the transport (TCP) layer rather than HTTP, so it shouldn't be a suspect at all, or everybody would be seeing the same thing.
- Plugins: we have Marvel, Kibana and Head installed. Maybe one of them rings a bell for someone?
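For context, our client wiring is roughly along these lines (simplified; the hostnames are placeholders and a couple of options may differ from what we actually run in production):

```ruby
require 'patron'
require 'elasticsearch'

# Simplified sketch of the client setup; hostnames are placeholders.
# The :patron Faraday adapter is what should give us persistent
# (keep-alive) connections on the client side.
client = Elasticsearch::Client.new(
  hosts: ['http://es-client-1:9200', 'http://es-client-2:9200'],
  adapter: :patron,
  retry_on_failure: 2,
  reload_connections: false
)

client.cluster.health
```

With Patron the underlying connections are kept alive and reused, so my expectation was that total_opened would level off once the connection pool warms up.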
I don't know where else to look. If anybody has suggestions on how to debug this, or on why the total_opened count would keep growing aside from the client keep-alive setup, I would greatly appreciate it. Restarting the cluster fixes the problem, but performance starts to slowly degrade again over time.
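For reference, this is roughly how I've been watching the counters per node while testing (a throwaway script, not our real monitoring; the endpoint address is a placeholder for one of our client nodes):

```ruby
require 'net/http'
require 'json'
require 'uri'

# Placeholder address: any node will report HTTP stats for the whole cluster.
# Note: each poll itself opens one new HTTP connection to the queried node.
ENDPOINT = URI('http://es-client-1:9200/_nodes/stats/http')

loop do
  stats = JSON.parse(Net::HTTP.get(ENDPOINT))
  stats['nodes'].each_value do |node|
    http = node['http']
    puts format('%-25s current_open=%-5d total_opened=%d',
                node['name'], http['current_open'], http['total_opened'])
  end
  puts '-' * 60
  sleep 60
end
```

The idea is that a node whose total_opened keeps climbing while current_open stays flat is receiving short-lived connections from something rather than reused ones, which is what we seem to be seeing.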
Cheers,
Rodrigo