Reusing existing HTTP connections

Hello everyone,

We're currently testing our new ES solution and have found that it isn't reusing existing connections. The HTTP total_opened value (GET _nodes/stats/http) is consistently increasing as we do our indexing/run queries. Current values for our 3 nodes are below:

node-1
"current_open": 1,
"total_opened": 462334
node-2
"current_open": 0,
"total_opened": 363
node-3
"current_open": 0,
"total_opened": 374

The current_open values move between 0 and 2 while total_opened increases. Also, node-1 looks to be receiving a much higher percentage of the traffic.

The documentation here states:
If you see a very large total_opened number that is constantly increasing, that is a sure sign that one of your HTTP clients is not using keep-alive connections. Persistent, keep-alive connections are important for performance, since building up and tearing down sockets is expensive (and wastes file descriptors). Make sure your clients are configured appropriately.

I've updated our index settings to include KeepAlive ("keepAlive": "true") but it's made no difference.

We're using the .Net NEST library, a StaticConnectionPool, ES version 2.3 on Windows Server 2012 machines. The client is running as a singleton instance.

Does anyone have any ideas on how we can force our solution to reuse existing connections? Or what the repercussions of this might be?

Thanks in advance

Hi Cara,

Thanks for opening this issue: the default max connections is set to 10k, which is a tad high for most cases. I've committed a change lowering this default, as well as an automated test to prove that NEST reuses connections, which it does.

You do not have to wait for a new 2.0 release though; you can fix this immediately by implementing your own HttpConnection subclass:

public class MyHttpConnection : HttpConnection
{
	protected override void AlterServicePoint(ServicePoint requestServicePoint, RequestData requestData)
	{
		// Cap the number of concurrent connections kept open to each node
		requestServicePoint.ConnectionLimit = 80;
	}
}

And then make your ConnectionSettings use that instead:

var settings = new ConnectionSettings(connectionPool, new MyHttpConnection());
var client = new ElasticClient(settings);

Can you share a bit more about your setup? We have a lot of tests making sure we round robin.

I'm particularly interested in seeing how you instantiate StaticConnectionPool and ConnectionSettings.

Thanks for this Martijn.

I've implemented this code but the total_opened count is still increasing. After the weekend the three nodes currently have counts of:
"total_opened": 748020
"total_opened": 1016793
"total_opened": 3538

Should we be alarmed by this? And is there any way to force the connections to close?

Some of our instantiation code is as follows:

var nodeUrls = GetNodeUrls();

if (nodeUrls == null || !nodeUrls.Any())
{
    throw new Exception("ElasticSearch Urls are not specified");
}

var pool = new StaticConnectionPool(nodeUrls);

var connectionSettings = new ConnectionSettings(pool, new xxxHttpConnection())
    .DisableDirectStreaming()
    .ThrowExceptions(true);

connectionSettings.DefaultIndex(defaultIndexName);
var elasticClient = new ElasticClient(connectionSettings);

Nothing particularly standout-ish. Let me know if you'd like to see any more.

Hi Cara, that all looks good code-wise.

The total_opened still seems high though, and I cannot quite reproduce it locally.

TCP keep alive is enabled by default: https://github.com/elastic/elasticsearch-net/blob/master/src/Elasticsearch.Net/Connection/HttpConnection.cs#L83

and the defaults for both keepAliveTime and keepAliveInterval are 2 seconds:

see also: https://msdn.microsoft.com/en-us/library/system.net.servicepointmanager.settcpkeepalive(v=vs.110).aspx
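As a rough sketch of what that means per request (illustrative only, not the exact NEST source; the helper name is made up for this example):

using System;
using System.Net;

// Rough sketch of what the linked HttpConnection code does per request
// (illustrative only, not the exact NEST source): HTTP keep-alive plus
// TCP keep-alive probes using the 2-second defaults mentioned above.
static class KeepAliveSketch
{
    public static HttpWebRequest CreateRequest(Uri uri)
    {
        var request = (HttpWebRequest)WebRequest.Create(uri);
        request.KeepAlive = true;                               // reuse the HTTP connection
        request.ServicePoint.SetTcpKeepAlive(true, 2000, 2000); // keepAliveTime / keepAliveInterval in ms
        return request;
    }
}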

Maybe the endpoints you are talking to require a more aggressive TCP keep-alive packet interval?

You can set this on ConnectionSettings using EnableTcpKeepAlive.
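For example, here is a minimal sketch assuming the TimeSpan-based overload of EnableTcpKeepAlive (check the exact signature for your NEST version); MyHttpConnection and nodeUrls come from the snippets above, and the values are illustrative only - tune them to what your endpoints need:

var pool = new StaticConnectionPool(nodeUrls);

var settings = new ConnectionSettings(pool, new MyHttpConnection())
    // send TCP keep-alive probes after 2 seconds of inactivity, then every 2 seconds
    .EnableTcpKeepAlive(TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(2));

var client = new ElasticClient(settings);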

Thanks Martijn. I've updated this so it's a configurable value in our solution. I'll try a few different keep alive intervals and let you know how it goes.

I've been watching the TCP connections on node-1 and recording them at intervals over the last few hours. The ESTABLISHED connections between our ES nodes seem to stay consistent (the established connections have the same ports throughout the intervals) so it looks like they are persisting.

What strikes me as a bit strange are the connections in a TIME_WAIT state. There are usually around 300. They all have a local address of the node-1 machine itself and an iterating port number. The foreign address is also always the node-1 machine, but on port 9200. Each time I've recorded these stats (even when less than 2 minutes apart) some of these connections have vanished (closed?) and new ones (still iterating through the port numbers) have appeared. It looks like something on the ES node is making requests to its own ES instance. The thing of note is that the count of additional connections in the TIME_WAIT state is consistent with the increase in the count of HTTP connections in the ES stats. Could something (Marvel?) be creating HTTP connections to the ES instance on the same machine and then not closing them for some reason?

The total_opened http connection count has now reached 1.3 million on node-1.

Mind posting your relevant Marvel (exporter) settings? Redact any credentials/real URLs of course :slight_smile:

Also, on Windows, netstat -aonb will tell you which process owns which socket; maybe there is some more information in there.

I've been using the netstat -aonb command to export TCP connections - all the TIME_WAIT connections have a PID of 0, so that's of no help. I've stopped Kibana on one of the machines and the total_opened count stopped increasing almost immediately.

Kibana settings are very basic - everything is commented out except:
server.port: 5601
server.host: "123.45.67.890"
elasticsearch.url: "http://123.45.67.890:9200"

We have no Marvel configs in our ES config file. Could this be the issue?

No, the default for Marvel is to do a local export, which does not use HTTP at all, so that rules out Marvel.

Is 123.45.67.890 node-1? That might explain why node-1 is becoming a hotspot connection-wise.

That's just a dummy value - it's not the real one

Good! The question still stands though :slight_smile: is the elasticsearch.url configured in Kibana pointing to node-1?

I turned Kibana off on node-2 as it's the one I need to monitor the least at the moment. All the nodes (including node-2) had an increase in total_opened that was far too high. Therefore the elasticsearch.url is set in the kibana.yml to point to node-2's ES, e.g.

elasticsearch.yml
node.name: node-2
network.host: 172.12.31.128

kibana.yml
server.port: 5601
server.host: "172.12.31.128"
elasticsearch.url: "http://172.12.31.128:9200"

Let me know if you need more settings.

Also, do you know if the total_opened value should drop? It isn't rising (now that Kibana is off) but it isn't dropping either.

No, total_opened will not drop; it's a simple counter.

Are you able to restart the cluster, turn off Kibana on all nodes, run your .NET application, and see if it behaves as intended?

That way the total_opened will reset and we can either rule out or focus on Kibana.

I've just turned Kibana off on node-1 too. Same result - the total_opened value isn't increasing anymore. The JVM Heap usage also dropped instantly to a satisfactory percentage (from 81% to 52%).

It's pretty clearly Kibana. There must be something configured incorrectly but I'm just not sure what. Or is this behaving as it should?

Hi Cara

I will ping one of my Kibana colleagues to see if something obvious pops up for them.

For posterity and future googlers I'd like to explain why NEST chose 10k as a default connection limit.

In .NET, WebRequests have a ServicePoint property assigned to them, and requests to the same host share the same ServicePoint. Through ServicePointManager you can set the max connections per ServicePoint (and, by proxy, per hostname).

HttpWebRequest will first try to reuse a socket that is already open before trying to allocate more:

https://referencesource.microsoft.com/#System/net/System/Net/_ConnectionGroup.cs,305
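As a rough illustration of how that per-host limit applies (a sketch of the plain BCL mechanics, not the NEST internals; the localhost node address is an assumption):

using System;
using System.Net;

// Rough illustration of the ServicePoint mechanics described above:
// all requests to the same host share one ServicePoint, and its
// ConnectionLimit caps the number of concurrent sockets to that host.
class ServicePointSketch
{
    static void Main()
    {
        // Process-wide default, applied to ServicePoints created afterwards.
        ServicePointManager.DefaultConnectionLimit = 80;

        // The ServicePoint that every HttpWebRequest to this host will share
        // (http://localhost:9200 is an assumed example node address).
        ServicePoint sp = ServicePointManager.FindServicePoint(new Uri("http://localhost:9200"));
        sp.ConnectionLimit = 80; // per-host override, same effect as AlterServicePoint above

        Console.WriteLine($"Connection limit for {sp.Address}: {sp.ConnectionLimit}");
    }
}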

In our performance testing we found that it was very hard to get the max concurrent connections to exceed 70-100 from a single box. So we even played with manually assigning ServicePoints from our own pool of ServicePoints to web requests, so that you could have 2 or 3 different ServicePoints per host.

While this did pump up the maximum number of concurrent open persistent connections to a node, it did nothing for overall read and write throughput from a single box.

The idea behind leaving it at 10k was that HttpClient already saturates to a sane true maximum of concurrently open connections, which might be different per machine.

I will leave in the commit that dialed this back to a smaller constant, so that it behaves more deterministically out of the box.

@Cara_B this sounds like a long-standing issue from the JavaScript client that Kibana uses. https://github.com/elastic/elasticsearch-js/issues/196

I'll try to verify the issue and get a fix out, but unfortunately it will take some time before it's available in Kibana (maybe we can get it into 5.1)
