We're currently testing our new ES solution and have found that it isn't reusing existing connections. The HTTP total_opened value (GET _nodes/stats/http) is consistently increasing as we do our indexing/run queries. Current values for our 3 nodes are below:
The current_open values move between 0 and 2 while total_opened increases. Also, node-1 looks to be receiving a much higher percentage of the traffic.
The docs here state: If you see a very large total_opened number that is constantly increasing, that is a sure sign that one of your HTTP clients is not using keep-alive connections. Persistent, keep-alive connections are important for performance, since building up and tearing down sockets is expensive (and wastes file descriptors). Make sure your clients are configured appropriately.
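For reference, this is roughly how we poll the stat between runs (a minimal sketch; the node URL is illustrative and we just diff the raw JSON by eye):

using System;
using System.IO;
using System.Net;

class HttpStatsProbe
{
    static void Main()
    {
        // GET _nodes/stats/http and dump the raw JSON; the "http" section for each
        // node contains the current_open and total_opened counters mentioned above.
        var request = (HttpWebRequest)WebRequest.Create("http://node-1:9200/_nodes/stats/http");
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}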
I've updated our index settings to include KeepAlive ("keepAlive": "true") but it's made no difference.
We're using the .Net NEST library, a StaticConnectionPool, ES version 2.3 on Windows Server 2012 machines. The client is running as a singleton instance.
Does anyone have any ideas on how we can force our solution to reuse existing connections, or what the repercussions of this might be?
Thanks for opening this issue: the default max connections is set to 10k, which is a tad high for most cases. I've committed a change lowering this default, as well as an automated test to prove that NEST reuses connections, which it does.
You do not have to wait for a new 2.0 release, though; you can fix this immediately by implementing your own HttpConnection subclass:
public class MyHttpConnection : HttpConnection
{
    protected override void AlterServicePoint(ServicePoint requestServicePoint, RequestData requestData)
    {
        // Cap the number of persistent connections kept open per node.
        requestServicePoint.ConnectionLimit = 80;
    }
}
And then make your ConnectionSettings use that instead:
var settings = new ConnectionSettings(connectionPool, new MyHttpConnection());
var client = new ElasticClient(settings);
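If you'd rather not subclass, you can probably also set a process-wide default through ServicePointManager before the client makes its first request; note this affects every host the process talks to, not just Elasticsearch:

// Process-wide alternative: applies to ServicePoints created after this point,
// so it must run before the first request; the subclass above is the more targeted fix.
System.Net.ServicePointManager.DefaultConnectionLimit = 80;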
Can you share a bit more about your setup? We have a lot of tests making sure we round robin.
I'm particularly interested in seeing how you instantiate StaticConnectionPool and ConnectionSettings.
I've implemented this code but the total_opened count is still increasing. After the weekend the three nodes currently have counts of:
"total_opened": 748020
"total_opened": 1016793
"total_opened": 3538
Should we be alarmed by this? And is there any way to force the connections to close?
Some of our instantiation code is as follows:
var nodeUrls = GetNodeUrls();
if (nodeUrls == null || !nodeUrls.Any())
{
    throw new Exception("ElasticSearch Urls are not specified");
}

var pool = new StaticConnectionPool(nodeUrls);
var connectionSettings = new ConnectionSettings(pool, new xxxHttpConnection())
    .DisableDirectStreaming()
    .ThrowExceptions(true);
connectionSettings.DefaultIndex(defaultIndexName);
var elasticClient = new ElasticClient(connectionSettings);
Nothing particularly out of the ordinary. Let me know if you'd like to see any more.
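For completeness, the client is held as a singleton roughly like this (a sketch, not our exact code; the holder class name and index name are made up):

using System;
using Elasticsearch.Net;
using Nest;

public static class SearchClientHolder
{
    // One shared ElasticClient per process; the client is thread-safe,
    // so a single instance is all we need.
    private static readonly Lazy<ElasticClient> _instance =
        new Lazy<ElasticClient>(BuildClient);

    public static ElasticClient Instance => _instance.Value;

    private static ElasticClient BuildClient()
    {
        // Same construction as shown above, wrapped so it only runs once.
        var nodeUrls = new[] { new Uri("http://node-1:9200") }; // placeholder URLs
        var pool = new StaticConnectionPool(nodeUrls);
        var settings = new ConnectionSettings(pool, new xxxHttpConnection())
            .DisableDirectStreaming()
            .ThrowExceptions(true)
            .DefaultIndex("default-index");
        return new ElasticClient(settings);
    }
}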
Thanks, Martijn. I've updated this so it's a configurable value in our solution. I'll try a few different keep-alive intervals and let you know how it goes.
I've been watching the TCP connections on node-1 and recording them at intervals over the last few hours. The ESTABLISHED connections between our ES nodes seem to stay consistent (the established connections have the same ports throughout the intervals) so it looks like they are persisting.
What strikes me as a bit strange are the connections in a TIME_WAIT state. There are usually around 300. They all have a local address of the node-1 machine itself with an incrementing port number, and the foreign address is also always the node-1 machine, but on port 9200. Each time I've recorded these stats (even when less than 2 minutes apart) some of these connections have vanished (closed?) and new ones (still incrementing through the port numbers) have appeared. It looks like something on the ES node is making requests to its own ES instance. The thing of note is that the count of additional connections in the TIME_WAIT state is consistent with the increase in the count of HTTP connections in the ES stats. Could something (Marvel?) be creating HTTP connections to the ES instance on the same machine and then not closing them for some reason?
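In case it helps anyone else, the same check can be scripted from .NET rather than eyeballing netstat output (a quick sketch; 9200 is the ES HTTP port as in our setup):

using System;
using System.Linq;
using System.Net.NetworkInformation;

class TimeWaitCounter
{
    static void Main()
    {
        var connections = IPGlobalProperties.GetIPGlobalProperties().GetActiveTcpConnections();

        // Count sockets sitting in TIME_WAIT whose remote side is the local ES HTTP port.
        var timeWaitToEs = connections.Count(c =>
            c.State == TcpState.TimeWait && c.RemoteEndPoint.Port == 9200);

        Console.WriteLine("TIME_WAIT connections to :9200 -> " + timeWaitToEs);
    }
}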
The total_opened http connection count has now reached 1.3 million on node-1.
I've been using the netstat -aonb command to export TCP connections - all the TIME_WAIT connections have a PID of 0, so that's of no help. I've stopped Kibana on one of the machines and the total_opened count stopped increasing almost immediately.
Kibana settings are very basic - everything is commented out except:
server.port: 5601
server.host: "123.45.67.890"
elasticsearch.url: "http://123.45.67.890:9200"
We have no Marvel configs in our ES config file. Could this be the issue?
I turned Kibana off on node-2 as it's the one I need to monitor the least at the moment. All the nodes (including node-2) had an increase in total_opened that was far too high. The elasticsearch.url in the kibana.yml is therefore set to point to the node-2 ES instance, e.g.
I've just turned Kibana off on node-1 too. Same result - the total_opened value isn't increasing anymore. The JVM Heap usage also dropped instantly to a satisfactory percentage (from 81% to 52%).
It's pretty clearly Kibana. There must be something configured incorrectly but I'm just not sure what. Or is this behaving as it should?
I will ping one of my Kibana colleagues to see if something obvious pops up for them.
For posterity and future googlers I'd like to explain why NEST chose 10k as a default connection limit.
In .NET, WebRequests have a ServicePoint property assigned to them, and requests to the same host share the same ServicePoint. Through ServicePointManager you can set the max connections per ServicePoint (and therefore, by proxy, per hostname).
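Concretely, you can see this mechanism outside of NEST entirely (a small illustration; the node URL is made up):

using System;
using System.Net;

class ServicePointDemo
{
    static void Main()
    {
        // Every request to the same scheme/host/port shares this one ServicePoint.
        var servicePoint = ServicePointManager.FindServicePoint(new Uri("http://node-1:9200"));

        // ConnectionLimit is the value AlterServicePoint overrides above;
        // CurrentConnections shows how many sockets are actually open right now.
        Console.WriteLine("Limit: " + servicePoint.ConnectionLimit);
        Console.WriteLine("Open:  " + servicePoint.CurrentConnections);

        // Raising or lowering the limit here has the same per-host effect.
        servicePoint.ConnectionLimit = 80;
    }
}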
HttpWebRequest will first try to reuse a socket that is already open before allocating more.
In our performance testing we found it very hard to get the number of concurrent connections to exceed 70-100 from a single box. We even played with manually assigning ServicePoints from our own pool to web requests, so that you could have 2 or 3 different ServicePoints per host.
While this did push up the maximum number of concurrently open persistent connections to a node, it did nothing for overall read and write throughput from a single box.
The idea behind leaving it at 10k was that HttpClient already saturates at a sane true maximum of concurrently open connections, which might be different per machine.
I'll keep the commit that dialed this back to a smaller constant, though, so that it behaves more deterministically out of the box.
I'll try to verify the issue and get a fix out, but unfortunately it will take some time before it's available in Kibana (maybe we can get it into 5.1)