Hello!
The PHP application is running on different machines. We will gist the jstack output as soon as the CLOSE_WAIT situation happens again.
--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch
Another question: are the PHP clients running on the same nodes as ElasticSearch, or on different nodes? (I assume on different nodes, just want to make sure.)
I have a suspicion that maybe netty, the networking library we use, does not get around to actually closing the connections because of the load on the system. I'm trying to chase that one down with Trustin. When this happens, can you gist a thread dump (jstack), just so we have it?
But, as suggested, the best thing to do is to use keep alive. nginx, by the way, can be used to abstract that nicely, as Karmi found out; see here: https://gist.github.com/0a2b0e0df83813a4045f/d237b1f2425353edc3b13c593ee2f960dfae0fca.
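For reference, on the PHP side "keep alive" mostly comes down to reusing the same cURL handle across requests instead of opening and tearing down a connection per query. A rough sketch only (the host, index, and $queries array are placeholders, not anything from the actual app):

<?php
// Sketch only: reusing one cURL handle lets libcurl keep the TCP connection
// to ES open (HTTP keep-alive) instead of opening a new one per request.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://11.11.11.11:9200/index/_search');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
foreach ($queries as $query) {            // $queries: JSON query strings
    curl_setopt($ch, CURLOPT_POSTFIELDS, $query);
    $response = curl_exec($ch);           // same handle => same connection
    // ... handle $response ...
}
curl_close($ch);                          // the connection is closed once, here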
On Thu, Jun 28, 2012 at 9:38 AM, Rafał Kuć <r.kuc@solr.pl> wrote:
Hello!
Right now, after a few hours, the number of CLOSE_WAIT connections is > 20k on all nodes in the cluster, which makes ElasticSearch unresponsive to API calls on port 9200. However, clients using the Java API are still working without a problem.
--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch
Hi Shay,
This is what we are using now:
network:
  tcp:
    keep_alive: true      # also tried setting to false
    timeout: 5s           # this didn't help
threadpool:
  search:
    type: fixed           # also tried blocking
    size: 40
    queue_size: 10
    reject_policy: abort  # also tried client
And ES sees this:
$ curl --silent '11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true' | egrep 'address|current|curr|server_open'
"transport_address" : "inet[/<a style=" font-family:'courier new'; font-size: 9pt;" href="http://11.11.11.11:9300">11.11.11.11:9300</a>]",
"curr_estab" : 764,
"server_open" : 360,
"current_open" : 1053, <== this is approximately the number of CLOSE_WAITs as shown by netstat -T
And this is where we see the threadpool numbers:
$ curl --silent '11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true' | egrep -C 4 '"search"'
"search" : {
"threads" : 40,
"queue" : 0,
"active" : 6
The above is actually from a production environment we are trying to stabilize.
So, because of the limit of only 40 threads, we do see rejections:
[2012-06-27 21:23:51,541][WARN ][search.action ] [search 2] Failed to send release search context
org.elasticsearch.transport.RemoteTransportException: [search 4][inet[/11.11.11.11:9300]][search/freeContext]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:33)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
But CLOSE_WAITs are still piling up.
The tricky thing is that under a certain load (< 1200 QPS) CLOSE_WAITs are nowhere to be found. The number of ESTABLISHED connections is more or less constant and the number of CLOSE_WAITs is 0.
But a few minutes after we increase the load to > 1200 QPS we start seeing CLOSE_WAITs and they just keep increasing.
Actually, now that I look at things, I see that even the number of ESTABLISHED connections starts to grow at some point, too. Not as fast as CLOSE_WAITs, but growing.
Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html
On Wednesday, June 27, 2012 6:37:49 PM UTC-4, Rafał Kuć wrote:
Hello!
We tried different settings on the nodes - one of them was a reject policy of abort with sizes of 500 and 120, which resulted in CLOSE_WAITs under a certain load. Right now I've configured all the nodes in the cluster with 'reject_policy: abort'.
As for the tests, I think it won't be a problem, but first, let's see how ElasticSearch behaves with the current reject policy.
Ah, just to clarify things: we did try 0.19.3 and 0.19.4, and now we are running 0.19.7.
--
Thanks,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch
- Based on Otis' first post, he was using a reject_policy of abort. Can you clarify which reject_policy was used in the test with no keep alive that resulted in many CLOSE_WAITs?
- If CLOSE_WAIT is still a problem with a reject policy of abort, can you run the test without keep alive and try the two options I asked for?
On Thu, Jun 28, 2012 at 12:11 AM, Rafał Kuć <r.kuc@solr.pl> wrote:
Hello!
We didn't see any CLOSE_WAITs while we were doing performance testing with keep alive. I'll change the reject policy to abort and see how that goes.
--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch
First, can you make sure you ran your test with a reject policy of abort for the thread pool?
Second, can you try two things:
- After you stop the load test, do you still have CLOSE_WAIT?
- If you run a single "client" load test, do you see CLOSE_WAIT?
-shay.banon
On Wed, Jun 27, 2012 at 11:12 PM, Otis Gospodnetic <otis.gospodnetic@gmail.com> wrote:
Hi Paul,
On Tuesday, June 26, 2012 2:50:19 AM UTC-4, Paul Brown wrote:
Hi, Otis --
The wikipedia article on TCP has a state chart that may be helpful:
http://en.wikipedia.org/wiki/Transmission_Control_Protocol
CLOSE_WAIT essentially means that the PHP app (libcurl of some sort, I assume) hasn't done a full job of closing the connection, e.g., closing the TCP connection but not the underlying socket, so that's where I'd look. For example, the option CURLOPT_FORBID_REUSE might be useful.
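(Purely as an illustration of that suggestion, not something taken from the actual PHP app: CURLOPT_FORBID_REUSE tells libcurl to close the connection explicitly once the request completes, rather than keeping it cached for reuse.)

<?php
// Illustration only: force libcurl to fully close the connection after each
// request instead of caching it for reuse. Host and index are placeholders.
$ch = curl_init('http://11.11.11.11:9200/index/_search');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FORBID_REUSE, true);   // close the connection when done
curl_setopt($ch, CURLOPT_FRESH_CONNECT, true);  // and don't reuse a cached one
$response = curl_exec($ch);
curl_close($ch);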
Is that really so?
I looked at this diagram: http://en.wikipedia.org/wiki/File:TCP_CLOSE.svg
If I read that correctly, it looks like CLOSE_WAIT happens when the client (left side), which initiated the connection, issues a FIN, which I understand as the client saying "I want to close this connection". After the server/receiver receives that FIN, it goes into the CLOSE_WAIT state; at that point it is supposed to answer by sending an ACK and then (after some time?) its own FIN, going into the LAST_ACK state and then, after the client responds with an ACK, into the CLOSED state.
So if the server side is in CLOSE_WAIT, doesn't that mean that the server received (and ACKed) the FIN, but has not yet closed its own side of the connection by sending its FIN back to the client?
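To make that concrete, here is a minimal toy sketch (nothing from the actual setup; the port is arbitrary) of how a socket ends up sitting in CLOSE_WAIT on the side that received the FIN:

<?php
// Toy illustration: the side that received the FIN stays in CLOSE_WAIT until
// its application actually closes the socket.
$server = stream_socket_server('tcp://127.0.0.1:9999', $errno, $errstr);
$conn   = stream_socket_accept($server);  // a client connects...
// ...the client now disconnects (sends FIN); the kernel ACKs it for us.
sleep(60);       // while we "forget" to close $conn, netstat shows CLOSE_WAIT
fclose($conn);   // only now does the socket leave CLOSE_WAIT
fclose($server);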
Thanks,
Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html
On Jun 25, 2012, at 8:26 PM, Otis Gospodnetic wrote:
> Hello,
>
> I've been fighting with ES, which stops working after it gets hit for several minutes with about 1200 QPS. This is happening with ES 0.19.3 and 0.19.4 on big boxes (24 cores, 96 GB RAM).
>
> What seems to be happening is that after a while we start seeing more and more CLOSE_WAIT connections between the search clients (~500 frontend PHP apps) and ES, like this:
>
> $ netstat -T | head
> Active Internet connections (w/o servers)
> Proto Recv-Q Send-Q Local Address Foreign Address State
> tcp 325 0 11.11.11.11-static.reverse.softlayer.com:wap-wsp 184.184.184.184-static.reverse.softlayer.com:32035 CLOSE_WAIT
> ...
> ...
>
> The number of CLOSE_WAIT connections goes from being 0 for a while to going up into thousands. And then at some point ES stops responding on port 9200 (but still responds on port 9300).
>
> The number of these CLOSE_WAIT connections seems to roughly correspond to the "current_open" HTTP metric:
>
> $ curl --silent '11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true' | egrep 'address|current|curr|server_open'
> "transport_address" : "inet[/ 11.11.11.11 :9300]",
> "curr_estab" : 87,
> "server_open" : 36,
> "current_open" : 7, <== healthy, just restarted
> "transport_address" : "inet[/ 22.22.22.22 :9300]",
> "curr_estab" : 1245,
> "server_open" : 612,
> "current_open" : 14, <== healthy, just restarted
> "transport_address" : "inet[/ 33.33.33.33:9300]",
> "curr_estab" : 93,
> "server_open" : 36,
> "current_open" : 14, <== healthy, just restarted
> "transport_address" : "inet[/ 44.44.44.44 :9300]",
> "curr_estab" : 171,
> "server_open" : 36,
> "current_open" : 15776, <== baaad, not restarted
>
> I tried using the threadpool settings (both fixed and blocking, with both abort and client rejection policies) to try to stop this "current_open" from growing, e.g.:
>
> threadpool:
>   search:
>     type: fixed
>     size: 120
>     queue_size: 100
>     reject_policy: abort
>
> But that didn't help.
>
> I should say that the search apps hitting this ES cluster are not using persistent/keep-alive connections. And while this is clearly not ideal and not efficient, I think it still shouldn't cause this "leak" that ends up accumulating connections in the CLOSE_WAIT state and eventually causing ES to stop being responsive.
>
> Is there anything one can do on the ES side to more aggressively close connections?
>
> Thanks,
> Otis
> --
> Search Analytics - http://sematext.com/search-analytics/index.html
> Scalable Performance Monitoring - http://sematext.com/spm/index.html
>