Increasing CLOSE_WAIT connections and HTTP current_open metric


(Otis Gospodnetić) #1

Hello,

I've been fighting with ES that stops working after it gets hit for several
minutes by about 1200 QPS. This is happening with ES 0.19.3 and 0.19.4 on
big boxes (24 cores, 96 GB RAM).

What seems to be happening is that after a while we start seeing more and
more and more CLOSE_WAIT connections between search clients (~500 frontend
PHP apps) and ES, like this:

$ netstat -T | head
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address
State
tcp 325 0 11.11.11.11-static.reverse.softlayer.com:wap-wsp
184.184.184.184-static.reverse.softlayer.com:32035 CLOSE_WAIT
...
...

The number of CLOSE_WAIT connections goes from being 0 for a while to going
up into thousands. And then at some point ES stops responding on port 9200
(but still responds on port 9300).

The number of these CLOSE_WAIT connections seems to roughly correspond to
the "current_open" HTTP metric:

$ curl --silent
11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep 'address|current|curr|server_open'
"transport_address" : "inet[/ 11.11.11.11 :9300]",
"curr_estab" : 87,
"server_open" : 36,
"current_open" : 7,
<== healthy, just restarted
"transport_address" : "inet[/ 22.22.22.22 :9300]",
"curr_estab" : 1245,
"server_open" : 612,
"current_open" : 14,
<== healthy, just restarted
"transport_address" : "inet[/ 33.33.33.33:9300]",
"curr_estab" : 93,
"server_open" : 36,
"current_open" : 14,
<== healthy, just restarted
"transport_address" : "inet[/ 44.44.44.44 :9300]",
"curr_estab" : 171,
"server_open" : 36,
"current_open" : 15776,
<== baaad, not restarted

I tried using using threadpool (both fixed and blocking with both abort and
client rejection policies) to try stopping this "current_open" from
growing, e.g.:

threadpool:
search:
type: fixed
size: 120
queue_size: 100
reject_policy: abort

But that didn't help.

I should say that the search apps hitting this ES cluster are not using
persistent/keep-alive connections. And while this is clearly not ideal and
not efficient, I think it still shouldn't cause this "leak" that ends up
accumulating connections in CLOSE_WAIT state and eventually getting ES to
stop being responsive.

Is there anything one can do on the ES side to more aggressively close
connections?

Thanks,
Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html


(Paul Brown) #2

Hi, Otis --

The wikipedia article on TCP has a state chart that may be helpful:

http://en.wikipedia.org/wiki/Transmission_Control_Protocol

CLOSE_WAIT essentially means that the PHP app (libcurl of some sort, I assume) hasn't done a full job of closing the connection, e.g., closing the TCP connection but not the underlying socket, so that's where I'd look. For example, the option CURLOPT_FORBID_REUSE might be useful.

-- Paul

On Jun 25, 2012, at 8:26 PM, Otis Gospodnetic wrote:

Hello,

I've been fighting with ES that stops working after it gets hit for several minutes by about 1200 QPS. This is happening with ES 0.19.3 and 0.19.4 on big boxes (24 cores, 96 GB RAM).

What seems to be happening is that after a while we start seeing more and more and more CLOSE_WAIT connections between search clients (~500 frontend PHP apps) and ES, like this:

$ netstat -T | head
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 325 0 11.11.11.11-static.reverse.softlayer.com:wap-wsp 184.184.184.184-static.reverse.softlayer.com:32035 CLOSE_WAIT
...
...

The number of CLOSE_WAIT connections goes from being 0 for a while to going up into thousands. And then at some point ES stops responding on port 9200 (but still responds on port 9300).

The number of these CLOSE_WAIT connections seems to roughly correspond to the "current_open" HTTP metric:

$ curl --silent 11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true' | egrep 'address|current|curr|server_open'
"transport_address" : "inet[/ 11.11.11.11 :9300]",
"curr_estab" : 87,
"server_open" : 36,
"current_open" : 7, <== healthy, just restarted
"transport_address" : "inet[/ 22.22.22.22 :9300]",
"curr_estab" : 1245,
"server_open" : 612,
"current_open" : 14, <== healthy, just restarted
"transport_address" : "inet[/ 33.33.33.33:9300]",
"curr_estab" : 93,
"server_open" : 36,
"current_open" : 14, <== healthy, just restarted
"transport_address" : "inet[/ 44.44.44.44 :9300]",
"curr_estab" : 171,
"server_open" : 36,
"current_open" : 15776, <== baaad, not restarted

I tried using using threadpool (both fixed and blocking with both abort and client rejection policies) to try stopping this "current_open" from growing, e.g.:

threadpool:
search:
type: fixed
size: 120
queue_size: 100
reject_policy: abort

But that didn't help.

I should say that the search apps hitting this ES cluster are not using persistent/keep-alive connections. And while this is clearly not ideal and not efficient, I think it still shouldn't cause this "leak" that ends up accumulating connections in CLOSE_WAIT state and eventually getting ES to stop being responsive.

Is there anything one can do on the ES side to more aggressively close connections?

Thanks,
Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html


(Otis Gospodnetić) #3

Hi Paul,

On Tuesday, June 26, 2012 2:50:19 AM UTC-4, Paul Brown wrote:

Hi, Otis --

The wikipedia article on TCP has a state chart that may be helpful:

http://en.wikipedia.org/wiki/Transmission_Control_Protocol

CLOSE_WAIT essentially means that the PHP app (libcurl of some sort, I
assume) hasn't done a full job of closing the connection, e.g., closing the
TCP connection but not the underlying socket, so that's where I'd look.
For example, the option CURLOPT_FORBID_REUSE might be useful.

Is that really so?
I looked at this diagram: http://en.wikipedia.org/wiki/File:TCP_CLOSE.svg
If I read that correctly, it looks like CLOSE_WAIT happens when the client
(left side) which initiated the connection issues a FIN, which I understand
as the client saying "I want to close this connection". After that FIN is
received by the server/receiver, that server/receiver goes into the
CLOSE_WAIT state and at that time it is supposed to answer by sending the
ACK and then (after some time?) by sending FIN going going into LAST_ACK
state and then, after client responds with ACK, into CLOSE state.

So if the server side is in CLOSE_WAIT, doesn't that mean that the server
received a FIN, but did not send ACK back to client?

Thanks,
Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html

On Jun 25, 2012, at 8:26 PM, Otis Gospodnetic wrote:

Hello,

I've been fighting with ES that stops working after it gets hit for
several minutes by about 1200 QPS. This is happening with ES 0.19.3 and
0.19.4 on big boxes (24 cores, 96 GB RAM).

What seems to be happening is that after a while we start seeing more
and more and more CLOSE_WAIT connections between search clients (~500
frontend PHP apps) and ES, like this:

$ netstat -T | head
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address
State
tcp 325 0 11.11.11.11-static.reverse.softlayer.com:wap-wsp
184.184.184.184-static.reverse.softlayer.com:32035 CLOSE_WAIT
...
...

The number of CLOSE_WAIT connections goes from being 0 for a while to
going up into thousands. And then at some point ES stops responding on
port 9200 (but still responds on port 9300).

The number of these CLOSE_WAIT connections seems to roughly correspond
to the "current_open" HTTP metric:

$ curl --silent
11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep 'address|current|curr|server_open'
"transport_address" : "inet[/ 11.11.11.11 :9300]",
"curr_estab" : 87,
"server_open" : 36,
"current_open" : 7,
<== healthy, just restarted
"transport_address" : "inet[/ 22.22.22.22 :9300]",
"curr_estab" : 1245,
"server_open" : 612,
"current_open" : 14,
<== healthy, just restarted
"transport_address" : "inet[/ 33.33.33.33:9300]",
"curr_estab" : 93,
"server_open" : 36,
"current_open" : 14,
<== healthy, just restarted
"transport_address" : "inet[/ 44.44.44.44 :9300]",
"curr_estab" : 171,
"server_open" : 36,
"current_open" : 15776,
<== baaad, not restarted

I tried using using threadpool (both fixed and blocking with both abort
and client rejection policies) to try stopping this "current_open" from
growing, e.g.:

threadpool:
search:
type: fixed
size: 120
queue_size: 100
reject_policy: abort

But that didn't help.

I should say that the search apps hitting this ES cluster are not using
persistent/keep-alive connections. And while this is clearly not ideal and
not efficient, I think it still shouldn't cause this "leak" that ends up
accumulating connections in CLOSE_WAIT state and eventually getting ES to
stop being responsive.

Is there anything one can do on the ES side to more aggressively close
connections?

Thanks,
Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html


(Shay Banon) #4

First, can you make sure you ran your test with reject policy of abort for
the thread pool?
Second, Can you try two things:

  1. After you stop the load test, do you still have CLOSE_WAIT?
  2. If you run a single "client" load test, do you see CLOSE_WAIT?

-shay.banon

On Wed, Jun 27, 2012 at 11:12 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

Hi Paul,

On Tuesday, June 26, 2012 2:50:19 AM UTC-4, Paul Brown wrote:

Hi, Otis --

The wikipedia article on TCP has a state chart that may be helpful:

http://en.wikipedia.org/wiki/**Transmission_Control_Protocolhttp://en.wikipedia.org/wiki/Transmission_Control_Protocol

CLOSE_WAIT essentially means that the PHP app (libcurl of some sort, I
assume) hasn't done a full job of closing the connection, e.g., closing the
TCP connection but not the underlying socket, so that's where I'd look.
For example, the option CURLOPT_FORBID_REUSE might be useful.

Is that really so?
I looked at this diagram: http://en.wikipedia.org/wiki/File:TCP_CLOSE.svg
If I read that correctly, it looks like CLOSE_WAIT happens when the client
(left side) which initiated the connection issues a FIN, which I understand
as the client saying "I want to close this connection". After that FIN is
received by the server/receiver, that server/receiver goes into the
CLOSE_WAIT state and at that time it is supposed to answer by sending the
ACK and then (after some time?) by sending FIN going going into LAST_ACK
state and then, after client responds with ACK, into CLOSE state.

So if the server side is in CLOSE_WAIT, doesn't that mean that the server
received a FIN, but did not send ACK back to client?

Thanks,
Otis

Search Analytics - http://sematext.com/search-**analytics/index.htmlhttp://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.**htmlhttp://sematext.com/spm/index.html

On Jun 25, 2012, at 8:26 PM, Otis Gospodnetic wrote:

Hello,

I've been fighting with ES that stops working after it gets hit for
several minutes by about 1200 QPS. This is happening with ES 0.19.3 and
0.19.4 on big boxes (24 cores, 96 GB RAM).

What seems to be happening is that after a while we start seeing more
and more and more CLOSE_WAIT connections between search clients (~500
frontend PHP apps) and ES, like this:

$ netstat -T | head
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address
State
tcp 325 0 11.11.11.11-static.reverse.**softlayer.com:wap-wsp
184.184.184.184-static.**reverse.softlayer.com:32035http://184.184.184.184-static.reverse.softlayer.com:32035CLOSE_WAIT
...
...

The number of CLOSE_WAIT connections goes from being 0 for a while to
going up into thousands. And then at some point ES stops responding on
port 9200 (but still responds on port 9300).

The number of these CLOSE_WAIT connections seems to roughly
correspond to the "current_open" HTTP metric:

$ curl --silent 11.11.11.11:9200/cluster/nodes/stats?network=true&
transport=true&http=true&**thread_pool=true&indices=**false&pretty=truehttp://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep 'address|current|curr|server
**open'
"transport_address" : "inet[/ 11.11.11.11 :9300]",
"curr_estab" : 87,
"server_open" : 36,
"current_open" : 7,
<== healthy, just restarted
"transport_address" : "inet[/ 22.22.22.22 :9300]",
"curr_estab" : 1245,
"server_open" : 612,
"current_open" : 14,
<== healthy, just restarted
"transport_address" : "inet[/ 33.33.33.33:9300]",
"curr_estab" : 93,
"server_open" : 36,
"current_open" : 14,
<== healthy, just restarted
"transport_address" : "inet[/ 44.44.44.44 :9300]",
"curr_estab" : 171,
"server_open" : 36,
"current_open" : 15776,
<== baaad, not restarted

I tried using using threadpool (both fixed and blocking with both abort
and client rejection policies) to try stopping this "current_open" from
growing, e.g.:

threadpool:
search:
type: fixed
size: 120
queue_size: 100
reject_policy: abort

But that didn't help.

I should say that the search apps hitting this ES cluster are not using
persistent/keep-alive connections. And while this is clearly not ideal and
not efficient, I think it still shouldn't cause this "leak" that ends up
accumulating connections in CLOSE_WAIT state and eventually getting ES to
stop being responsive.

Is there anything one can do on the ES side to more aggressively close
connections?

Thanks,
Otis

Search Analytics - http://sematext.com/search-**analytics/index.htmlhttp://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.**htmlhttp://sematext.com/spm/index.html


(Rafał Kuć) #5

Hello!

We didn't see any CLOSE_WAIT's while we were doing performance testing using keep alive. I'll change reject policy to abort and will see how that goes.

--

Regards,

Rafał Kuć

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

First, can you make sure you ran your test with reject policy of abort for the thread pool?

Second, Can you try two things:

  1. After you stop the load test, do you still have CLOSE_WAIT?

  2. If you run a single "client" load test, do you see CLOSE_WAIT?

-shay.banon

On Wed, Jun 27, 2012 at 11:12 PM, Otis Gospodnetic <otis.gospodnetic@gmail.com> wrote:

Hi Paul,

On Tuesday, June 26, 2012 2:50:19 AM UTC-4, Paul Brown wrote:

Hi, Otis --

The wikipedia article on TCP has a state chart that may be helpful:

http://en.wikipedia.org/wiki/Transmission_Control_Protocol

CLOSE_WAIT essentially means that the PHP app (libcurl of some sort, I assume) hasn't done a full job of closing the connection, e.g., closing the TCP connection but not the underlying socket, so that's where I'd look. For example, the option CURLOPT_FORBID_REUSE might be useful.

Is that really so?

I looked at this diagram: http://en.wikipedia.org/wiki/File:TCP_CLOSE.svg

If I read that correctly, it looks like CLOSE_WAIT happens when the client (left side) which initiated the connection issues a FIN, which I understand as the client saying "I want to close this connection". After that FIN is received by the server/receiver, that server/receiver goes into the CLOSE_WAIT state and at that time it is supposed to answer by sending the ACK and then (after some time?) by sending FIN going going into LAST_ACK state and then, after client responds with ACK, into CLOSE state.

So if the server side is in CLOSE_WAIT, doesn't that mean that the server received a FIN, but did not send ACK back to client?

Thanks,

Otis

--

Search Analytics - http://sematext.com/search-analytics/index.html

Scalable Performance Monitoring - http://sematext.com/spm/index.html

On Jun 25, 2012, at 8:26 PM, Otis Gospodnetic wrote:

> Hello,

>

> I've been fighting with ES that stops working after it gets hit for several minutes by about 1200 QPS. This is happening with ES 0.19.3 and 0.19.4 on big boxes (24 cores, 96 GB RAM).

>

> What seems to be happening is that after a while we start seeing more and more and more CLOSE_WAIT connections between search clients (~500 frontend PHP apps) and ES, like this:

>

> $ netstat -T | head

> Active Internet connections (w/o servers)

> Proto Recv-Q Send-Q Local Address Foreign Address State

> tcp 325 0 11.11.11.11-static.reverse.softlayer.com:wap-wsp 184.184.184.184-static.reverse.softlayer.com:32035 CLOSE_WAIT

> ...

> ...

>

> The number of CLOSE_WAIT connections goes from being 0 for a while to going up into thousands. And then at some point ES stops responding on port 9200 (but still responds on port 9300).

>

> The number of these CLOSE_WAIT connections seems to roughly correspond to the "current_open" HTTP metric:

>

> $ curl --silent 11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true' | egrep 'address|current|curr|server_open'

> "transport_address" : "inet[/ 11.11.11.11 :9300]",

> "curr_estab" : 87,

> "server_open" : 36,

> "current_open" : 7, <== healthy, just restarted

> "transport_address" : "inet[/ 22.22.22.22 :9300]",

> "curr_estab" : 1245,

> "server_open" : 612,

> "current_open" : 14, <== healthy, just restarted

> "transport_address" : "inet[/ 33.33.33.33:9300]",

> "curr_estab" : 93,

> "server_open" : 36,

> "current_open" : 14, <== healthy, just restarted

> "transport_address" : "inet[/ 44.44.44.44 :9300]",

> "curr_estab" : 171,

> "server_open" : 36,

> "current_open" : 15776, <== baaad, not restarted

>

> I tried using using threadpool (both fixed and blocking with both abort and client rejection policies) to try stopping this "current_open" from growing, e.g.:

>

> threadpool:

> search:

> type: fixed

> size: 120

> queue_size: 100

> reject_policy: abort

>

> But that didn't help.

>

> I should say that the search apps hitting this ES cluster are not using persistent/keep-alive connections. And while this is clearly not ideal and not efficient, I think it still shouldn't cause this "leak" that ends up accumulating connections in CLOSE_WAIT state and eventually getting ES to stop being responsive.

>

> Is there anything one can do on the ES side to more aggressively close connections?

>

> Thanks,

> Otis

> --

> Search Analytics - http://sematext.com/search-analytics/index.html

> Scalable Performance Monitoring - http://sematext.com/spm/index.html

>


(Shay Banon) #6
  • Based on otis first post, he was using reject_policy of abort, can you
    clarify what reject_policy was used with the test with no keep alive that
    resulted in many CLOSE_WAIT?

  • If the CLOSE_WAIT is still a problem with reject policy of abort, can you
    run the test without keep alive and the two options I asked for?

On Thu, Jun 28, 2012 at 12:11 AM, Rafał Kuć r.kuc@solr.pl wrote:

Hello!

We didn't see any CLOSE_WAIT's while we were doing performance testing
using keep alive. I'll change reject policy to abort and will see how that
goes.

*--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
ElasticSearch

First, can you make sure you ran your test with reject policy of abort
for the thread pool?
Second, Can you try two things:

  1. After you stop the load test, do you still have CLOSE_WAIT?
  2. If you run a single "client" load test, do you see CLOSE_WAIT?

-shay.banon

On Wed, Jun 27, 2012 at 11:12 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:
Hi Paul,

On Tuesday, June 26, 2012 2:50:19 AM UTC-4, Paul Brown wrote:

Hi, Otis --

The wikipedia article on TCP has a state chart that may be helpful:

http://en.wikipedia.org/wiki/http://en.wikipedia.org/wiki/Transmission_Control_Protocol
Transmission_Control_Protocolhttp://en.wikipedia.org/wiki/Transmission_Control_Protocol

CLOSE_WAIT essentially means that the PHP app (libcurl of some sort, I
assume) hasn't done a full job of closing the connection, e.g., closing the
TCP connection but not the underlying socket, so that's where I'd look.
For example, the option CURLOPT_FORBID_REUSE might be useful.

Is that really so?
I looked at this diagram: http://en.wikipedia.org/wiki/File:TCP_CLOSE.svg
If I read that correctly, it looks like CLOSE_WAIT happens when the client
(left side) which initiated the connection issues a FIN, which I understand
as the client saying "I want to close this connection". After that FIN is
received by the server/receiver, that server/receiver goes into the
CLOSE_WAIT state and at that time it is supposed to answer by sending the
ACK and then (after some time?) by sending FIN going going into LAST_ACK
state and then, after client responds with ACK, into CLOSE state.

So if the server side is in CLOSE_WAIT, doesn't that mean that the server
received a FIN, but did not send ACK back to client?

Thanks,
Otis

Search Analytics - http://sematext.com/search-http://sematext.com/search-analytics/index.html
analytics/index.html http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.http://sematext.com/spm/index.html
html http://sematext.com/spm/index.html

On Jun 25, 2012, at 8:26 PM, Otis Gospodnetic wrote:

Hello,

I've been fighting with ES that stops working after it gets hit for
several minutes by about 1200 QPS. This is happening with ES 0.19.3 and
0.19.4 on big boxes (24 cores, 96 GB RAM).

What seems to be happening is that after a while we start seeing more
and more and more CLOSE_WAIT connections between search clients (~500
frontend PHP apps) and ES, like this:

$ netstat -T | head
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address
State
tcp 325 0 11.11.11.11-static.reverse.softlayer.com:wap-wsp
184.184.184.184-static.http://184.184.184.184-static.reverse.softlayer.com:32035
reverse.softlayer.com:32035http://184.184.184.184-static.reverse.softlayer.com:32035
CLOSE_WAIT
...
...

The number of CLOSE_WAIT connections goes from being 0 for a while to
going up into thousands. And then at some point ES stops responding on
port 9200 (but still responds on port 9300).

The number of these CLOSE_WAIT connections seems to roughly correspond
to the "current_open" HTTP metric:

$ curl --silent 11.11.11.11:9200/_cluster/http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
nodes/stats?network=true&http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
transport=true&http=true&http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
thread_pool=true&indices=http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
false&pretty=truehttp://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep 'address|current|curr|server_open'
"transport_address" : "inet[/ 11.11.11.11 :9300]",
"curr_estab" : 87,
"server_open" : 36,
"current_open" : 7,
<== healthy, just restarted
"transport_address" : "inet[/ 22.22.22.22 :9300]",
"curr_estab" : 1245,
"server_open" : 612,
"current_open" : 14,
<== healthy, just restarted
"transport_address" : "inet[/ 33.33.33.33:9300]",
"curr_estab" : 93,
"server_open" : 36,
"current_open" : 14,
<== healthy, just restarted
"transport_address" : "inet[/ 44.44.44.44 :9300]",
"curr_estab" : 171,
"server_open" : 36,
"current_open" : 15776,
<== baaad, not restarted

I tried using using threadpool (both fixed and blocking with both abort
and client rejection policies) to try stopping this "current_open" from
growing, e.g.:

threadpool:
search:
type: fixed
size: 120
queue_size: 100
reject_policy: abort

But that didn't help.

I should say that the search apps hitting this ES cluster are not using
persistent/keep-alive connections. And while this is clearly not ideal and
not efficient, I think it still shouldn't cause this "leak" that ends up
accumulating connections in CLOSE_WAIT state and eventually getting ES to
stop being responsive.

Is there anything one can do on the ES side to more aggressively close
connections?

Thanks,
Otis

Search Analytics - http://sematext.com/search-http://sematext.com/search-analytics/index.html
analytics/index.html http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.http://sematext.com/spm/index.html
html http://sematext.com/spm/index.html


(Rafał Kuć) #7

Hello!

We tried different settings on nodes - one of it was reject policy abort with size 500 and 120, which resulted in CLOSE_WAIT's under a certain load. Right now I've configured all the nodes in the cluster to 'reject_policy: abort'.

As for the tests, I think it won't be a problem, but first, lets see how ElasticSearch will behave with the current reject policy.

Ah, just to clarify things, we did try 0.19.3, 0.19.4 and now we are running 0.19.7 :slight_smile:

--

Thanks,

Rafał Kuć

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

  • Based on otis first post, he was using reject_policy of abort, can you clarify what reject_policy was used with the test with no keep alive that resulted in many CLOSE_WAIT?

  • If the CLOSE_WAIT is still a problem with reject policy of abort, can you run the test without keep alive and the two options I asked for?

On Thu, Jun 28, 2012 at 12:11 AM, Rafał Kuć <r.kuc@solr.pl> wrote:

Hello!

We didn't see any CLOSE_WAIT's while we were doing performance testing using keep alive. I'll change reject policy to abort and will see how that goes.

--

Regards,

Rafał Kuć

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

First, can you make sure you ran your test with reject policy of abort for the thread pool?

Second, Can you try two things:

  1. After you stop the load test, do you still have CLOSE_WAIT?

  2. If you run a single "client" load test, do you see CLOSE_WAIT?

-shay.banon

On Wed, Jun 27, 2012 at 11:12 PM, Otis Gospodnetic <otis.gospodnetic@gmail.com> wrote:

Hi Paul,

On Tuesday, June 26, 2012 2:50:19 AM UTC-4, Paul Brown wrote:

Hi, Otis --

The wikipedia article on TCP has a state chart that may be helpful:

http://en.wikipedia.org/wiki/Transmission_Control_Protocol

CLOSE_WAIT essentially means that the PHP app (libcurl of some sort, I assume) hasn't done a full job of closing the connection, e.g., closing the TCP connection but not the underlying socket, so that's where I'd look. For example, the option CURLOPT_FORBID_REUSE might be useful.

Is that really so?

I looked at this diagram: http://en.wikipedia.org/wiki/File:TCP_CLOSE.svg

If I read that correctly, it looks like CLOSE_WAIT happens when the client (left side) which initiated the connection issues a FIN, which I understand as the client saying "I want to close this connection". After that FIN is received by the server/receiver, that server/receiver goes into the CLOSE_WAIT state and at that time it is supposed to answer by sending the ACK and then (after some time?) by sending FIN going going into LAST_ACK state and then, after client responds with ACK, into CLOSE state.

So if the server side is in CLOSE_WAIT, doesn't that mean that the server received a FIN, but did not send ACK back to client?

Thanks,

Otis

--

Search Analytics - http://sematext.com/search-analytics/index.html

Scalable Performance Monitoring - http://sematext.com/spm/index.html

On Jun 25, 2012, at 8:26 PM, Otis Gospodnetic wrote:

> Hello,

>

> I've been fighting with ES that stops working after it gets hit for several minutes by about 1200 QPS. This is happening with ES 0.19.3 and 0.19.4 on big boxes (24 cores, 96 GB RAM).

>

> What seems to be happening is that after a while we start seeing more and more and more CLOSE_WAIT connections between search clients (~500 frontend PHP apps) and ES, like this:

>

> $ netstat -T | head

> Active Internet connections (w/o servers)

> Proto Recv-Q Send-Q Local Address Foreign Address State

> tcp 325 0 11.11.11.11-static.reverse.softlayer.com:wap-wsp 184.184.184.184-static.reverse.softlayer.com:32035 CLOSE_WAIT

> ...

> ...

>

> The number of CLOSE_WAIT connections goes from being 0 for a while to going up into thousands. And then at some point ES stops responding on port 9200 (but still responds on port 9300).

>

> The number of these CLOSE_WAIT connections seems to roughly correspond to the "current_open" HTTP metric:

>

> $ curl --silent 11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true' | egrep 'address|current|curr|server_open'

> "transport_address" : "inet[/ 11.11.11.11 :9300]",

> "curr_estab" : 87,

> "server_open" : 36,

> "current_open" : 7, <== healthy, just restarted

> "transport_address" : "inet[/ 22.22.22.22 :9300]",

> "curr_estab" : 1245,

> "server_open" : 612,

> "current_open" : 14, <== healthy, just restarted

> "transport_address" : "inet[/ 33.33.33.33:9300]",

> "curr_estab" : 93,

> "server_open" : 36,

> "current_open" : 14, <== healthy, just restarted

> "transport_address" : "inet[/ 44.44.44.44 :9300]",

> "curr_estab" : 171,

> "server_open" : 36,

> "current_open" : 15776, <== baaad, not restarted

>

> I tried using using threadpool (both fixed and blocking with both abort and client rejection policies) to try stopping this "current_open" from growing, e.g.:

>

> threadpool:

> search:

> type: fixed

> size: 120

> queue_size: 100

> reject_policy: abort

>

> But that didn't help.

>

> I should say that the search apps hitting this ES cluster are not using persistent/keep-alive connections. And while this is clearly not ideal and not efficient, I think it still shouldn't cause this "leak" that ends up accumulating connections in CLOSE_WAIT state and eventually getting ES to stop being responsive.

>

> Is there anything one can do on the ES side to more aggressively close connections?

>

> Thanks,

> Otis

> --

> Search Analytics - http://sematext.com/search-analytics/index.html

> Scalable Performance Monitoring - http://sematext.com/spm/index.html

>


(Otis Gospodnetić) #8

Hi Shay,

This is what we are using now:

network:
tcp:
keep_alive: true # also tried setting to false

timeout: 5s # this didn't help

threadpool:
search:
type: fixed # also tried blocking
size: 40
queue_size: 10
reject_policy: abort # also tried client

And ES sees this:

$ curl --silent
'11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep 'address|current|curr|server_open'
"transport_address" : "inet[/11.11.11.11:9300]",
"curr_estab" : 764,
"server_open" : 360,
"current_open" : 1053, <== this is approximately the number
of CLOSE_WAITs as shown by netstat -T

And this is where we see the threadpool numbers:

$ curl --silent
'11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep -C 4 search"

    "search" : {
      "threads" : 40,
      "queue" : 0,
      "active" : 6

The above is actually from a production environment we are trying to
stabilize.

So because of only 40 threads, we do see rejections:

[2012-06-27 21:23:51,541][WARN ][search.action ] [search 2]
Failed to send release search context
org.elasticsearch.transport.RemoteTransportException: [search
4][inet[/11.11.11.11:9300]][search/freeContext]
Caused by:
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException
at
org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:33)
at
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)

But CLOSE_WAITs are still piling up.

The tricky thing is that under a certain load (< 1200 QPS) CLOSE_WAITs are
nowhere to be found. The number of ESTABLISHED connections is more or less
constant and the number of CLOSE_WAITs is 0.
But a few minutes after we increase the load to > 1200 QPS we start seeing
CLOSE_WAITs and they just keep increasing.
Actually, now that I look at things, I see that even the number of
ESTABLISHED connections start to grow at some point, too. Not as fast as
CLOSE_WAITs, but growing.

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html

On Wednesday, June 27, 2012 6:37:49 PM UTC-4, Rafał Kuć wrote:

Hello!

We tried different settings on nodes - one of it was reject policy abort
with size 500 and 120, which resulted in CLOSE_WAIT's under a certain load.
Right now I've configured all the nodes in the cluster to 'reject_policy:
abort'.

As for the tests, I think it won't be a problem, but first, lets see how
ElasticSearch will behave with the current reject policy.

Ah, just to clarify things, we did try 0.19.3, 0.19.4 and now we are
running 0.19.7 :slight_smile:

*--
Thanks,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
ElasticSearch

  • Based on otis first post, he was using reject_policy of abort, can you
    clarify what reject_policy was used with the test with no keep alive that
    resulted in many CLOSE_WAIT?

  • If the CLOSE_WAIT is still a problem with reject policy of abort, can
    you run the test without keep alive and the two options I asked for?

On Thu, Jun 28, 2012 at 12:11 AM, Rafał Kuć r.kuc@solr.pl wrote:
Hello!

We didn't see any CLOSE_WAIT's while we were doing performance testing
using keep alive. I'll change reject policy to abort and will see how that
goes.

*--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
ElasticSearch

First, can you make sure you ran your test with reject policy of abort
for the thread pool?
Second, Can you try two things:

  1. After you stop the load test, do you still have CLOSE_WAIT?
  2. If you run a single "client" load test, do you see CLOSE_WAIT?

-shay.banon

On Wed, Jun 27, 2012 at 11:12 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:
Hi Paul,

On Tuesday, June 26, 2012 2:50:19 AM UTC-4, Paul Brown wrote:

Hi, Otis --

The wikipedia article on TCP has a state chart that may be helpful:

http://en.wikipedia.org/wiki/http://en.wikipedia.org/wiki/Transmission_Control_Protocol
Transmission_Control_Protocolhttp://en.wikipedia.org/wiki/Transmission_Control_Protocol

CLOSE_WAIT essentially means that the PHP app (libcurl of some sort, I
assume) hasn't done a full job of closing the connection, e.g., closing the
TCP connection but not the underlying socket, so that's where I'd look.
For example, the option CURLOPT_FORBID_REUSE might be useful.

Is that really so?
I looked at this diagram: http://en.wikipedia.org/wiki/File:TCP_CLOSE.svg
If I read that correctly, it looks like CLOSE_WAIT happens when the client
(left side) which initiated the connection issues a FIN, which I understand
as the client saying "I want to close this connection". After that FIN is
received by the server/receiver, that server/receiver goes into the
CLOSE_WAIT state and at that time it is supposed to answer by sending the
ACK and then (after some time?) by sending FIN going going into LAST_ACK
state and then, after client responds with ACK, into CLOSE state.

So if the server side is in CLOSE_WAIT, doesn't that mean that the server
received a FIN, but did not send ACK back to client?

Thanks,
Otis

Search Analytics - http://sematext.com/search-http://sematext.com/search-analytics/index.html
analytics/index.html http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.http://sematext.com/spm/index.html
html http://sematext.com/spm/index.html

On Jun 25, 2012, at 8:26 PM, Otis Gospodnetic wrote:

Hello,

I've been fighting with ES that stops working after it gets hit for
several minutes by about 1200 QPS. This is happening with ES 0.19.3 and
0.19.4 on big boxes (24 cores, 96 GB RAM).

What seems to be happening is that after a while we start seeing more
and more and more CLOSE_WAIT connections between search clients (~500
frontend PHP apps) and ES, like this:

$ netstat -T | head
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address
State
tcp 325 0 11.11.11.11-static.reverse.softlayer.com:wap-wsp
184.184.184.184-static.http://184.184.184.184-static.reverse.softlayer.com:32035
reverse.softlayer.com:32035http://184.184.184.184-static.reverse.softlayer.com:32035
CLOSE_WAIT
...
...

The number of CLOSE_WAIT connections goes from being 0 for a while to
going up into thousands. And then at some point ES stops responding on
port 9200 (but still responds on port 9300).

The number of these CLOSE_WAIT connections seems to roughly correspond
to the "current_open" HTTP metric:

$ curl --silent 11.11.11.11:9200/_cluster/http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
nodes/stats?network=true&http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
transport=true&http=true&http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
thread_pool=true&indices=http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
false&pretty=truehttp://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep 'address|current|curr|server_open'
"transport_address" : "inet[/ 11.11.11.11 :9300]",
"curr_estab" : 87,
"server_open" : 36,
"current_open" : 7,
<== healthy, just restarted
"transport_address" : "inet[/ 22.22.22.22 :9300]",
"curr_estab" : 1245,
"server_open" : 612,
"current_open" : 14,
<== healthy, just restarted
"transport_address" : "inet[/ 33.33.33.33:9300]",
"curr_estab" : 93,
"server_open" : 36,
"current_open" : 14,
<== healthy, just restarted
"transport_address" : "inet[/ 44.44.44.44 :9300]",
"curr_estab" : 171,
"server_open" : 36,
"current_open" : 15776,
<== baaad, not restarted

I tried using using threadpool (both fixed and blocking with both abort
and client rejection policies) to try stopping this "current_open" from
growing, e.g.:

threadpool:
search:
type: fixed
size: 120
queue_size: 100
reject_policy: abort

But that didn't help.

I should say that the search apps hitting this ES cluster are not using
persistent/keep-alive connections. And while this is clearly not ideal and
not efficient, I think it still shouldn't cause this "leak" that ends up
accumulating connections in CLOSE_WAIT state and eventually getting ES to
stop being responsive.

Is there anything one can do on the ES side to more aggressively close
connections?

Thanks,
Otis

Search Analytics - http://sematext.com/search-http://sematext.com/search-analytics/index.html
analytics/index.html http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.http://sematext.com/spm/index.html
html http://sematext.com/spm/index.html


(Rafał Kuć) #9

Hello!

Right now, after a few hours the number of CLOSE_WAIT connections are > 20k on all nodes in the cluster, which makes ElasticSearch unresponsive to the API calls on 9200 port. However, clients using Java API are still working without a problem.

--

Regards,

Rafał Kuć

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

Hi Shay,

This is what we are using now:

network:

tcp:

keep_alive: true  # also tried setting to false

timeout: 5s # this didn't help

threadpool:

search:

type: fixed                # also tried blocking


size: 40


queue_size: 10


reject_policy: abort   # also tried client

And ES sees this:

$ curl --silent '11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true' | egrep 'address|current|curr|server_open'

  "transport_address" : "inet[/11.11.11.11:9300]",


      "curr_estab" : 764,


    "server_open" : 360,


    "current_open" : 1053,         &lt;== this is approximately the number of CLOSE_WAITs as shown by netstat -T

And this is where we see the threadpool numbers:

$ curl --silent '11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true' | egrep -C 4 search"

    "search" : {


      "threads" : 40,


      "queue" : 0,


      "active" : 6

The above is actually from a production environment we are trying to stabilize.

So because of only 40 threads, we do see rejections:

[2012-06-27 21:23:51,541][WARN ][search.action ] [search 2] Failed to send release search context

org.elasticsearch.transport.RemoteTransportException: [search 4][inet[/11.11.11.11:9300]][search/freeContext]

Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException

    at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:33)


    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)

But CLOSE_WAITs are still piling up.

The tricky thing is that under a certain load (< 1200 QPS) CLOSE_WAITs are nowhere to be found. The number of ESTABLISHED connections is more or less constant and the number of CLOSE_WAITs is 0.

But a few minutes after we increase the load to > 1200 QPS we start seeing CLOSE_WAITs and they just keep increasing.

Actually, now that I look at things, I see that even the number of ESTABLISHED connections start to grow at some point, too. Not as fast as CLOSE_WAITs, but growing.

Otis

--

Search Analytics - http://sematext.com/search-analytics/index.html

Scalable Performance Monitoring - http://sematext.com/spm/index.html

On Wednesday, June 27, 2012 6:37:49 PM UTC-4, Rafał Kuć wrote:

Hello!

We tried different settings on nodes - one of it was reject policy abort with size 500 and 120, which resulted in CLOSE_WAIT's under a certain load. Right now I've configured all the nodes in the cluster to 'reject_policy: abort'.

As for the tests, I think it won't be a problem, but first, lets see how ElasticSearch will behave with the current reject policy.

Ah, just to clarify things, we did try 0.19.3, 0.19.4 and now we are running 0.19.7 :slight_smile:

--

Thanks,

Rafał Kuć

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

  • Based on otis first post, he was using reject_policy of abort, can you clarify what reject_policy was used with the test with no keep alive that resulted in many CLOSE_WAIT?

  • If the CLOSE_WAIT is still a problem with reject policy of abort, can you run the test without keep alive and the two options I asked for?

On Thu, Jun 28, 2012 at 12:11 AM, Rafał Kuć <r.kuc@solr.pl> wrote:

Hello!

We didn't see any CLOSE_WAIT's while we were doing performance testing using keep alive. I'll change reject policy to abort and will see how that goes.

--

Regards,

Rafał Kuć

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

First, can you make sure you ran your test with reject policy of abort for the thread pool?

Second, Can you try two things:

  1. After you stop the load test, do you still have CLOSE_WAIT?

  2. If you run a single "client" load test, do you see CLOSE_WAIT?

-shay.banon

On Wed, Jun 27, 2012 at 11:12 PM, Otis Gospodnetic <otis.gospodnetic@gmail.com> wrote:

Hi Paul,

On Tuesday, June 26, 2012 2:50:19 AM UTC-4, Paul Brown wrote:

Hi, Otis --

The wikipedia article on TCP has a state chart that may be helpful:

http://en.wikipedia.org/wiki/Transmission_Control_Protocol

CLOSE_WAIT essentially means that the PHP app (libcurl of some sort, I assume) hasn't done a full job of closing the connection, e.g., closing the TCP connection but not the underlying socket, so that's where I'd look. For example, the option CURLOPT_FORBID_REUSE might be useful.

Is that really so?

I looked at this diagram: http://en.wikipedia.org/wiki/File:TCP_CLOSE.svg

If I read that correctly, it looks like CLOSE_WAIT happens when the client (left side) which initiated the connection issues a FIN, which I understand as the client saying "I want to close this connection". After that FIN is received by the server/receiver, that server/receiver goes into the CLOSE_WAIT state and at that time it is supposed to answer by sending the ACK and then (after some time?) by sending FIN going going into LAST_ACK state and then, after client responds with ACK, into CLOSE state.

So if the server side is in CLOSE_WAIT, doesn't that mean that the server received a FIN, but did not send ACK back to client?

Thanks,

Otis

--

Search Analytics - http://sematext.com/search-analytics/index.html

Scalable Performance Monitoring - http://sematext.com/spm/index.html

On Jun 25, 2012, at 8:26 PM, Otis Gospodnetic wrote:

> Hello,

>

> I've been fighting with ES that stops working after it gets hit for several minutes by about 1200 QPS. This is happening with ES 0.19.3 and 0.19.4 on big boxes (24 cores, 96 GB RAM).

>

> What seems to be happening is that after a while we start seeing more and more and more CLOSE_WAIT connections between search clients (~500 frontend PHP apps) and ES, like this:

>

> $ netstat -T | head

> Active Internet connections (w/o servers)

> Proto Recv-Q Send-Q Local Address Foreign Address State

> tcp 325 0 11.11.11.11-static.reverse.softlayer.com:wap-wsp 184.184.184.184-static.reverse.softlayer.com:32035 CLOSE_WAIT

> ...

> ...

>

> The number of CLOSE_WAIT connections goes from being 0 for a while to going up into thousands. And then at some point ES stops responding on port 9200 (but still responds on port 9300).

>

> The number of these CLOSE_WAIT connections seems to roughly correspond to the "current_open" HTTP metric:

>

> $ curl --silent 11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true' | egrep 'address|current|curr|server_open'

> "transport_address" : "inet[/ 11.11.11.11 :9300]",

> "curr_estab" : 87,

> "server_open" : 36,

> "current_open" : 7, <== healthy, just restarted

> "transport_address" : "inet[/ 22.22.22.22 :9300]",

> "curr_estab" : 1245,

> "server_open" : 612,

> "current_open" : 14, <== healthy, just restarted

> "transport_address" : "inet[/ 33.33.33.33:9300]",

> "curr_estab" : 93,

> "server_open" : 36,

> "current_open" : 14, <== healthy, just restarted

> "transport_address" : "inet[/ 44.44.44.44 :9300]",

> "curr_estab" : 171,

> "server_open" : 36,

> "current_open" : 15776, <== baaad, not restarted

>

> I tried using using threadpool (both fixed and blocking with both abort and client rejection policies) to try stopping this "current_open" from growing, e.g.:

>

> threadpool:

> search:

> type: fixed

> size: 120

> queue_size: 100

> reject_policy: abort

>

> But that didn't help.

>

> I should say that the search apps hitting this ES cluster are not using persistent/keep-alive connections. And while this is clearly not ideal and not efficient, I think it still shouldn't cause this "leak" that ends up accumulating connections in CLOSE_WAIT state and eventually getting ES to stop being responsive.

>

> Is there anything one can do on the ES side to more aggressively close connections?

>

> Thanks,

> Otis

> --

> Search Analytics - http://sematext.com/search-analytics/index.html

> Scalable Performance Monitoring - http://sematext.com/spm/index.html

>


(Jörg Prante) #10

FYI, in a non-ES context, I had to deal with many low-level socket
server/client error conditions, and encountered also a lot of CLOSE_WAIT
states that pile up and could make systems at worst case completely stall
for minutes. As described, the client aborts the close prematurely and is
not behaving correctly. As a consequence, the operating system has a
system-wide tcp timeout before removing the socket. One (not so good)
method is to tweak this operating system value (I was surprised how
differently Linux and Solaris managed such cases).

The best method I found to work around this issue was both server and
client using keepalive connections and handle the timeout states.

It would also be interesting to examine the curl / libcurl version,
probably it is an old one and there is some issue there.

Best,

Jörg


(Shay Banon) #11

Another question, are the PHP clients running on the same nodes as
elasticsearch, or on different nodes (I assume on different nodes, just
want to make sure).

I have a suspicion that maybe netty, the networking library we use, does
not get around to actually close the connections because of the load on the
system. Trying to chase that one down with Trustin. When this happens, can
you gist a thread dump (jstack) just so we have it?

But, as suggested, the best thing to do is to use keep alive. nginx by the
way can be used to abstract that nicely as Karmi found out, see here:


.

On Thu, Jun 28, 2012 at 9:38 AM, Rafał Kuć r.kuc@solr.pl wrote:

Hello!

Right now, after a few hours the number of CLOSE_WAIT connections are >
20k on all nodes in the cluster, which makes ElasticSearch unresponsive to
the API calls on 9200 port. However, clients using Java API are still
working without a problem.

--
Regards,
Rafał Kuć
Sematext :: *
http://sematext.com/
:: Solr - Lucene - Nutch - ElasticSearch

Hi Shay,

This is what we are using now:

network:
tcp:
keep_alive: true # also tried setting to false

timeout: 5s # this didn't help

threadpool:
search:
type: fixed # also tried blocking
size: 40
queue_size: 10
reject_policy: abort # also tried client

And ES sees this:

$ curl --silent '
11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep 'address|current|curr|server_open'
"transport_address" : "inet[/11.11.11.11:9300]",
"curr_estab" : 764,
"server_open" : 360,
"current_open" : 1053, <== this is approximately the
number of CLOSE_WAITs as shown by netstat -T

And this is where we see the threadpool numbers:

$ curl --silent '
11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep -C 4 search"

    "search" : {
      "threads" : 40,
      "queue" : 0,
      "active" : 6

The above is actually from a production environment we are trying to
stabilize.

So because of only 40 threads, we do see rejections:

[2012-06-27 21:23:51,541][WARN ][search.action ] [search 2]
Failed to send release search context
org.elasticsearch.transport.RemoteTransportException: [search
4][inet[/11.11.11.11:9300]][search/freeContext]
Caused by:
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException
at
org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:33)
at
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)

But CLOSE_WAITs are still piling up.

The tricky thing is that under a certain load (< 1200 QPS) CLOSE_WAITs are
nowhere to be found. The number of ESTABLISHED connections is more or less
constant and the number of CLOSE_WAITs is 0.
But a few minutes after we increase the load to > 1200 QPS we start seeing
CLOSE_WAITs and they just keep increasing.
Actually, now that I look at things, I see that even the number of
ESTABLISHED connections start to grow at some point, too. Not as fast as
CLOSE_WAITs, but growing.

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html

On Wednesday, June 27, 2012 6:37:49 PM UTC-4, Rafał Kuć wrote:
Hello!

We tried different settings on nodes - one of it was reject policy abort
with size 500 and 120, which resulted in CLOSE_WAIT's under a certain load.
Right now I've configured all the nodes in the cluster to 'reject_policy:
abort'.

As for the tests, I think it won't be a problem, but first, lets see how
ElasticSearch will behave with the current reject policy.

Ah, just to clarify things, we did try 0.19.3, 0.19.4 and now we are
running 0.19.7 :slight_smile:

*--
Thanks,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
ElasticSearch

  • Based on otis first post, he was using reject_policy of abort, can you
    clarify what reject_policy was used with the test with no keep alive that
    resulted in many CLOSE_WAIT?

  • If the CLOSE_WAIT is still a problem with reject policy of abort, can
    you run the test without keep alive and the two options I asked for?

On Thu, Jun 28, 2012 at 12:11 AM, Rafał Kuć r.kuc@solr.pl wrote:
Hello!

We didn't see any CLOSE_WAIT's while we were doing performance testing
using keep alive. I'll change reject policy to abort and will see how that
goes.

*--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
ElasticSearch

First, can you make sure you ran your test with reject policy of abort
for the thread pool?
Second, Can you try two things:

  1. After you stop the load test, do you still have CLOSE_WAIT?
  2. If you run a single "client" load test, do you see CLOSE_WAIT?

-shay.banon

On Wed, Jun 27, 2012 at 11:12 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:
Hi Paul,

On Tuesday, June 26, 2012 2:50:19 AM UTC-4, Paul Brown wrote:

Hi, Otis --

The wikipedia article on TCP has a state chart that may be helpful:

http://en.wikipedia.org/http://en.wikipedia.org/wiki/Transmission_Control_Protocol
wiki/ http://en.wikipedia.org/wiki/Transmission_Control_Protocol
Transmission_Control_http://en.wikipedia.org/wiki/Transmission_Control_Protocol
Protocol http://en.wikipedia.org/wiki/Transmission_Control_Protocol

CLOSE_WAIT essentially means that the PHP app (libcurl of some sort, I
assume) hasn't done a full job of closing the connection, e.g., closing the
TCP connection but not the underlying socket, so that's where I'd look.
For example, the option CURLOPT_FORBID_REUSE might be useful.

Is that really so?
I looked at this diagram: http://en.wikipedia.http://en.wikipedia.org/wiki/File:TCP_CLOSE.svg
org/wiki/File:TCP_CLOSE.svghttp://en.wikipedia.org/wiki/File:TCP_CLOSE.svg
If I read that correctly, it looks like CLOSE_WAIT happens when the client
(left side) which initiated the connection issues a FIN, which I understand
as the client saying "I want to close this connection". After that FIN is
received by the server/receiver, that server/receiver goes into the
CLOSE_WAIT state and at that time it is supposed to answer by sending the
ACK and then (after some time?) by sending FIN going going into LAST_ACK
state and then, after client responds with ACK, into CLOSE state.

So if the server side is in CLOSE_WAIT, doesn't that mean that the server
received a FIN, but did not send ACK back to client?

Thanks,
Otis

Search Analytics - http://sematext.com/search-http://sematext.com/search-analytics/index.html
a http://sematext.com/search-analytics/index.htmlnalytics/index.htmlhttp://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/http://sematext.com/spm/index.html
index. http://sematext.com/spm/index.htmlhtmlhttp://sematext.com/spm/index.html

On Jun 25, 2012, at 8:26 PM, Otis Gospodnetic wrote:

Hello,

I've been fighting with ES that stops working after it gets hit for
several minutes by about 1200 QPS. This is happening with ES 0.19.3 and
0.19.4 on big boxes (24 cores, 96 GB RAM).

What seems to be happening is that after a while we start seeing more
and more and more CLOSE_WAIT connections between search clients (~500
frontend PHP apps) and ES, like this:

$ netstat -T | head
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address
State
tcp 325 0 11.11.11.11-static.reverse.softlayer.com:wap-wsp
184.184. http://184.184.184.184-static.reverse.softlayer.com:32035
184.184-static.http://184.184.184.184-static.reverse.softlayer.com:32035
reverse. http://184.184.184.184-static.reverse.softlayer.com:32035
softlayer.com:32035http://184.184.184.184-static.reverse.softlayer.com:32035
CLOSE_WAIT
...
...

The number of CLOSE_WAIT connections goes from being 0 for a while to
going up into thousands. And then at some point ES stops responding on
port 9200 (but still responds on port 9300).

The number of these CLOSE_WAIT connections seems to roughly correspond
to the "current_open" HTTP metric:

$ curl --silent 11.11.11.11:9200/_http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
cluster/http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
nodes/stats?network=http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
true&http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
transport=true&http=true&http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
thread_pool=true&indices=http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
falsehttp://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
&pretty=truehttp://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep 'address|current|curr|server_open'
"transport_address" : "inet[/ 11.11.11.11 :9300]",
"curr_estab" : 87,
"server_open" : 36,
"current_open" : 7,
<== healthy, just restarted
"transport_address" : "inet[/ 22.22.22.22 :9300]",
"curr_estab" : 1245,
"server_open" : 612,
"current_open" : 14,
<== healthy, just restarted
"transport_address" : "inet[/ 33.33.33.33:9300]",
"curr_estab" : 93,
"server_open" : 36,
"current_open" : 14,
<== healthy, just restarted
"transport_address" : "inet[/ 44.44.44.44 :9300]",
"curr_estab" : 171,
"server_open" : 36,
"current_open" : 15776,
<== baaad, not restarted

I tried using using threadpool (both fixed and blocking with both abort
and client rejection policies) to try stopping this "current_open" from
growing, e.g.:

threadpool:
search:
type: fixed
size: 120
queue_size: 100
reject_policy: abort

But that didn't help.

I should say that the search apps hitting this ES cluster are not using
persistent/keep-alive connections. And while this is clearly not ideal and
not efficient, I think it still shouldn't cause this "leak" that ends up
accumulating connections in CLOSE_WAIT state and eventually getting ES to
stop being responsive.

Is there anything one can do on the ES side to more aggressively close
connections?

Thanks,
Otis

Search Analytics - http://sematext.com/search-http://sematext.com/search-analytics/index.html
a http://sematext.com/search-analytics/index.htmlnalytics/index.htmlhttp://sematext.com/search-analytics/index.html

Scalable Performance Monitoring - http://sematext.com/spm/http://sematext.com/spm/index.html
index. http://sematext.com/spm/index.htmlhtmlhttp://sematext.com/spm/index.html


(Rafał Kuć) #12

Hello!

PHP application is running on different machines. We will gist the jstack as soon as CLOSE_WAIT situation will happen again.

--

Regards,

Rafał Kuć

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

Another question, are the PHP clients running on the same nodes as elasticsearch, or on different nodes (I assume on different nodes, just want to make sure).

I have a suspicion that maybe netty, the networking library we use, does not get around to actually close the connections because of the load on the system. Trying to chase that one down with Trustin. When this happens, can you gist a thread dump (jstack) just so we have it?

But, as suggested, the best thing to do is to use keep alive. nginx by the way can be used to abstract that nicely as Karmi found out, see here: https://gist.github.com/0a2b0e0df83813a4045f/d237b1f2425353edc3b13c593ee2f960dfae0fca.

On Thu, Jun 28, 2012 at 9:38 AM, Rafał Kuć <r.kuc@solr.pl> wrote:

Hello!

Right now, after a few hours the number of CLOSE_WAIT connections are > 20k on all nodes in the cluster, which makes ElasticSearch unresponsive to the API calls on 9200 port. However, clients using Java API are still working without a problem.

--

Regards,

Rafał Kuć

Sematext ::

http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

Hi Shay,

This is what we are using now:

network:

tcp:

keep_alive: true  # also tried setting to false

timeout: 5s # this didn't help

threadpool:

search:

type: fixed                # also tried blocking


size: 40


queue_size: 10


reject_policy: abort   # also tried client

And ES sees this:

$ curl --silent '11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true' | egrep 'address|current|curr|server_open'

  "transport_address" : "inet[/<a style=" font-family:'courier new'; font-size: 9pt;" href="http://11.11.11.11:9300">11.11.11.11:9300</a>]",


      "curr_estab" : 764,


    "server_open" : 360,


    "current_open" : 1053,         &lt;== this is approximately the number of CLOSE_WAITs as shown by netstat -T

And this is where we see the threadpool numbers:

$ curl --silent '11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true' | egrep -C 4 search"

    "search" : {


      "threads" : 40,


      "queue" : 0,


      "active" : 6

The above is actually from a production environment we are trying to stabilize.

So because of only 40 threads, we do see rejections:

[2012-06-27 21:23:51,541][WARN ][search.action ] [search 2] Failed to send release search context

org.elasticsearch.transport.RemoteTransportException: [search 4][inet[/11.11.11.11:9300]][search/freeContext]

Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException

    at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:33)


    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)

But CLOSE_WAITs are still piling up.

The tricky thing is that under a certain load (< 1200 QPS) CLOSE_WAITs are nowhere to be found. The number of ESTABLISHED connections is more or less constant and the number of CLOSE_WAITs is 0.

But a few minutes after we increase the load to > 1200 QPS we start seeing CLOSE_WAITs and they just keep increasing.

Actually, now that I look at things, I see that even the number of ESTABLISHED connections start to grow at some point, too. Not as fast as CLOSE_WAITs, but growing.

Otis

--

Search Analytics - http://sematext.com/search-analytics/index.html

Scalable Performance Monitoring - http://sematext.com/spm/index.html

On Wednesday, June 27, 2012 6:37:49 PM UTC-4, Rafał Kuć wrote:

Hello!

We tried different settings on nodes - one of it was reject policy abort with size 500 and 120, which resulted in CLOSE_WAIT's under a certain load. Right now I've configured all the nodes in the cluster to 'reject_policy: abort'.

As for the tests, I think it won't be a problem, but first, lets see how ElasticSearch will behave with the current reject policy.

Ah, just to clarify things, we did try 0.19.3, 0.19.4 and now we are running 0.19.7 :slight_smile:

--

Thanks,

Rafał Kuć

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

  • Based on otis first post, he was using reject_policy of abort, can you clarify what reject_policy was used with the test with no keep alive that resulted in many CLOSE_WAIT?

  • If the CLOSE_WAIT is still a problem with reject policy of abort, can you run the test without keep alive and the two options I asked for?

On Thu, Jun 28, 2012 at 12:11 AM, Rafał Kuć <r.kuc@solr.pl> wrote:

Hello!

We didn't see any CLOSE_WAIT's while we were doing performance testing using keep alive. I'll change reject policy to abort and will see how that goes.

--

Regards,

Rafał Kuć

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

First, can you make sure you ran your test with reject policy of abort for the thread pool?

Second, Can you try two things:

  1. After you stop the load test, do you still have CLOSE_WAIT?

  2. If you run a single "client" load test, do you see CLOSE_WAIT?

-shay.banon

On Wed, Jun 27, 2012 at 11:12 PM, Otis Gospodnetic <otis.gospodnetic@gmail.com> wrote:

Hi Paul,

On Tuesday, June 26, 2012 2:50:19 AM UTC-4, Paul Brown wrote:

Hi, Otis --

The wikipedia article on TCP has a state chart that may be helpful:

http://en.wikipedia.org/wiki/Transmission_Control_Protocol

CLOSE_WAIT essentially means that the PHP app (libcurl of some sort, I assume) hasn't done a full job of closing the connection, e.g., closing the TCP connection but not the underlying socket, so that's where I'd look. For example, the option CURLOPT_FORBID_REUSE might be useful.

Is that really so?

I looked at this diagram: http://en.wikipedia.org/wiki/File:TCP_CLOSE.svg

If I read that correctly, it looks like CLOSE_WAIT happens when the client (left side) which initiated the connection issues a FIN, which I understand as the client saying "I want to close this connection". After that FIN is received by the server/receiver, that server/receiver goes into the CLOSE_WAIT state and at that time it is supposed to answer by sending the ACK and then (after some time?) by sending FIN going going into LAST_ACK state and then, after client responds with ACK, into CLOSE state.

So if the server side is in CLOSE_WAIT, doesn't that mean that the server received a FIN, but did not send ACK back to client?

Thanks,

Otis

--

Search Analytics - http://sematext.com/search-analytics/index.html

Scalable Performance Monitoring - http://sematext.com/spm/index.html

On Jun 25, 2012, at 8:26 PM, Otis Gospodnetic wrote:

> Hello,

>

> I've been fighting with ES that stops working after it gets hit for several minutes by about 1200 QPS. This is happening with ES 0.19.3 and 0.19.4 on big boxes (24 cores, 96 GB RAM).

>

> What seems to be happening is that after a while we start seeing more and more and more CLOSE_WAIT connections between search clients (~500 frontend PHP apps) and ES, like this:

>

> $ netstat -T | head

> Active Internet connections (w/o servers)

> Proto Recv-Q Send-Q Local Address Foreign Address State

> tcp 325 0 11.11.11.11-static.reverse.softlayer.com:wap-wsp 184.184.184.184-static.reverse.softlayer.com:32035 CLOSE_WAIT

> ...

> ...

>

> The number of CLOSE_WAIT connections goes from being 0 for a while to going up into thousands. And then at some point ES stops responding on port 9200 (but still responds on port 9300).

>

> The number of these CLOSE_WAIT connections seems to roughly correspond to the "current_open" HTTP metric:

>

> $ curl --silent 11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true' | egrep 'address|current|curr|server_open'

> "transport_address" : "inet[/ 11.11.11.11 :9300]",

> "curr_estab" : 87,

> "server_open" : 36,

> "current_open" : 7, <== healthy, just restarted

> "transport_address" : "inet[/ 22.22.22.22 :9300]",

> "curr_estab" : 1245,

> "server_open" : 612,

> "current_open" : 14, <== healthy, just restarted

> "transport_address" : "inet[/ 33.33.33.33:9300]",

> "curr_estab" : 93,

> "server_open" : 36,

> "current_open" : 14, <== healthy, just restarted

> "transport_address" : "inet[/ 44.44.44.44 :9300]",

> "curr_estab" : 171,

> "server_open" : 36,

> "current_open" : 15776, <== baaad, not restarted

>

> I tried using using threadpool (both fixed and blocking with both abort and client rejection policies) to try stopping this "current_open" from growing, e.g.:

>

> threadpool:

> search:

> type: fixed

> size: 120

> queue_size: 100

> reject_policy: abort

>

> But that didn't help.

>

> I should say that the search apps hitting this ES cluster are not using persistent/keep-alive connections. And while this is clearly not ideal and not efficient, I think it still shouldn't cause this "leak" that ends up accumulating connections in CLOSE_WAIT state and eventually getting ES to stop being responsive.

>

> Is there anything one can do on the ES side to more aggressively close connections?

>

> Thanks,

> Otis

> --

> Search Analytics - http://sematext.com/search-analytics/index.html

> Scalable Performance Monitoring - http://sematext.com/spm/index.html

>


(Rafał Kuć) #13

Hello,

Shay, I've created a gist from one of the nodes, when CLOSE_WAIT were
happening again - https://gist.github.com/3018085. Basically, the node just
got unresponsive, not responding to searches and the CLOSE_WAIT's are
increasing. Ah one more thing - there was more than 250 CLOSE_WAIT's while
I was creating that dump.

Thanks,
Rafał Kuć

W dniu piątek, 29 czerwca 2012 09:26:39 UTC+2 użytkownik Rafał Kuć napisał:

Hello!

PHP application is running on different machines. We will gist the jstack
as soon as CLOSE_WAIT situation will happen again.

*--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
ElasticSearch

Another question, are the PHP clients running on the same nodes as
elasticsearch, or on different nodes (I assume on different nodes, just
want to make sure).

I have a suspicion that maybe netty, the networking library we use, does
not get around to actually close the connections because of the load on the
system. Trying to chase that one down with Trustin. When this happens, can
you gist a thread dump (jstack) just so we have it?

But, as suggested, the best thing to do is to use keep alive. nginx by the
way can be used to abstract that nicely as Karmi found out, see here:
https://gist.github.com/0a2b0e0df83813a4045f/d237b1f2425353edc3b13c593ee2f960dfae0fca
.

On Thu, Jun 28, 2012 at 9:38 AM, Rafał Kuć r.kuc@solr.pl wrote:
Hello!

Right now, after a few hours the number of CLOSE_WAIT connections are >
20k on all nodes in the cluster, which makes ElasticSearch unresponsive to
the API calls on 9200 port. However, clients using Java API are still
working without a problem.

*--
Regards,
Rafał Kuć
Sematext ::
http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

Hi Shay,

This is what we are using now:

network:
tcp:
keep_alive: true # also tried setting to false

timeout: 5s # this didn't help

threadpool:
search:
type: fixed # also tried blocking
size: 40
queue_size: 10
reject_policy: abort # also tried client

And ES sees this:

$ curl --silent '
11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep 'address|current|curr|server_open'
"transport_address" : "inet[/11.11.11.11:9300]",
"curr_estab" : 764,
"server_open" : 360,
"current_open" : 1053, <== this is approximately the
number of CLOSE_WAITs as shown by netstat -T

And this is where we see the threadpool numbers:

$ curl --silent '
11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep -C 4 search"

    "search" : {
      "threads" : 40,
      "queue" : 0,
      "active" : 6

The above is actually from a production environment we are trying to
stabilize.

So because of only 40 threads, we do see rejections:

[2012-06-27 21:23:51,541][WARN ][search.action ] [search 2]
Failed to send release search context
org.elasticsearch.transport.RemoteTransportException: [search
4][inet[/11.11.11.11:9300]][search/freeContext]
Caused by:
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException
at
org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:33)
at
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)

But CLOSE_WAITs are still piling up.

The tricky thing is that under a certain load (< 1200 QPS) CLOSE_WAITs are
nowhere to be found. The number of ESTABLISHED connections is more or less
constant and the number of CLOSE_WAITs is 0.
But a few minutes after we increase the load to > 1200 QPS we start seeing
CLOSE_WAITs and they just keep increasing.
Actually, now that I look at things, I see that even the number of
ESTABLISHED connections start to grow at some point, too. Not as fast as
CLOSE_WAITs, but growing.

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html

On Wednesday, June 27, 2012 6:37:49 PM UTC-4, Rafał Kuć wrote:
Hello!

We tried different settings on nodes - one of it was reject policy abort
with size 500 and 120, which resulted in CLOSE_WAIT's under a certain load.
Right now I've configured all the nodes in the cluster to 'reject_policy:
abort'.

As for the tests, I think it won't be a problem, but first, lets see how
ElasticSearch will behave with the current reject policy.

Ah, just to clarify things, we did try 0.19.3, 0.19.4 and now we are
running 0.19.7 :slight_smile:

*--
Thanks,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
ElasticSearch

  • Based on otis first post, he was using reject_policy of abort, can you
    clarify what reject_policy was used with the test with no keep alive that
    resulted in many CLOSE_WAIT?

  • If the CLOSE_WAIT is still a problem with reject policy of abort, can
    you run the test without keep alive and the two options I asked for?

On Thu, Jun 28, 2012 at 12:11 AM, Rafał Kuć r.kuc@solr.pl wrote:
Hello!

We didn't see any CLOSE_WAIT's while we were doing performance testing
using keep alive. I'll change reject policy to abort and will see how that
goes.

*--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
ElasticSearch

First, can you make sure you ran your test with reject policy of abort
for the thread pool?
Second, Can you try two things:

  1. After you stop the load test, do you still have CLOSE_WAIT?
  2. If you run a single "client" load test, do you see CLOSE_WAIT?

-shay.banon

On Wed, Jun 27, 2012 at 11:12 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:
Hi Paul,

On Tuesday, June 26, 2012 2:50:19 AM UTC-4, Paul Brown wrote:

Hi, Otis --

The wikipedia article on TCP has a state chart that may be helpful:

http://en.wikipedia.org/http://en.wikipedia.org/wiki/Transmission_Control_Protocol
wiki/ http://en.wikipedia.org/wiki/Transmission_Control_Protocol
Transmission_Control_http://en.wikipedia.org/wiki/Transmission_Control_Protocol
Protocol http://en.wikipedia.org/wiki/Transmission_Control_Protocol

CLOSE_WAIT essentially means that the PHP app (libcurl of some sort, I
assume) hasn't done a full job of closing the connection, e.g., closing the
TCP connection but not the underlying socket, so that's where I'd look.
For example, the option CURLOPT_FORBID_REUSE might be useful.

Is that really so?
I looked at this diagram: http://en.wikipedia.http://en.wikipedia.org/wiki/File:TCP_CLOSE.svg
org/wiki/File:TCP_CLOSE.svghttp://en.wikipedia.org/wiki/File:TCP_CLOSE.svg
If I read that correctly, it looks like CLOSE_WAIT happens when the client
(left side) which initiated the connection issues a FIN, which I understand
as the client saying "I want to close this connection". After that FIN is
received by the server/receiver, that server/receiver goes into the
CLOSE_WAIT state and at that time it is supposed to answer by sending the
ACK and then (after some time?) by sending FIN going going into LAST_ACK
state and then, after client responds with ACK, into CLOSE state.

So if the server side is in CLOSE_WAIT, doesn't that mean that the server
received a FIN, but did not send ACK back to client?

Thanks,
Otis

Search Analytics - http://sematext.com/search-http://sematext.com/search-analytics/index.html
a http://sematext.com/search-analytics/index.htmlnalytics/index.htmlhttp://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/http://sematext.com/spm/index.html
index. http://sematext.com/spm/index.htmlhtmlhttp://sematext.com/spm/index.html

On Jun 25, 2012, at 8:26 PM, Otis Gospodnetic wrote:

Hello,

I've been fighting with ES that stops working after it gets hit for
several minutes by about 1200 QPS. This is happening with ES 0.19.3 and
0.19.4 on big boxes (24 cores, 96 GB RAM).

What seems to be happening is that after a while we start seeing more
and more and more CLOSE_WAIT connections between search clients (~500
frontend PHP apps) and ES, like this:

$ netstat -T | head
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address
State
tcp 325 0 11.11.11.11-static.reverse.softlayer.com:wap-wsp
184.184. http://184.184.184.184-static.reverse.softlayer.com:32035
184.184-static.http://184.184.184.184-static.reverse.softlayer.com:32035
reverse. http://184.184.184.184-static.reverse.softlayer.com:32035
softlayer.com:32035http://184.184.184.184-static.reverse.softlayer.com:32035
CLOSE_WAIT
...
...

The number of CLOSE_WAIT connections goes from being 0 for a while to
going up into thousands. And then at some point ES stops responding on
port 9200 (but still responds on port 9300).

The number of these CLOSE_WAIT connections seems to roughly correspond
to the "current_open" HTTP metric:

$ curl --silent 11.11.11.11:9200/_http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
cluster/http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
nodes/stats?network=http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
true&http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
transport=true&http=true&http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
thread_pool=true&indices=http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
falsehttp://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
&pretty=truehttp://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep 'address|current|curr|server_open'
"transport_address" : "inet[/ 11.11.11.11 :9300]",
"curr_estab" : 87,
"server_open" : 36,
"current_open" : 7,
<== healthy, just restarted
"transport_address" : "inet[/ 22.22.22.22 :9300]",
"curr_estab" : 1245,
"server_open" : 612,
"current_open" : 14,
<== healthy, just restarted
"transport_address" : "inet[/ 33.33.33.33:9300]",
"curr_estab" : 93,
"server_open" : 36,
"current_open" : 14,
<== healthy, just restarted
"transport_address" : "inet[/ 44.44.44.44 :9300]",
"curr_estab" : 171,
"server_open" : 36,
"current_open" : 15776,
<== baaad, not restarted

I tried using using threadpool (both fixed and blocking with both abort
and client rejection policies) to try stopping this "current_open" from
growing, e.g.:

threadpool:
search:
type: fixed
size: 120
queue_size: 100
reject_policy: abort

But that didn't help.

I should say that the search apps hitting this ES cluster are not using
persistent/keep-alive connections. And while this is clearly not ideal and
not efficient, I think it still shouldn't cause this "leak" that ends up
accumulating connections in CLOSE_WAIT state and eventually getting ES to
stop being responsive.

Is there anything one can do on the ES side to more aggressively close
connections?

Thanks,
Otis

Search Analytics - http://sematext.com/search-http://sematext.com/search-analytics/index.html
a http://sematext.com/search-analytics/index.htmlnalytics/index.htmlhttp://sematext.com/search-analytics/index.html

Scalable Performance Monitoring - http://sematext.com/spm/http://sematext.com/spm/index.html
index. http://sematext.com/spm/index.htmlhtmlhttp://sematext.com/spm/index.html


(Otis Gospodnetić) #14

Hi,

If it matters - this PHP application is really running on 500 front-end
servers. Each of them is connecting to ES independently. And on top of
that the PHP process is not persistent, so a new connection is created
(with PHP curl lib) for each query sent to ES. (not our design! :))

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html

On Friday, June 29, 2012 3:26:39 AM UTC-4, Rafał Kuć wrote:

Hello!

PHP application is running on different machines. We will gist the jstack
as soon as CLOSE_WAIT situation will happen again.

*--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
ElasticSearch

Another question, are the PHP clients running on the same nodes as
elasticsearch, or on different nodes (I assume on different nodes, just
want to make sure).

I have a suspicion that maybe netty, the networking library we use, does
not get around to actually close the connections because of the load on the
system. Trying to chase that one down with Trustin. When this happens, can
you gist a thread dump (jstack) just so we have it?

But, as suggested, the best thing to do is to use keep alive. nginx by the
way can be used to abstract that nicely as Karmi found out, see here:
https://gist.github.com/0a2b0e0df83813a4045f/d237b1f2425353edc3b13c593ee2f960dfae0fca
.

On Thu, Jun 28, 2012 at 9:38 AM, Rafał Kuć r.kuc@solr.pl wrote:
Hello!

Right now, after a few hours the number of CLOSE_WAIT connections are >
20k on all nodes in the cluster, which makes ElasticSearch unresponsive to
the API calls on 9200 port. However, clients using Java API are still
working without a problem.

*--
Regards,
Rafał Kuć
Sematext ::
http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

Hi Shay,

This is what we are using now:

network:
tcp:
keep_alive: true # also tried setting to false

timeout: 5s # this didn't help

threadpool:
search:
type: fixed # also tried blocking
size: 40
queue_size: 10
reject_policy: abort # also tried client

And ES sees this:

$ curl --silent '
11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep 'address|current|curr|server_open'
"transport_address" : "inet[/11.11.11.11:9300]",
"curr_estab" : 764,
"server_open" : 360,
"current_open" : 1053, <== this is approximately the
number of CLOSE_WAITs as shown by netstat -T

And this is where we see the threadpool numbers:

$ curl --silent '
11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep -C 4 search"

    "search" : {
      "threads" : 40,
      "queue" : 0,
      "active" : 6

The above is actually from a production environment we are trying to
stabilize.

So because of only 40 threads, we do see rejections:

[2012-06-27 21:23:51,541][WARN ][search.action ] [search 2]
Failed to send release search context
org.elasticsearch.transport.RemoteTransportException: [search
4][inet[/11.11.11.11:9300]][search/freeContext]
Caused by:
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException
at
org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:33)
at
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)

But CLOSE_WAITs are still piling up.

The tricky thing is that under a certain load (< 1200 QPS) CLOSE_WAITs are
nowhere to be found. The number of ESTABLISHED connections is more or less
constant and the number of CLOSE_WAITs is 0.
But a few minutes after we increase the load to > 1200 QPS we start seeing
CLOSE_WAITs and they just keep increasing.
Actually, now that I look at things, I see that even the number of
ESTABLISHED connections start to grow at some point, too. Not as fast as
CLOSE_WAITs, but growing.

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html

On Wednesday, June 27, 2012 6:37:49 PM UTC-4, Rafał Kuć wrote:
Hello!

We tried different settings on nodes - one of it was reject policy abort
with size 500 and 120, which resulted in CLOSE_WAIT's under a certain load.
Right now I've configured all the nodes in the cluster to 'reject_policy:
abort'.

As for the tests, I think it won't be a problem, but first, lets see how
ElasticSearch will behave with the current reject policy.

Ah, just to clarify things, we did try 0.19.3, 0.19.4 and now we are
running 0.19.7 :slight_smile:

*--
Thanks,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
ElasticSearch

  • Based on otis first post, he was using reject_policy of abort, can you
    clarify what reject_policy was used with the test with no keep alive that
    resulted in many CLOSE_WAIT?

  • If the CLOSE_WAIT is still a problem with reject policy of abort, can
    you run the test without keep alive and the two options I asked for?

On Thu, Jun 28, 2012 at 12:11 AM, Rafał Kuć r.kuc@solr.pl wrote:
Hello!

We didn't see any CLOSE_WAIT's while we were doing performance testing
using keep alive. I'll change reject policy to abort and will see how that
goes.

*--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
ElasticSearch

First, can you make sure you ran your test with reject policy of abort
for the thread pool?
Second, Can you try two things:

  1. After you stop the load test, do you still have CLOSE_WAIT?
  2. If you run a single "client" load test, do you see CLOSE_WAIT?

-shay.banon

On Wed, Jun 27, 2012 at 11:12 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:
Hi Paul,

On Tuesday, June 26, 2012 2:50:19 AM UTC-4, Paul Brown wrote:

Hi, Otis --

The wikipedia article on TCP has a state chart that may be helpful:

http://en.wikipedia.org/http://en.wikipedia.org/wiki/Transmission_Control_Protocol
wiki/ http://en.wikipedia.org/wiki/Transmission_Control_Protocol
Transmission_Control_http://en.wikipedia.org/wiki/Transmission_Control_Protocol
Protocol http://en.wikipedia.org/wiki/Transmission_Control_Protocol

CLOSE_WAIT essentially means that the PHP app (libcurl of some sort, I
assume) hasn't done a full job of closing the connection, e.g., closing the
TCP connection but not the underlying socket, so that's where I'd look.
For example, the option CURLOPT_FORBID_REUSE might be useful.

Is that really so?
I looked at this diagram: http://en.wikipedia.http://en.wikipedia.org/wiki/File:TCP_CLOSE.svg
org/wiki/File:TCP_CLOSE.svghttp://en.wikipedia.org/wiki/File:TCP_CLOSE.svg
If I read that correctly, it looks like CLOSE_WAIT happens when the client
(left side) which initiated the connection issues a FIN, which I understand
as the client saying "I want to close this connection". After that FIN is
received by the server/receiver, that server/receiver goes into the
CLOSE_WAIT state and at that time it is supposed to answer by sending the
ACK and then (after some time?) by sending FIN going going into LAST_ACK
state and then, after client responds with ACK, into CLOSE state.

So if the server side is in CLOSE_WAIT, doesn't that mean that the server
received a FIN, but did not send ACK back to client?

Thanks,
Otis

Search Analytics - http://sematext.com/search-http://sematext.com/search-analytics/index.html
a http://sematext.com/search-analytics/index.htmlnalytics/index.htmlhttp://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/http://sematext.com/spm/index.html
index. http://sematext.com/spm/index.htmlhtmlhttp://sematext.com/spm/index.html

On Jun 25, 2012, at 8:26 PM, Otis Gospodnetic wrote:

Hello,

I've been fighting with ES that stops working after it gets hit for
several minutes by about 1200 QPS. This is happening with ES 0.19.3 and
0.19.4 on big boxes (24 cores, 96 GB RAM).

What seems to be happening is that after a while we start seeing more
and more and more CLOSE_WAIT connections between search clients (~500
frontend PHP apps) and ES, like this:

$ netstat -T | head
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address
State
tcp 325 0 11.11.11.11-static.reverse.softlayer.com:wap-wsp
184.184. http://184.184.184.184-static.reverse.softlayer.com:32035
184.184-static.http://184.184.184.184-static.reverse.softlayer.com:32035
reverse. http://184.184.184.184-static.reverse.softlayer.com:32035
softlayer.com:32035http://184.184.184.184-static.reverse.softlayer.com:32035
CLOSE_WAIT
...
...

The number of CLOSE_WAIT connections goes from being 0 for a while to
going up into thousands. And then at some point ES stops responding on
port 9200 (but still responds on port 9300).

The number of these CLOSE_WAIT connections seems to roughly correspond
to the "current_open" HTTP metric:

$ curl --silent 11.11.11.11:9200/_http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
cluster/http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
nodes/stats?network=http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
true&http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
transport=true&http=true&http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
thread_pool=true&indices=http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
falsehttp://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
&pretty=truehttp://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep 'address|current|curr|server_open'
"transport_address" : "inet[/ 11.11.11.11 :9300]",
"curr_estab" : 87,
"server_open" : 36,
"current_open" : 7,
<== healthy, just restarted
"transport_address" : "inet[/ 22.22.22.22 :9300]",
"curr_estab" : 1245,
"server_open" : 612,
"current_open" : 14,
<== healthy, just restarted
"transport_address" : "inet[/ 33.33.33.33:9300]",
"curr_estab" : 93,
"server_open" : 36,
"current_open" : 14,
<== healthy, just restarted
"transport_address" : "inet[/ 44.44.44.44 :9300]",
"curr_estab" : 171,
"server_open" : 36,
"current_open" : 15776,
<== baaad, not restarted

I tried using using threadpool (both fixed and blocking with both abort
and client rejection policies) to try stopping this "current_open" from
growing, e.g.:

threadpool:
search:
type: fixed
size: 120
queue_size: 100
reject_policy: abort

But that didn't help.

I should say that the search apps hitting this ES cluster are not using
persistent/keep-alive connections. And while this is clearly not ideal and
not efficient, I think it still shouldn't cause this "leak" that ends up
accumulating connections in CLOSE_WAIT state and eventually getting ES to
stop being responsive.

Is there anything one can do on the ES side to more aggressively close
connections?

Thanks,
Otis

Search Analytics - http://sematext.com/search-http://sematext.com/search-analytics/index.html
a http://sematext.com/search-analytics/index.htmlnalytics/index.htmlhttp://sematext.com/search-analytics/index.html

Scalable Performance Monitoring - http://sematext.com/spm/http://sematext.com/spm/index.html
index. http://sematext.com/spm/index.htmlhtmlhttp://sematext.com/spm/index.html


(Shay Banon) #15

Ok, btw, did you manage to do the tests I asked for before?
Also, can you check what the raw HTTP request the PHP clients are sending.
A simple dump of a single request (especially the HTTP version passed and
the headers) would be enough.

On Fri, Jun 29, 2012 at 8:37 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

Hi,

If it matters - this PHP application is really running on 500 front-end
servers. Each of them is connecting to ES independently. And on top of
that the PHP process is not persistent, so a new connection is created
(with PHP curl lib) for each query sent to ES. (not our design! :))

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html

On Friday, June 29, 2012 3:26:39 AM UTC-4, Rafał Kuć wrote:

Hello!

PHP application is running on different machines. We will gist the jstack
as soon as CLOSE_WAIT situation will happen again.

*--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
ElasticSearch

Another question, are the PHP clients running on the same nodes as
elasticsearch, or on different nodes (I assume on different nodes, just
want to make sure).

I have a suspicion that maybe netty, the networking library we use, does
not get around to actually close the connections because of the load on the
system. Trying to chase that one down with Trustin. When this happens, can
you gist a thread dump (jstack) just so we have it?

But, as suggested, the best thing to do is to use keep alive. nginx by
the way can be used to abstract that nicely as Karmi found out, see here:
https://gist.github.com/0a2b0e0df83813a4045f/
d237b1f2425353edc3b13c593ee2f9**60dfae0fcahttps://gist.github.com/0a2b0e0df83813a4045f/d237b1f2425353edc3b13c593ee2f960dfae0fca
.

On Thu, Jun 28, 2012 at 9:38 AM, Rafał Kuć r.kuc@solr.pl wrote:
Hello!

Right now, after a few hours the number of CLOSE_WAIT connections are >
20k on all nodes in the cluster, which makes ElasticSearch unresponsive to
the API calls on 9200 port. However, clients using Java API are still
working without a problem.

*--
Regards,
Rafał Kuć
Sematext ::
http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

Hi Shay,

This is what we are using now:

network:
tcp:
keep_alive: true # also tried setting to false

timeout: 5s # this didn't help

threadpool:
search:
type: fixed # also tried blocking
size: 40
queue_size: 10
reject_policy: abort # also tried client

And ES sees this:

$ curl --silent '11.11.11.11:9200/cluster/nodes/stats?network=true&
transport=true&http=true&**thread_pool=true&indices=**false&pretty=truehttp://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep 'address|current|curr|server
**open'
"transport_address" : "inet[/11.11.11.11:9300]",
"curr_estab" : 764,
"server_open" : 360,
"current_open" : 1053, <== this is approximately the
number of CLOSE_WAITs as shown by netstat -T

And this is where we see the threadpool numbers:

$ curl --silent '11.11.11.11:9200/_cluster/nodes/stats?network=true&
transport=true&http=true&**thread_pool=true&indices=**false&pretty=truehttp://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep -C 4 search"

    "search" : {
      "threads" : 40,
      "queue" : 0,
      "active" : 6

The above is actually from a production environment we are trying to
stabilize.

So because of only 40 threads, we do see rejections:

[2012-06-27 21:23:51,541][WARN ][search.action ] [search 2]
Failed to send release search context
org.elasticsearch.transport.**RemoteTransportException: [search
4][inet[/11.11.11.11:9300]][**search/freeContext]
Caused by: org.elasticsearch.common.util.concurrent.
EsRejectedExecutionException
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.
rejectedExecution(**EsAbortPolicy.java:33)
at java.util.concurrent.ThreadPoolExecutor.reject(
ThreadPoolExecutor.java:767)

But CLOSE_WAITs are still piling up.

The tricky thing is that under a certain load (< 1200 QPS) CLOSE_WAITs
are nowhere to be found. The number of ESTABLISHED connections is more or
less constant and the number of CLOSE_WAITs is 0.
But a few minutes after we increase the load to > 1200 QPS we start
seeing CLOSE_WAITs and they just keep increasing.
Actually, now that I look at things, I see that even the number of
ESTABLISHED connections start to grow at some point, too. Not as fast as
CLOSE_WAITs, but growing.

Otis

Search Analytics - http://sematext.com/search-**analytics/index.htmlhttp://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/**index.htmlhttp://sematext.com/spm/index.html

On Wednesday, June 27, 2012 6:37:49 PM UTC-4, Rafał Kuć wrote:
Hello!

We tried different settings on nodes - one of it was reject policy abort
with size 500 and 120, which resulted in CLOSE_WAIT's under a certain load.
Right now I've configured all the nodes in the cluster to 'reject_policy:
abort'.

As for the tests, I think it won't be a problem, but first, lets see how
ElasticSearch will behave with the current reject policy.

Ah, just to clarify things, we did try 0.19.3, 0.19.4 and now we are
running 0.19.7 :slight_smile:

*--
Thanks,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
ElasticSearch

  • Based on otis first post, he was using reject_policy of abort, can
    you clarify what reject_policy was used with the test with no keep alive
    that resulted in many CLOSE_WAIT?

  • If the CLOSE_WAIT is still a problem with reject policy of abort, can
    you run the test without keep alive and the two options I asked for?

On Thu, Jun 28, 2012 at 12:11 AM, Rafał Kuć r.kuc@solr.pl wrote:
Hello!

We didn't see any CLOSE_WAIT's while we were doing performance testing
using keep alive. I'll change reject policy to abort and will see how that
goes.

*--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
ElasticSearch

First, can you make sure you ran your test with reject policy of abort
for the thread pool?
Second, Can you try two things:

  1. After you stop the load test, do you still have CLOSE_WAIT?
  2. If you run a single "client" load test, do you see CLOSE_WAIT?

-shay.banon

On Wed, Jun 27, 2012 at 11:12 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:
Hi Paul,

On Tuesday, June 26, 2012 2:50:19 AM UTC-4, Paul Brown wrote:

Hi, Otis --

The wikipedia article on TCP has a state chart that may be helpful:

http://en.wikipedia.org/http://en.wikipedia.org/wiki/Transmission_Control_Protocol
wiki**/ http://en.wikipedia.org/wiki/Transmission_Control_Protocol
Transmission_Control_http://en.wikipedia.org/wiki/Transmission_Control_Protocol
Protocol http://en.wikipedia.org/wiki/Transmission_Control_Protocol**

CLOSE_WAIT essentially means that the PHP app (libcurl of some sort, I
assume) hasn't done a full job of closing the connection, e.g., closing the
TCP connection but not the underlying socket, so that's where I'd look.
For example, the option CURLOPT_FORBID_REUSE might be useful.

Is that really so?
I looked at this diagram: http://en.wikipedia.http://en.wikipedia.org/wiki/File:TCP_CLOSE.svg
o**rg/wiki/File:TCP_CLOSE.svghttp://en.wikipedia.org/wiki/File:TCP_CLOSE.svg
If I read that correctly, it looks like CLOSE_WAIT happens when the
client (left side) which initiated the connection issues a FIN, which I
understand as the client saying "I want to close this connection". After
that FIN is received by the server/receiver, that server/receiver goes into
the CLOSE_WAIT state and at that time it is supposed to answer by sending
the ACK and then (after some time?) by sending FIN going going into
LAST_ACK state and then, after client responds with ACK, into CLOSE state.

So if the server side is in CLOSE_WAIT, doesn't that mean that the server
received a FIN, but did not send ACK back to client?

Thanks,
Otis

Search Analytics - http://sematext.com/search-http://sematext.com/search-analytics/index.html
a http://sematext.com/search-analytics/index.htmlnalytics/index.htmlhttp://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/http://sematext.com/spm/index.html
inde
x. http://sematext.com/spm/index.htmlhtmlhttp://sematext.com/spm/index.html

On Jun 25, 2012, at 8:26 PM, Otis Gospodnetic wrote:

Hello,

I've been fighting with ES that stops working after it gets hit for
several minutes by about 1200 QPS. This is happening with ES 0.19.3 and
0.19.4 on big boxes (24 cores, 96 GB RAM).

What seems to be happening is that after a while we start seeing more
and more and more CLOSE_WAIT connections between search clients (~500
frontend PHP apps) and ES, like this:

$ netstat -T | head
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address
State
tcp 325 0 11.11.11.11-static.reverse.softlayer.com:wap-wsp
184.184. http://184.184.184.184-static.reverse.softlayer.com:32035

184.184-static.http://184.184.184.184-static.reverse.softlayer.com:32035
reverse. http://184.184.184.184-static.reverse.softlayer.com:32035
softlay**er.com:32035http://184.184.184.184-static.reverse.softlayer.com:32035
CLOSE_WAIT
...
...

The number of CLOSE_WAIT connections goes from being 0 for a while to
going up into thousands. And then at some point ES stops responding on
port 9200 (but still responds on port 9300).

The number of these CLOSE_WAIT connections seems to roughly
correspond to the "current_open" HTTP metric:

$ curl --silent 11.11.11.11:9200/http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
clu**ster/http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
nodes/stats?network=http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
true&http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
transport=true&http=true&http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
threa
d_pool=true&indices=http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
falsehttp://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
&**pretty=truehttp://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep 'address|current|curr|server
**open'
"transport_address" : "inet[/ 11.11.11.11 :9300]",
"curr_estab" : 87,
"server_open" : 36,
"current_open" : 7,
<== healthy, just restarted
"transport_address" : "inet[/ 22.22.22.22 :9300]",
"curr_estab" : 1245,
"server_open" : 612,
"current_open" : 14,
<== healthy, just restarted
"transport_address" : "inet[/ 33.33.33.33:9300]",
"curr_estab" : 93,
"server_open" : 36,
"current_open" : 14,
<== healthy, just restarted
"transport_address" : "inet[/ 44.44.44.44 :9300]",
"curr_estab" : 171,
"server_open" : 36,
"current_open" : 15776,
<== baaad, not restarted

I tried using using threadpool (both fixed and blocking with both abort
and client rejection policies) to try stopping this "current_open" from
growing, e.g.:

threadpool:
search:
type: fixed
size: 120
queue_size: 100
reject_policy: abort

But that didn't help.

I should say that the search apps hitting this ES cluster are not using
persistent/keep-alive connections. And while this is clearly not ideal and
not efficient, I think it still shouldn't cause this "leak" that ends up
accumulating connections in CLOSE_WAIT state and eventually getting ES to
stop being responsive.

Is there anything one can do on the ES side to more aggressively close
connections?

Thanks,
Otis

Search Analytics - http://sematext.com/search-http://sematext.com/search-analytics/index.html
a http://sematext.com/search-analytics/index.html**nalytics/index.htmlhttp://sematext.com/search-analytics/index.html

Scalable Performance Monitoring - http://sematext.com/spm/http://sematext.com/spm/index.html
inde**x. http://sematext.com/spm/index.htmlhtmlhttp://sematext.com/spm/index.html


(Rafał Kuć) #16

Hi,

I've managed to do that test today. I've run the following query:

{"size":10,"from":0,"sort":["_score"],"fields":["_source"],"query":{"query_string":{"query":"high","default_field":"tags"}},"facets":{"tags":{"terms":{"field":"tags","size":40}}}}

in a loop to one of the nodes using a simple simple bash loop. After
running that query the number of CLOSE_WAIT's increased from 0 about 150 -
200, but slowly started to drop as the time was passing. Basically I
couldn't force a single machine running queries to cause CLOSE_WAIT's to be
killing the node.

One more thing we didn't say before - we have Thrift plugin installed and
working.

Thanks
Rafał

One thing I noticed, is when the load get bigger the

W dniu sobota, 30 czerwca 2012 18:35:00 UTC+2 użytkownik kimchy napisał:

Ok, btw, did you manage to do the tests I asked for before?
Also, can you check what the raw HTTP request the PHP clients are sending.
A simple dump of a single request (especially the HTTP version passed and
the headers) would be enough.

On Fri, Jun 29, 2012 at 8:37 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

Hi,

If it matters - this PHP application is really running on 500 front-end
servers. Each of them is connecting to ES independently. And on top of
that the PHP process is not persistent, so a new connection is created
(with PHP curl lib) for each query sent to ES. (not our design! :))

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/index.html

On Friday, June 29, 2012 3:26:39 AM UTC-4, Rafał Kuć wrote:

Hello!

PHP application is running on different machines. We will gist the
jstack as soon as CLOSE_WAIT situation will happen again.

*--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
ElasticSearch

Another question, are the PHP clients running on the same nodes as
elasticsearch, or on different nodes (I assume on different nodes, just
want to make sure).

I have a suspicion that maybe netty, the networking library we use, does
not get around to actually close the connections because of the load on the
system. Trying to chase that one down with Trustin. When this happens, can
you gist a thread dump (jstack) just so we have it?

But, as suggested, the best thing to do is to use keep alive. nginx by
the way can be used to abstract that nicely as Karmi found out, see here:
https://gist.github.com/0a2b0e0df83813a4045f/
d237b1f2425353edc3b13c593ee2f9**60dfae0fcahttps://gist.github.com/0a2b0e0df83813a4045f/d237b1f2425353edc3b13c593ee2f960dfae0fca
.

On Thu, Jun 28, 2012 at 9:38 AM, Rafał Kuć r.kuc@solr.pl wrote:
Hello!

Right now, after a few hours the number of CLOSE_WAIT connections are >
20k on all nodes in the cluster, which makes ElasticSearch unresponsive to
the API calls on 9200 port. However, clients using Java API are still
working without a problem.

*--
Regards,
Rafał Kuć
Sematext ::
http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

Hi Shay,

This is what we are using now:

network:
tcp:
keep_alive: true # also tried setting to false

timeout: 5s # this didn't help

threadpool:
search:
type: fixed # also tried blocking
size: 40
queue_size: 10
reject_policy: abort # also tried client

And ES sees this:

$ curl --silent '11.11.11.11:9200/cluster/nodes/stats?network=true&
transport=true&http=true&**thread_pool=true&indices=**false&pretty=truehttp://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep 'address|current|curr|server
**open'
"transport_address" : "inet[/11.11.11.11:9300]",
"curr_estab" : 764,
"server_open" : 360,
"current_open" : 1053, <== this is approximately the
number of CLOSE_WAITs as shown by netstat -T

And this is where we see the threadpool numbers:

$ curl --silent '11.11.11.11:9200/_cluster/nodes/stats?network=true&
transport=true&http=true&**thread_pool=true&indices=**false&pretty=truehttp://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep -C 4 search"

    "search" : {
      "threads" : 40,
      "queue" : 0,
      "active" : 6

The above is actually from a production environment we are trying to
stabilize.

So because of only 40 threads, we do see rejections:

[2012-06-27 21:23:51,541][WARN ][search.action ] [search 2]
Failed to send release search context
org.elasticsearch.transport.**RemoteTransportException: [search
4][inet[/11.11.11.11:9300]][**search/freeContext]
Caused by: org.elasticsearch.common.util.concurrent.
EsRejectedExecutionException
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.
rejectedExecution(**EsAbortPolicy.java:33)
at java.util.concurrent.ThreadPoolExecutor.reject(
ThreadPoolExecutor.java:767)

But CLOSE_WAITs are still piling up.

The tricky thing is that under a certain load (< 1200 QPS) CLOSE_WAITs
are nowhere to be found. The number of ESTABLISHED connections is more or
less constant and the number of CLOSE_WAITs is 0.
But a few minutes after we increase the load to > 1200 QPS we start
seeing CLOSE_WAITs and they just keep increasing.
Actually, now that I look at things, I see that even the number of
ESTABLISHED connections start to grow at some point, too. Not as fast as
CLOSE_WAITs, but growing.

Otis

Search Analytics - http://sematext.com/search-**analytics/index.htmlhttp://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/**index.htmlhttp://sematext.com/spm/index.html

On Wednesday, June 27, 2012 6:37:49 PM UTC-4, Rafał Kuć wrote:
Hello!

We tried different settings on nodes - one of it was reject policy abort
with size 500 and 120, which resulted in CLOSE_WAIT's under a certain load.
Right now I've configured all the nodes in the cluster to 'reject_policy:
abort'.

As for the tests, I think it won't be a problem, but first, lets see how
ElasticSearch will behave with the current reject policy.

Ah, just to clarify things, we did try 0.19.3, 0.19.4 and now we are
running 0.19.7 :slight_smile:

*--
Thanks,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
ElasticSearch

  • Based on otis first post, he was using reject_policy of abort, can
    you clarify what reject_policy was used with the test with no keep alive
    that resulted in many CLOSE_WAIT?

  • If the CLOSE_WAIT is still a problem with reject policy of abort, can
    you run the test without keep alive and the two options I asked for?

On Thu, Jun 28, 2012 at 12:11 AM, Rafał Kuć r.kuc@solr.pl wrote:
Hello!

We didn't see any CLOSE_WAIT's while we were doing performance testing
using keep alive. I'll change reject policy to abort and will see how that
goes.

*--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
ElasticSearch

First, can you make sure you ran your test with reject policy of abort
for the thread pool?
Second, Can you try two things:

  1. After you stop the load test, do you still have CLOSE_WAIT?
  2. If you run a single "client" load test, do you see CLOSE_WAIT?

-shay.banon

On Wed, Jun 27, 2012 at 11:12 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:
Hi Paul,

On Tuesday, June 26, 2012 2:50:19 AM UTC-4, Paul Brown wrote:

Hi, Otis --

The wikipedia article on TCP has a state chart that may be helpful:

http://en.wikipedia.org/http://en.wikipedia.org/wiki/Transmission_Control_Protocol
wiki**/ http://en.wikipedia.org/wiki/Transmission_Control_Protocol
Transmission_Control_http://en.wikipedia.org/wiki/Transmission_Control_Protocol
Protocol http://en.wikipedia.org/wiki/Transmission_Control_Protocol**

CLOSE_WAIT essentially means that the PHP app (libcurl of some sort, I
assume) hasn't done a full job of closing the connection, e.g., closing the
TCP connection but not the underlying socket, so that's where I'd look.
For example, the option CURLOPT_FORBID_REUSE might be useful.

Is that really so?
I looked at this diagram: http://en.wikipedia.http://en.wikipedia.org/wiki/File:TCP_CLOSE.svg
o**rg/wiki/File:TCP_CLOSE.svghttp://en.wikipedia.org/wiki/File:TCP_CLOSE.svg
If I read that correctly, it looks like CLOSE_WAIT happens when the
client (left side) which initiated the connection issues a FIN, which I
understand as the client saying "I want to close this connection". After
that FIN is received by the server/receiver, that server/receiver goes into
the CLOSE_WAIT state and at that time it is supposed to answer by sending
the ACK and then (after some time?) by sending FIN going going into
LAST_ACK state and then, after client responds with ACK, into CLOSE state.

So if the server side is in CLOSE_WAIT, doesn't that mean that the
server received a FIN, but did not send ACK back to client?

Thanks,
Otis

Search Analytics - http://sematext.com/search-http://sematext.com/search-analytics/index.html
a http://sematext.com/search-analytics/index.htmlnalytics/index.htmlhttp://sematext.com/search-analytics/index.html
Scalable Performance Monitoring - http://sematext.com/spm/http://sematext.com/spm/index.html
inde
x. http://sematext.com/spm/index.htmlhtmlhttp://sematext.com/spm/index.html

On Jun 25, 2012, at 8:26 PM, Otis Gospodnetic wrote:

Hello,

I've been fighting with ES that stops working after it gets hit for
several minutes by about 1200 QPS. This is happening with ES 0.19.3 and
0.19.4 on big boxes (24 cores, 96 GB RAM).

What seems to be happening is that after a while we start seeing more
and more and more CLOSE_WAIT connections between search clients (~500
frontend PHP apps) and ES, like this:

$ netstat -T | head
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address
State
tcp 325 0 11.11.11.11-static.reverse.**
softlayer.com:wap-wsp 184.184.http://184.184.184.184-static.reverse.softlayer.com:32035
184.184-static.http://184.184.184.184-static.reverse.softlayer.com:32035
reverse. http://184.184.184.184-static.reverse.softlayer.com:32035
softlay
er.com:32035http://184.184.184.184-static.reverse.softlayer.com:32035
CLOSE_WAIT
...
...

The number of CLOSE_WAIT connections goes from being 0 for a while to
going up into thousands. And then at some point ES stops responding on
port 9200 (but still responds on port 9300).

The number of these CLOSE_WAIT connections seems to roughly
correspond to the "current_open" HTTP metric:

$ curl --silent 11.11.11.11:9200/http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
clu**ster/http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
nodes/stats?network=http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
true&http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
transport=true&http=true&http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
threa
d_pool=true&indices=http://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
falsehttp://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true
&**pretty=truehttp://11.11.11.11:9200/_cluster/nodes/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'
| egrep 'address|current|curr|server
**open'
"transport_address" : "inet[/ 11.11.11.11 :9300]",
"curr_estab" : 87,
"server_open" : 36,
"current_open" : 7,
<== healthy, just restarted
"transport_address" : "inet[/ 22.22.22.22 :9300]",
"curr_estab" : 1245,
"server_open" : 612,
"current_open" : 14,
<== healthy, just restarted
"transport_address" : "inet[/ 33.33.33.33:9300]",
"curr_estab" : 93,
"server_open" : 36,
"current_open" : 14,
<== healthy, just restarted
"transport_address" : "inet[/ 44.44.44.44 :9300]",
"curr_estab" : 171,
"server_open" : 36,
"current_open" : 15776,
<== baaad, not restarted

I tried using using threadpool (both fixed and blocking with both
abort and client rejection policies) to try stopping this "current_open"
from growing, e.g.:

threadpool:
search:
type: fixed
size: 120
queue_size: 100
reject_policy: abort

But that didn't help.

I should say that the search apps hitting this ES cluster are not
using persistent/keep-alive connections. And while this is clearly not
ideal and not efficient, I think it still shouldn't cause this "leak" that
ends up accumulating connections in CLOSE_WAIT state and eventually getting
ES to stop being responsive.

Is there anything one can do on the ES side to more aggressively close
connections?

Thanks,
Otis

Search Analytics - http://sematext.com/search-http://sematext.com/search-analytics/index.html
a http://sematext.com/search-analytics/index.html**nalytics/index.htmlhttp://sematext.com/search-analytics/index.html

Scalable Performance Monitoring - http://sematext.com/spm/http://sematext.com/spm/index.html
inde**x. http://sematext.com/spm/index.htmlhtmlhttp://sematext.com/spm/index.html


(system) #17