Runaway ES TransportClient threads after ES node failure


(Paul Smith) #1

I posted this on IRC, but obviously my GMT+11 timezone is not friendly, so
as a backup I'm posting the text here for anyone who might have experience
with this:

I have an application using a TransportClient configured to connect to a
2-node ES cluster (I'll leave aside for now why we have to use the
TransportClient, but there is a rationale).

One of the ES nodes had a faulty backplane and died.

ES of course kept on trucking with the other node

However, since that event the application client has burnt a hell of a lot
of CPU,

which, looking at the thread dumps, appears to be in the "New I/O client
worker #1-5 daemon" style threads used by ES.

I thought somehow with the one ES node dead there's some looping logic
trying to re-establish connection to it.

so I waited till the Dell guys replaced the backplane and we restored that
node

once back in green state I was hoping the CPU burn would go away, but alas
no.

now looking at one of our other instances running in a similar config, I
note the ES app threads are always runnable because of the NIO, but they're
generally in a sleep state looking at them.

has anyone else seen this sort of problem?

I'm just gathering a known 'good' thread dump to compare this with.

Here's a gist: https://gist.github.com/1440329
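
For comparing a 'good' and a 'bad' instance, per-thread CPU time can also be
sampled from inside the client JVM rather than read off thread dumps. The
following is a minimal sketch using the standard java.lang.management API,
not anything from ES itself. The class name WorkerCpuSampler and the method
sample are hypothetical, and the thread-name prefix is simply the one quoted
from the thread dumps above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: measures how much CPU time threads with a given
// name prefix consume over a short sampling window.
public class WorkerCpuSampler {

    // Returns CPU nanoseconds used during the window, keyed by
    // "threadName (id=N) [STATE]". Returns an empty map if the JVM
    // does not support or has disabled thread CPU time measurement.
    public static Map<String, Long> sample(String namePrefix, long windowMillis)
            throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<String, Long> delta = new HashMap<>();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return delta;
        }

        // First reading: CPU time per live thread id.
        Map<Long, Long> before = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            before.put(id, mx.getThreadCpuTime(id));
        }

        Thread.sleep(windowMillis);

        // Second reading: keep only matching threads seen in both readings.
        for (long id : mx.getAllThreadIds()) {
            ThreadInfo info = mx.getThreadInfo(id);
            Long start = before.get(id);
            long end = mx.getThreadCpuTime(id);
            if (info == null || start == null || start < 0 || end < 0) {
                continue; // thread started or died during the window
            }
            if (info.getThreadName().startsWith(namePrefix)) {
                String key = info.getThreadName() + " (id=" + id + ") ["
                        + info.getThreadState() + "]";
                delta.put(key, end - start);
            }
        }
        return delta;
    }

    public static void main(String[] args) throws InterruptedException {
        // Prefix assumed from the thread names mentioned in this thread.
        sample("New I/O client worker", 1000).forEach((name, nanos) ->
                System.out.printf("%s: %d ms CPU in 1 s%n", name, nanos / 1_000_000));
    }
}
```

A worker that is RUNNABLE but idle in a selector should show close to zero
CPU over the window, while a spinning one should show a value near the
window length. The sampler only sees threads in the JVM it runs in, so it
would need to be called from within the affected application.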


(Shay Banon) #2

Heya Paul, which version are you using? It sounds like a problem in netty
(the networking layer elasticsearch uses) which has been fixed in the
latest netty version (https://github.com/netty/netty/issues/74), and that
fix is included in the latest version of elasticsearch (0.18.5).


(Paul Smith) #3

Oh geez, bad form by me not quoting the version. Yes, 0.17.9 is what we're
using. I'm planning on upgrading to 0.18.x in the next month, so that's good
news.

Thanks Shay.

