Total threads in use increases without bound until node crashes

Ben_Siemon · May 23, 2013, 8:52pm

I have a single node setup on a development server.

VM name: Java HotSpot(TM) 64-Bit Server VM
VM vendor: Sun Microsystems Inc.
VM version: 16.3-b01
Java version: 1.6.0_20

ES Version 0.90.0

When I start the node the thread count begins to rise until the node is
unable to create any new threads due to OOM errors. I am not doing any
queries/indexing while this happens.

I am viewing the total thread count through the bigdesk plugin.

Most of my active threads are in the generic group which is set to cached.

Is there a way for me to see where all these concurrent requests are
coming from and why they are being served from the generic group?
What is the generic thread pool for?
Could these management plugins be causing this overload?
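
For readers wondering why a pool "set to cached" can grow like this, here is a minimal sketch. It assumes the Elasticsearch "cached" pool type behaves like the JDK's cached thread pool (no core threads, no upper bound, a hand-off queue); that is an assumption about ES internals, and the class and method names are hypothetical. With that configuration, every task submitted while all existing workers are blocked gets a brand-new thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class CachedPoolGrowth {
    /** Submits `tasks` jobs that all block, then reports how many threads the pool created. */
    static int poolSizeAfterBlockingTasks(int tasks) {
        // Same parameters the JDK uses for Executors.newCachedThreadPool():
        // zero core threads, unbounded maximum, 60s keep-alive, hand-off queue.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS, new SynchronousQueue<Runnable>());
        CountDownLatch started = new CountDownLatch(tasks);
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    started.countDown();
                    try {
                        release.await(); // simulate a task stuck waiting on something
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            started.await(10, TimeUnit.SECONDS);
            return pool.getPoolSize();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks to start", e);
        } finally {
            release.countDown(); // unblock the workers so the pool can shut down
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("threads created: " + poolSizeAfterBlockingTasks(50));
    }
}
```

If the real pool works the same way, a steadily rising thread count with `active` close to `threads` points at tasks that never finish rather than at a high request rate.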

JSON output of the in-use threads:

thread_pool: {
  generic: {
    threads: 268
    queue: 0
    active: 266
    rejected: 0
    largest: 268
    completed: 2896
  }
  index: {
    threads: 0
    queue: 0
    active: 0
    rejected: 0
    largest: 0
    completed: 0
  }
  get: {
    threads: 0
    queue: 0
    active: 0
    rejected: 0
    largest: 0
    completed: 0
  }
  snapshot: {
    threads: 0
    queue: 0
    active: 0
    largest: 0
    completed: 0
  }
  merge: {
    threads: 0
    queue: 0
    active: 0
    largest: 0
    completed: 0
  }
  bulk: {
    threads: 0
    queue: 0
    active: 0
    rejected: 0
    largest: 0
    completed: 0
  }
  warmer: {
    threads: 1
    queue: 0
    active: 0
    largest: 1
    completed: 22
  }
  flush: {
    threads: 1
    queue: 0
    active: 0
    largest: 1
    completed: 1
  }
  search: {
    threads: 5
    queue: 0
    active: 0
    rejected: 0
    largest: 5
    completed: 5
  }
  percolate: {
    threads: 0
    queue: 0
    active: 0
    rejected: 0
    largest: 0
    completed: 0
  }
  management: {
    threads: 3
    queue: 0
    active: 1
    largest: 3
    completed: 2102
  }
  refresh: {
    threads: 0
    queue: 0
    active: 0
    largest: 0
    completed: 0
  }
}

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Roy_Russo · May 24, 2013, 1:17am

Try running bigdesk as the standalone app rather than in plugin mode. That
would settle your question about whether the plugin is causing the situation
(which I highly doubt).

Alternatively (shameless plug), use elastichq.org. Both should report the
same thing via the REST APIs and not cause overhead.

Roy_Russo · May 24, 2013, 1:18am

Forgot to mention... you may also want to look at active threads:
http://www.elasticsearch.org/guide/reference/api/admin-cluster-nodes-hot-threads/
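
A minimal sketch of pulling that output from a node, for anyone following along. The `/_nodes/hot_threads` path, the `localhost` host, and port 9200 are assumptions (check the linked reference page for the endpoint your version exposes), and the class name is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class HotThreadsFetcher {
    /** Builds the request URL; the /_nodes/hot_threads path is an assumption. */
    static String hotThreadsUrl(String host, int port) {
        return "http://" + host + ":" + port + "/_nodes/hot_threads";
    }

    /** Performs a plain HTTP GET and returns the response body as text. */
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(10000);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumed host and port of the node under investigation.
        System.out.print(fetch(hotThreadsUrl("localhost", 9200)));
    }
}
```

Fetching the output directly over HTTP also takes the monitoring plugins out of the picture, which helps answer whether they contribute to the problem.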

simonw_2 · May 24, 2013, 8:26am

You are running a very old Java version (1.6.0_20) that is liable to break
a lot of things. Can you please update to a newer VM and try to reproduce
the error? The output Roy mentioned might still be very helpful.

thanks

Ben_Siemon · May 24, 2013, 5:43pm

Unfortunately I am unable to change the Java version we use on any tier
without a great deal of hassle. We don't see this behavior on 0.20 on the
same machine configurations.

Are there changes in 0.90 with respect to the generic thread pool that
would be sensitive to VM changes?

It seems like downgrading to 0.20 is the only option for me.

--
Ben Siemon
Senior Software Engineer, Engineering
Opower http://www.opower.com

simonw_2 · May 24, 2013, 7:30pm

Ben, can you please get me the output of hot threads? I really want to
track this down. Can you also tell me which plugins you have installed?

There must be a way to figure out what is going on on your side, so please
give us some insight.

simon

Ben_Siemon · May 24, 2013, 7:44pm

2.0% (10ms out of 500ms) cpu usage by thread 'elasticsearch[apdv001.va.opower.it][http_server_worker][T#9]{New I/O worker #107}'
 10/10 snapshots sharing following 15 elements
   sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
   sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
   sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
   sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
   sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
   org.elasticsearch.common.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:64)
   org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:409)
   org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:206)
   org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88)
   org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
   org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
   org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
   java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   java.lang.Thread.run(Thread.java:619)

0.0% (0s out of 500ms) cpu usage by thread 'Reference Handler'
 10/10 snapshots sharing following 3 elements
   java.lang.Object.wait(Native Method)
   java.lang.Object.wait(Object.java:485)
   java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)

0.0% (0s out of 500ms) cpu usage by thread 'Finalizer'
 10/10 snapshots sharing following 4 elements
   java.lang.Object.wait(Native Method)
   java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
   java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
   java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

This is ~7 minutes after a restart. The thread count has already increased
from 162 to 226.

The plugins I have installed are bigdesk and head.

As part of my own investigation I downloaded and installed 0.90.0 to my
home directory on this dev tier app server. Then I ran elasticsearch -f to
see if it would exhibit the same behavior. Strangely, it did not. This seems
to rule out any Java version problems. We installed the instance with the
thread growth problem via Puppet and the provided RPM. I am going to
continue to look at what the differences are.

What work runs on the generic thread pool within the ES daemon? This
'generic' pool is the one that is growing, and the only one that is even
getting work, since I have no index or query operations running at present.

Thank you very much for your help!

simonw_2 · May 24, 2013, 9:03pm

This looks pretty much OK to me. What I would still like to see is a thread
dump of the node in question. The fact that a vanilla 0.90 node doesn't have
the problem is a good thing IMO. Let's track this down further. Can you
produce a thread dump using jstack <pid> >> threaddumps.log?
This is a shot in the dark, but I guess that something blocks and keeps
creating new threads, so I am wondering what they are waiting on. The thread
dump should tell us.
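
A dump with hundreds of threads is tedious to read by eye, so here is a rough sketch for summarizing one. It assumes that thread header lines in the jstack output start with the quoted thread name, and that Elasticsearch threads are named `elasticsearch[<node>][<pool>]...` as in the hot threads output earlier in the thread. The class name and the `(other)` bucket are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ThreadDumpSummary {
    // Assumption: each thread's header line begins with its name in double quotes.
    private static final Pattern HEADER = Pattern.compile("^\"([^\"]+)\"");
    // Assumption: ES thread names look like elasticsearch[<node>][<pool>]...
    private static final Pattern ES_POOL =
            Pattern.compile("^elasticsearch\\[[^\\]]*\\]\\[([^\\]]+)\\]");

    /** Counts threads per pool name; threads that do not match go under "(other)". */
    static Map<String, Integer> countByPool(List<String> dumpLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : dumpLines) {
            Matcher header = HEADER.matcher(line);
            if (!header.find()) {
                continue; // stack frame or blank line, not a thread header
            }
            Matcher pool = ES_POOL.matcher(header.group(1));
            String key = pool.find() ? pool.group(1) : "(other)";
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the file name used in the jstack command above.
        String file = args.length > 0 ? args[0] : "threaddumps.log";
        List<String> lines = Files.readAllLines(Paths.get(file));
        countByPool(lines).forEach((pool, n) -> System.out.println(n + "\t" + pool));
    }
}
```

Once the dominant pool is confirmed, the stack frames under its thread headers show what those threads are waiting on.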

thanks,

simon

Ben_Siemon · May 24, 2013, 9:51pm

Thread dump attached. It is a little long because of all the active threads. I
figured I would take the dump after the node had been running for a while to
get as much detail as possible.
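
With a dump this long, it can help to count threads per pool and state before reading individual stacks. Below is a minimal sketch in Python. It assumes the usual HotSpot `jstack` text layout (a quoted thread name on a header line, followed by a `java.lang.Thread.State:` line) and Elasticsearch thread names of the form `elasticsearch[node][pool][T#n]`, as seen in the hot threads output. The function name `summarize_dump` is made up for illustration.

```python
import re
import sys
from collections import Counter

# Header line of a thread entry: starts with the quoted thread name.
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]*)"')
# Assumed Elasticsearch naming scheme: elasticsearch[node][pool][T#n]
ES_POOL = re.compile(r'^elasticsearch\[[^\]]*\]\[(?P<pool>[^\]]+)\]')
# State line that follows the header for Java threads.
STATE = re.compile(r'java\.lang\.Thread\.State: (?P<state>[A-Z_]+)')


def summarize_dump(text):
    """Count threads in a jstack dump, keyed by (pool, state).

    Threads whose names do not follow the assumed Elasticsearch scheme
    are grouped under "(other)". Entries without a state line (for
    example internal VM threads) are not counted.
    """
    counts = Counter()
    pool = None
    for line in text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            es_name = ES_POOL.match(header.group("name"))
            pool = es_name.group("pool") if es_name else "(other)"
            continue
        state = STATE.search(line)
        if state and pool is not None:
            counts[(pool, state.group("state"))] += 1
            pool = None  # only the first state line belongs to this thread
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as dump_file:
        summary = summarize_dump(dump_file.read())
    for (pool, state), count in summary.most_common():
        print(count, pool, state)
```

Run against the file written by `jstack <pid> >> threaddumps.log`, a large count for one pool in a waiting or blocked state points at the stacks worth reading first.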

On Fri, May 24, 2013 at 5:03 PM, simonw <simon.willnauer@elasticsearch.com> wrote:

This looks pretty much OK to me. Still, I would like to see a thread dump of
the node in question. That a vanilla 0.90 node doesn't have the problem is a
good thing IMO. Let's track this down further. Can you produce a thread dump
using `jstack <pid> >> threaddumps.log`?
This is a blind shot, but I guess that something blocks and creates new
threads all the time, so I am wondering what they wait on. The thread dump
should tell us.

thanks,

simon

On Friday, May 24, 2013 9:44:02 PM UTC+2, Ben Siemon wrote:

2.0% (10ms out of 500ms) cpu usage by thread 'elasticsearch[apdv001.va.opower.it][http_server_worker][T#9]{New I/O worker #107}'
 10/10 snapshots sharing following 15 elements
   sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
   sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
   sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
   sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
   sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
   org.elasticsearch.common.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:64)
   org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:409)
   org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:206)
   org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88)
   org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
   org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
   org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
   java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   java.lang.Thread.run(Thread.java:619)

0.0% (0s out of 500ms) cpu usage by thread 'Reference Handler'
 10/10 snapshots sharing following 3 elements
   java.lang.Object.wait(Native Method)
   java.lang.Object.wait(Object.java:485)
   java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)

0.0% (0s out of 500ms) cpu usage by thread 'Finalizer'
 10/10 snapshots sharing following 4 elements
   java.lang.Object.wait(Native Method)
   java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
   java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
   java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

This is ~7 minutes after a restart. The thread count has already
increased from 162 to 226.

I have bigdesk and head installed for plugins.

As part of my own investigation I downloaded and installed 0.90.0 to my
home directory on this dev tier app server. Then I ran `elasticsearch -f` to
see if it would exhibit the same behavior. Strangely, it did not. This seems
to rule out any Java version problems. We installed the instance with the
thread growth problem via Puppet and the provided RPM. I am going to
continue to look at what the differences are.

What uses the generic thread pool within the ES daemon? This
'generic' pool is the one that is growing, and it is the only one that is even
getting work, since I have no index or query operations running at present.

Thank you very much for your help!

On Fri, May 24, 2013 at 3:30 PM, simonw <simon.w...@elasticsearch.com> wrote:

Ben, can you please get me the output of hot threads? I really want to
track this down. Can you also tell me which plugins you have installed?

There must be a way to figure out what is going on on your side. Please
give us some insight.

simon

On Friday, May 24, 2013 7:43:55 PM UTC+2, Ben Siemon wrote:

Unfortunately I am unable to change the Java version we use on any tier
without a great deal of hassle. We don't see this behavior on 0.20 on the
same machine configurations.

Are there changes in 0.90 with respect to the generic thread pool that would
be sensitive to VM changes?

It seems like downgrading to 0.20 is the only option for me.

On Fri, May 24, 2013 at 4:26 AM, simonw <simon.w...@elasticsearch.com> wrote:

You are running on a very old Java version (1.6.0_20) that is liable to
break a lot of things. Can you please update to a newer VM and try to
reproduce the error? The outputs Roy mentioned might still be very helpful.

thanks

On Friday, May 24, 2013 3:18:06 AM UTC+2, Roy Russo wrote:

Forgot to mention... you may want to also look at active threads:
http://www.elasticsearch.org/guide/reference/api/admin-cluster-nodes-hot-threads/

simonw_2 · May 25, 2013, 6:35am

OK, thanks man!
I have to ask you for more info though, especially about the main differences
between the vanilla config and the config that is started via Puppet. Can
you share them?

simon
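
One quick way to surface such differences is to diff the two settings files key by key. The following is only a rough sketch in Python. It assumes both files use flat `key: value` lines (for example `cluster.name: elasticsearch`) rather than nested YAML, and the helper names `load_flat_settings` and `diff_settings` are hypothetical.

```python
def load_flat_settings(text):
    """Parse flat `key: value` lines into a dict.

    Assumption: settings are written one per line with dotted keys.
    Nested YAML blocks are not handled, and a '#' inside a value would
    be treated as the start of a comment.
    """
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        settings[key.strip()] = value.strip()
    return settings


def diff_settings(first, second):
    """Return {key: (value_in_first, value_in_second)} for keys that differ.

    A key missing from one side shows up with None on that side.
    """
    return {
        key: (first.get(key), second.get(key))
        for key in sorted(set(first) | set(second))
        if first.get(key) != second.get(key)
    }
```

Calling `diff_settings(load_flat_settings(vanilla_text), load_flat_settings(puppet_text))` lists only the settings that differ, which is usually shorter to share than both files.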

simonw_2 · May 25, 2013, 12:18pm

I took a closer look at the thread dump, and it seems that this node is
elected as the master and other nodes connect to it, but once the master
wants to connect back it can't, and it blocks on the connect(node) call on the
transport layer. Can you provide some more info regarding your cluster? Do
you have firewall issues somehow, or do you start and connect to a cluster
from a different version? Somehow we need to make sure that this doesn't
take down a node, but I'd like to know what causes this.

simon
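
To illustrate why blocked calls make the thread count climb, here is a small Python analogy. It is not Elasticsearch code. It assumes a pool that starts a new thread whenever no worker is idle, which is how an unbounded cached pool behaves, and it uses an event that is never set to stand in for the connect call that cannot complete. All names are made up for the example.

```python
import threading


def threads_held_by_blocked_tasks(task_count):
    """Start one thread per task, as an unbounded pool does when every
    existing worker is busy, and report how many are still occupied."""
    connect_done = threading.Event()   # not set while we count: the "connect" hangs
    started = threading.Semaphore(0)
    workers = []

    def connect_back():
        started.release()              # signal that this worker is running
        connect_done.wait()            # block, like a connect that never completes

    for _ in range(task_count):
        worker = threading.Thread(target=connect_back, daemon=True)
        worker.start()
        workers.append(worker)

    for _ in range(task_count):
        started.acquire()              # wait until every worker has begun

    occupied = sum(1 for worker in workers if worker.is_alive())

    connect_done.set()                 # unblock everything so the demo can exit
    for worker in workers:
        worker.join()
    return occupied
```

Every submitted task ends up holding its own thread, so if something keeps submitting such tasks, the pool has no upper bound until thread creation itself fails.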

Ben_Siemon · May 25, 2013, 4:03pm

I made the configurations the same apart from the cluster and node names. The
Puppet-managed one is started by init.d, while the vanilla one is started by
me in the shell via `bin/elasticsearch -f`. Those are the only differences.

It is very strange that other nodes are trying to connect to it. This node
is the only one we have set up in the dev environment.

It might be that nodes we have running for production clusters on 0.20 are
somehow sending traffic to this node. If the prod 0.20 cluster is still
named 'elasticsearch', then its nodes might try to bring this node into the
cluster.

I have attached the configuration running on the faulty node. I don't have
access to any of the prod machines, so if that data is needed it will have to
wait until after the short holiday.

Do you have any suggestions for which logs to turn up to debug/trace so I
can see the incoming connections being logged? I imagine the answer is in
there.

On Sat, May 25, 2013 at 8:18 AM, simonw
simon.willnauer@elasticsearch.comwrote:

I took a closer look at the threaddump and it seems that this node is
elected as the master and other nodes connect to it but once the master
wants to connect back it can't and blocks on connect(node) call on the
transport layer. Can you provide some more info regarding your cluster, do
you have firewall issues somehow or do you start and connect to a cluster
from a different version... Somehow we need to make sure that this doens't
take down a node but I'd wanna know what causes this.

simon

On Saturday, May 25, 2013 8:35:20 AM UTC+2, simonw wrote:
ok thanks man!
I have to ask you for more infos though... especially the main
differences between the vanilla config and the config that is started via
puppet, can you share it?

simon

On Friday, May 24, 2013 11:51:33 PM UTC+2, Ben Siemon wrote:
thread dump attached. It is a little long from all the active threads. I
figured I would do the dump after it had been running for a while to get as
much detail as possible.

On Fri, May 24, 2013 at 5:03 PM, simonw simon.w...@elasticsearch.comwrote:
this looks pretty much ok to me. Yet, what would be interesting to me
is to see a thread dump of the node in question. Given that a vanilla
started 0.90 node doesn't have the problem is a good thing IMO. Lets track
this down further. Can you produce a thread dump using `jstack

threaddumps.log`?
while this is a blind shot, I guess that something blocks and creates
new threads all the time so I am wondering what they wait on and the thread
dump should tell us.

thanks,

simon

On Friday, May 24, 2013 9:44:02 PM UTC+2, Ben Siemon wrote:
2.0% (10ms out of 500ms) cpu usage by thread 'elasticsearch[apdv001.va.opower.it][http_server_worker][T#9]{New I/O worker #107}'
 10/10 snapshots sharing following 15 elements
   sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
   sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
   sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
   sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
   sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
   org.elasticsearch.common.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:64)
   org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:409)
   org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:206)
   org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88)
   org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
   org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
   org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
   java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   java.lang.Thread.run(Thread.java:619)

0.0% (0s out of 500ms) cpu usage by thread 'Reference Handler'
 10/10 snapshots sharing following 3 elements
   java.lang.Object.wait(Native Method)
   java.lang.Object.wait(Object.java:485)
   java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)

0.0% (0s out of 500ms) cpu usage by thread 'Finalizer'
 10/10 snapshots sharing following 4 elements
   java.lang.Object.wait(Native Method)
   java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
   java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
   java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

This is ~7 minutes after a restart. The thread count has already
increased from 162 to 226.

I have bigdesk and head installed for plugins.

As part of my own investigation I downloaded and installed 0.90.0 to
my home directory on this dev tier app server. Then I ran elasticsearch -f
to see if it would exhibit the same behavior. Strangely it did not. This
seems to rule out any java version problems. We installed the instance with
the thread growth problem via puppet and the provided rpm. I am going to
continue to look at what the differences are.

What is used by the generic thread pool within the ES daemon? This
'generic' pool is the one that is growing and the only one that is even
getting work since I have no index or query operations running presently.

Thank you very much for your help!

On Fri, May 24, 2013 at 3:30 PM, simonw <simon.w...@elasticsearch.com> wrote:

Ben, can you please get me the output of hot threads? I really want
to track this down. Can you tell what kind of plugins you have installed?

there must be a way to figure out what is going on on your side,
please provide us some insight.

simon

On Friday, May 24, 2013 7:43:55 PM UTC+2, Ben Siemon wrote:

Unfortunately I am unable to change the Java version we use on any
tier without a great deal of hassle. We don't see this behavior on 0.20 on
the same machine configurations.

Are there changes in 0.90 with respect to the generic thread pool that
would be sensitive to VM changes?

It seems like downgrading to 0.20 is the only option for me.

On Fri, May 24, 2013 at 4:26 AM, simonw <simon.w...@elasticsearch.com> wrote:

You are running on a very old Java version (1.6.0_20) that is liable to
break a lot of things. Can you please update to a newer VM and try to
reproduce the error? The outputs Roy mentioned might still be very helpful.

thanks

On Friday, May 24, 2013 3:18:06 AM UTC+2, Roy Russo wrote:

Forgot to mention... you may want to also look at active threads:
http://www.elasticsearch.org/guide/reference/api/admin-cluster-nodes-hot-threads/
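
For reference, the hot threads output can also be fetched from a script. This is only a minimal sketch: the `/_nodes/hot_threads` path and the `threads` and `interval` parameters are assumptions about the hot threads API that should be checked against the reference page for the version in use, and `hot_threads_url` and `fetch_hot_threads` are hypothetical helper names.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def hot_threads_url(host="localhost", port=9200, threads=3, interval="500ms"):
    """Build the hot threads URL (path and parameter names are assumptions)."""
    query = urlencode({"threads": threads, "interval": interval})
    return "http://%s:%d/_nodes/hot_threads?%s" % (host, port, query)


def fetch_hot_threads(**kwargs):
    """Return the plain-text hot threads report from a running node."""
    with urlopen(hot_threads_url(**kwargs), timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_hot_threads()` a few times a minute while the thread count climbs would give a series of reports to compare.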

--
Ben Siemon
Senior Software Engineer, Engineering
Opower http://www.opower.com

We’re hiring! See jobs here http://www.opower.com/careers.


simonw_2 · May 25, 2013, 4:48pm

interesting, any chance you can change the cluster name and check if it
still happens?

simon

On Saturday, May 25, 2013 6:03:53 PM UTC+2, Ben Siemon wrote:

I made the configurations the same apart from the cluster and node names.
The puppet-managed one is started by init.d, while the vanilla one is started
by me in the shell via bin/elasticsearch -f. Those are the only differences.

It is very strange that other nodes are trying to connect to it. This node
is the only one we have set up in the dev environment.

It might be that nodes we have running for production clusters on 0.20 are
somehow sending traffic to this node. If the prod 0.20 cluster is still
named 'elasticsearch' then they might try to bring this node into the
cluster.

I have attached the configuration running on the faulty node. I don't have
access to any of the prod machines, so if that data is needed it will have to
wait until after the short holiday.

Do you have any suggestions for which logs to turn up to debug/trace so I
can see the incoming connections being logged? I imagine the answer is in
there.

On Sat, May 25, 2013 at 8:18 AM, simonw <simon.w...@elasticsearch.com> wrote:
I took a closer look at the thread dump, and it seems that this node is
elected as the master and other nodes connect to it, but once the master
wants to connect back it can't, and it blocks on the connect(node) call in
the transport layer. Can you provide some more info regarding your cluster?
Do you have firewall issues, or do you start and connect to a cluster
from a different version? Somehow we need to make sure that this doesn't
take down a node, but I'd like to know what causes this.

simon

On Saturday, May 25, 2013 8:35:20 AM UTC+2, simonw wrote:
ok thanks man!
I have to ask you for more infos though... especially the main
differences between the vanilla config and the config that is started via
puppet, can you share it?

simon

On Friday, May 24, 2013 11:51:33 PM UTC+2, Ben Siemon wrote:
thread dump attached. It is a little long from all the active threads.
I figured I would do the dump after it had been running for a while to get
as much detail as possible.


Ivan · May 28, 2013, 2:17pm

So there is another elasticsearch cluster on the same network? If you are
using multicast discovery, try using unicast discovery to reduce chatter
between nodes that should not be forming a cluster.

--
Ivan

On Sat, May 25, 2013 at 9:03 AM, Ben Siemon ben.siemon@opower.com wrote:

It might be that nodes we have running for production clusters on 0.20 are
somehow sending traffic to this node. If the prod 0.20 cluster is still
named 'elasticsearch' then they might try to bring this node into the
cluster.


Ben_Siemon · May 28, 2013, 2:37pm

I have investigated this a little with our sys ops team. There are two
clusters with 'elasticsearch' as the name but they both have multicast
off and are on separate/disjoint network segments in the datacenter. I am
going to do some investigation with tcpdump to see where the traffic is
coming from.

Thanks for the help everyone! I will update this thread with the root cause
when I find it.


Ben_Siemon · May 28, 2013, 6:40pm

Root cause summary:

The network was misconfigured such that a client can connect to a node on
port 9300 (from the client, telnet node-ip 9300 works) but the node cannot
make the reverse connection (from the node, telnet client-ip 9300 does not
work). This results in the following exception on the node. I surmise that
the thread which throws this exception is not properly reclaimed.

[2013-05-28 14:07:47,624][TRACE][transport.netty ] [ben.siemon.home.dir] connect exception caught on transport layer [[id: 0x50d87fef]]
org.elasticsearch.common.netty.channel.ConnectTimeoutException: connection timed out: /10.20.64.133:9300
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)

10.20.64.133 is the client in this example. We see the timeout occur as the
node attempts to connect back to the client.

Approximate timeline:

1. Client connects to node and attempts to join the cluster (success).
2. Node attempts to create a new TCP connection to the client (timeout).
3. The thread used to connect to the client in step 2 is not reclaimed.

The client is using the NodeClient.
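
The two telnet checks above can be scripted so that each direction is tested the same way. This is a minimal sketch, assuming plain TCP reachability is all that needs checking; `can_connect` is a hypothetical helper, and 9300 is the transport port mentioned above.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (including timeouts) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running `can_connect(node_ip, 9300)` on the client and `can_connect(client_ip, 9300)` on the node covers both directions; in the misconfigured network described here, the first check would succeed and the second would fail.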


simonw_2 · May 29, 2013, 11:36am

I think by default the timeout is very high like 15 min or so, are you sure
it's not reclaimed or does it just take forever?

and thanks for clarifying the issue!

simon


Ben_Siemon · May 29, 2013, 4:03pm

The exception happens very quickly after the node starts up and the client
begins to connect. As the logs below show, the cluster starts up around
13:50:07 and the first timeout happens at 13:50:45.

Even after I fix the bad network configuration, the thread count never goes
back down on the node experiencing timeouts.

Even if it just takes forever (15+ minutes) to reclaim these threads, this
is still a denial-of-service vulnerability and a potential headache in the
event of some weird network partition.
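
Why blocked connect calls can inflate a cached pool can be illustrated with a toy model. The sketch below is not Elasticsearch code: `ToyCachedPool` is a hypothetical class that reuses an idle worker when one exists and otherwise starts a new thread, and it omits the idle-thread expiry that real cached executors have. Each task stands in for a connect call that does not return.

```python
import queue
import threading


class ToyCachedPool:
    """Toy 'cached' pool: reuse an idle worker if possible, else start a new thread."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._lock = threading.Lock()
        self._idle = 0      # workers currently waiting for a task
        self.threads = 0    # total workers ever started (none are retired here)

    def submit(self, task):
        with self._lock:
            if self._idle > 0:
                self._idle -= 1          # reserve an idle worker for this task
            else:
                self.threads += 1        # nobody is free, so the pool grows
                threading.Thread(target=self._worker, daemon=True).start()
        self._tasks.put(task)

    def _worker(self):
        while True:
            task = self._tasks.get()
            task()
            with self._lock:
                self._idle += 1          # only a finished task frees the worker


release = threading.Event()
pool = ToyCachedPool()
for _ in range(10):
    pool.submit(release.wait)            # stands in for a connect that hangs
print(pool.threads)                      # 10: every worker is parked, none reusable
release.set()                            # let the parked workers finish
```

In this model the thread count tracks the number of calls still blocked, so a steady stream of connect attempts that hang would make the pool grow without bound.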

[2013-05-28 13:50:07,674][INFO ][node ] [ben.siemon.home.dir] {0.90.0}[24486]: initializing ...
[2013-05-28 13:50:07,674][DEBUG][node ] [ben.siemon.home.dir] using home [/home/ben.siemon/elasticsearch-0.90.0], config [/home/ben.siemon/elasticsearch-0.90.0/config], data [[/home/ben.siemon/elasticsearch-0.90.0/data]], logs [/home/ben.siemon/elasticsearch-0.90.0/logs], work [/home/ben.siemon/elasticsearch-0.90.0/work], plugins [/home/ben.siemon/elasticsearch-0.90.0/plugins]
[2013-05-28 13:50:07,684][TRACE][plugins ] [ben.siemon.home.dir] --- adding plugin [/home/ben.siemon/elasticsearch-0.90.0/plugins/bigdesk]
[2013-05-28 13:50:07,690][INFO ][plugins ] [ben.siemon.home.dir] loaded , sites [bigdesk]
.
.
.

                             client                                node

[2013-05-28 13:50:24,916][TRACE][transport.netty ] [ben.siemon.home.dir] channel opened: [id: 0x03584630, /10.20.64.133:59956 => /10.20.64.135:9300]
[2013-05-28 13:50:27,917][TRACE][transport.netty ] [ben.siemon.home.dir] channel closed: [id: 0x03584630, /10.20.64.133:59956 => /10.20.64.135:9300]
[2013-05-28 13:50:30,919][TRACE][transport.netty ] [ben.siemon.home.dir] channel opened: [id: 0x01ecfd4b, /10.20.64.133:59959 => /10.20.64.135:9300]
[2013-05-28 13:50:33,920][TRACE][transport.netty ] [ben.siemon.home.dir] channel closed: [id: 0x01ecfd4b, /10.20.64.133:59959 => /10.20.64.135:9300]
[2013-05-28 13:50:36,924][TRACE][transport.netty ] [ben.siemon.home.dir] channel opened: [id: 0x1cb22d16, /10.20.64.133:59977 => /10.20.64.135:9300]
[2013-05-28 13:50:39,926][TRACE][transport.netty ] [ben.siemon.home.dir] channel closed: [id: 0x1cb22d16, /10.20.64.133:59977 => /10.20.64.135:9300]
[2013-05-28 13:50:42,790][TRACE][http.netty ] [ben.siemon.home.dir] channel opened: [id: 0x8d06cb4e, /10.1.10.142:58530 => /10.20.64.135:9200]
[2013-05-28 13:50:42,934][TRACE][transport.netty ] [ben.siemon.home.dir] channel opened: [id: 0x8afb3fe6, /10.20.64.133:60000 => /10.20.64.135:9300]
[2013-05-28 13:50:45,926][TRACE][transport.netty ] [ben.siemon.home.dir] connect exception caught on transport layer [[id: 0x01490020]]
org.elasticsearch.common.netty.channel.ConnectTimeoutException: connection timed out: /10.20.64.133:9300 (node unable to connect back to client)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)

On Wed, May 29, 2013 at 7:36 AM, simonw simon.willnauer@elasticsearch.com wrote:

I think the default timeout is very high, around 15 minutes. Are you sure
the threads are not reclaimed, or does it just take forever?

and thanks for clarifying the issue!

simon

On Tuesday, May 28, 2013 8:40:38 PM UTC+2, Ben Siemon wrote:

Root cause summary:

The network is misconfigured so that a client can connect to a node on port
9300 (from the client, telnet node-ip 9300 works) but the node cannot make
the reverse connection (from the node, telnet client-ip 9300 does not work).
This results in the following exception on the node. I surmise that the
thread which throws this exception is not properly reclaimed.

[2013-05-28 14:07:47,624][TRACE][transport.netty ] [ben.siemon.home.dir] connect exception caught on transport layer [[id: 0x50d87fef]]
org.elasticsearch.common.netty.channel.ConnectTimeoutException: connection timed out: /10.20.64.133:9300
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)

10.20.64.133 is the client in this example. We see the timeout occur as
the node attempts to connect back to the client.
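
The telnet checks described above can also be scripted. This is a minimal sketch using only java.net; the class name ReachabilityCheck, the method canConnect, and the 3 second timeout are illustrative choices, not anything Elasticsearch provides. To reproduce the diagnosis it would need to be run in both directions: from the client against the node, and from the node against the client.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ReachabilityCheck {

    // Returns true if a TCP connection to host:port is established within timeoutMillis.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers connection refused, unknown host, and connect timeouts.
            return false;
        } finally {
            try {
                socket.close();
            } catch (IOException ignored) {
                // Nothing useful to do if close fails.
            }
        }
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("usage: java ReachabilityCheck <host> [port]");
            return;
        }
        String host = args[0];
        // 9300 is the transport port used in this thread.
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9300;
        System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 3000));
    }
}
```

A result of true from the client side and false from the node side would match the asymmetric setup described in the root cause summary.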

Approximate timeline:

1. Client connects to Node and attempts to join cluster. (success)
2. Node attempts to create a new tcp connection to client (timeout).
3. Thread used to connect to client in step 2 is not reclaimed.

The client is using the NodeClient.
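
As a rough illustration of why blocked work inflates the thread count in an unbounded cached pool (the generic pool was reported as cached earlier in this thread), here is a small sketch. It is not Elasticsearch code and does not reproduce the bug: the class CachedPoolGrowth and the method threadsWhileBlocked are hypothetical, and the latch merely stands in for connection attempts that are waiting on a timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CachedPoolGrowth {

    // Submits `tasks` jobs that all block and returns how many threads
    // the pool had to create to run them concurrently.
    static int threadsWhileBlocked(int tasks) {
        // Same shape as Executors.newCachedThreadPool(): no core threads,
        // no upper bound, idle threads retired after a 60 second keep-alive.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS,
                new SynchronousQueue<Runnable>());
        final CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(new Runnable() {
                    public void run() {
                        try {
                            release.await(); // stand-in for a slow connect attempt
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    }
                });
            }
            // No idle thread was available for any task, so each one got its own thread.
            return pool.getPoolSize();
        } finally {
            release.countDown(); // unblock every task
            pool.shutdown();     // let the threads exit once their tasks finish
        }
    }

    public static void main(String[] args) {
        System.out.println("threads created for 50 blocked tasks: " + threadsWhileBlocked(50));
    }
}
```

In a pool like this, threads only go away after their task returns and the keep-alive expires, so a steady stream of slow connection attempts can keep the count climbing even if every individual thread is eventually reclaimed.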

On Tue, May 28, 2013 at 10:37 AM, Ben Siemon ben.s...@opower.com wrote:

I have investigated this a little with our sys ops team. There are two
clusters with 'elasticsearch' as the name but they both have multicast
off and are on separate/disjoint network segments in the datacenter. I am
going to do some investigation with tcpdump to see where the traffic is
coming from.

Thanks for the help everyone! I will update this thread with the root
cause when I find it.

On Tue, May 28, 2013 at 10:17 AM, Ivan Brusic iv...@brusic.com wrote:

So there is another elasticsearch cluster on the same network? If you
are using multicast discovery, try using unicast discovery to reduce
chatter between nodes that should not be forming a cluster.

--
Ivan

On Sat, May 25, 2013 at 9:03 AM, Ben Siemon ben.s...@opower.com wrote:

It might be that nodes we have running for production clusters on 0.20
are somehow sending traffic to this node. If the prod 0.20 cluster is still
named 'elasticsearch' then they might try to bring this node into the
cluster.

--
Ben Siemon
Senior Software Engineer, Engineering
Opower http://www.opower.com

We’re hiring! See jobs here http://www.opower.com/careers.

