The exception happens very quickly after the node starts up and the client
begins to connect. As seen in the logs below, the cluster starts up around
13:50:07 and the first timeout happens at 13:50:45.
Even after I fix the bad network configuration, the thread count never goes
back down on the node experiencing timeouts.
Even if it merely takes a very long time (15+ minutes) to reclaim these
threads, this is still a denial-of-service vulnerability and a potential
headache in the event of some odd network partition.
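A rough way to watch the leak is to count netty's I/O threads in the node's
JVM (the pid is a placeholder and the exact thread-name pattern may vary by
version):

    # netty 3 renames its connect/worker threads to 'New I/O client boss #N',
    # 'New I/O worker #N', etc.; the count should plateau, not keep growing
    jstack <es-pid> | grep -c 'New I/O'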
[2013-05-28 13:50:07,674][INFO ][node] [ben.siemon.home.dir] {0.90.0}[24486]: initializing ...
[2013-05-28 13:50:07,674][DEBUG][node] [ben.siemon.home.dir] using home [/home/ben.siemon/elasticsearch-0.90.0], config [/home/ben.siemon/elasticsearch-0.90.0/config], data [[/home/ben.siemon/elasticsearch-0.90.0/data]], logs [/home/ben.siemon/elasticsearch-0.90.0/logs], work [/home/ben.siemon/elasticsearch-0.90.0/work], plugins [/home/ben.siemon/elasticsearch-0.90.0/plugins]
[2013-05-28 13:50:07,684][TRACE][plugins] [ben.siemon.home.dir] --- adding plugin [/home/ben.siemon/elasticsearch-0.90.0/plugins/bigdesk]
[2013-05-28 13:50:07,690][INFO ][plugins] [ben.siemon.home.dir] loaded [], sites [bigdesk]
...

client node connecting:
[2013-05-28 13:50:24,916][TRACE][transport.netty] [ben.siemon.home.dir] channel opened: [id: 0x03584630, /10.20.64.133:59956 => /10.20.64.135:9300]
[2013-05-28 13:50:27,917][TRACE][transport.netty] [ben.siemon.home.dir] channel closed: [id: 0x03584630, /10.20.64.133:59956 => /10.20.64.135:9300]
[2013-05-28 13:50:30,919][TRACE][transport.netty] [ben.siemon.home.dir] channel opened: [id: 0x01ecfd4b, /10.20.64.133:59959 => /10.20.64.135:9300]
[2013-05-28 13:50:33,920][TRACE][transport.netty] [ben.siemon.home.dir] channel closed: [id: 0x01ecfd4b, /10.20.64.133:59959 => /10.20.64.135:9300]
[2013-05-28 13:50:36,924][TRACE][transport.netty] [ben.siemon.home.dir] channel opened: [id: 0x1cb22d16, /10.20.64.133:59977 => /10.20.64.135:9300]
[2013-05-28 13:50:39,926][TRACE][transport.netty] [ben.siemon.home.dir] channel closed: [id: 0x1cb22d16, /10.20.64.133:59977 => /10.20.64.135:9300]
[2013-05-28 13:50:42,790][TRACE][http.netty] [ben.siemon.home.dir] channel opened: [id: 0x8d06cb4e, /10.1.10.142:58530 => /10.20.64.135:9200]
[2013-05-28 13:50:42,934][TRACE][transport.netty] [ben.siemon.home.dir] channel opened: [id: 0x8afb3fe6, /10.20.64.133:60000 => /10.20.64.135:9300]
[2013-05-28 13:50:45,926][TRACE][transport.netty] [ben.siemon.home.dir] connect exception caught on transport layer [[id: 0x01490020]]
org.elasticsearch.common.netty.channel.ConnectTimeoutException: connection timed out: /10.20.64.133:9300 (node unable to connect back to client)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
On Wed, May 29, 2013 at 7:36 AM, simonw <simon.willnauer@elasticsearch.com> wrote:

I think by default the timeout is very high, like 15 min or so. Are you
sure it's not reclaimed, or does it just take forever?
And thanks for clarifying the issue!
simon
On Tuesday, May 28, 2013 8:40:38 PM UTC+2, Ben Siemon wrote:

Root cause summary:
In a misconfigured network, a client can connect to a node on port 9300
(from the client, telnet node-ip 9300 works), but the node cannot make the
reverse connection (from the node, telnet client-ip 9300 does not work).
This results in the following exception on the node. I surmise that the
thread that throws this exception is not properly reclaimed.
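The node-side failure can be simulated outside Elasticsearch with a plain
blocking connect. A minimal sketch, using java.net rather than netty (the
address and the 30-second timeout are assumptions):

    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    public class ConnectBackCheck {
        public static void main(String[] args) throws Exception {
            // Simulates the node dialing back to the client's transport port.
            // 10.20.64.133:9300 stands in for the client; adjust as needed.
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress("10.20.64.133", 9300), 30000);
                System.out.println("reverse connection works");
            } catch (SocketTimeoutException e) {
                // Same condition netty surfaces as ConnectTimeoutException.
                System.out.println("reverse connection timed out: " + e.getMessage());
            }
        }
    }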
[2013-05-28 14:07:47,624][TRACE][transport.netty] [ben.siemon.home.dir] connect exception caught on transport layer [[id: 0x50d87fef]]
org.elasticsearch.common.netty.channel.ConnectTimeoutException: connection timed out: /10.20.64.133:9300
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
10.20.64.133 is the client in this example. We see the timeout occur as
the node attempts to connect back to the client.
Approximate timeline:
- Client connects to the node and attempts to join the cluster (success).
- Node attempts to open a new TCP connection back to the client (timeout).
- The thread used to connect to the client in step 2 is not reclaimed.
The client is using the NodeClient.
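For context on why the reverse connection matters: a node client joins the
cluster as a regular (non-data) member, so every other node opens its own
transport connections back to it. A minimal sketch against the 0.90 API
(the cluster name here is an assumption):

    import org.elasticsearch.client.Client;
    import org.elasticsearch.node.Node;
    import org.elasticsearch.node.NodeBuilder;

    public class NodeClientExample {
        public static void main(String[] args) {
            // client(true): join the cluster but hold no data. Joining is
            // what prompts the data nodes to dial back to this JVM on 9300.
            Node node = NodeBuilder.nodeBuilder()
                    .clusterName("elasticsearch") // assumed cluster name
                    .client(true)
                    .node();                      // builds AND starts the node
            Client client = node.client();
            try {
                // any request works once the cluster has accepted us
                System.out.println(client.admin().cluster().prepareHealth()
                        .execute().actionGet().getStatus());
            } finally {
                node.close();
            }
        }
    }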
On Tue, May 28, 2013 at 10:37 AM, Ben Siemon ben.s...@opower.com wrote:
I have investigated this a little with our sysops team. There are two
clusters with 'elasticsearch' as the name, but they both have multicast
off and are on separate/disjoint network segments in the datacenter. I am
going to do some investigation with tcpdump to see where the traffic is
coming from.
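For example, something like this on the node should show where the 9300
traffic originates (the interface name is an assumption):

    # watch transport-port traffic without DNS/port name resolution
    tcpdump -i eth0 -nn tcp port 9300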
Thanks for the help everyone! I will update this thread with the root
cause when I find it.
On Tue, May 28, 2013 at 10:17 AM, Ivan Brusic iv...@brusic.com wrote:
So there is another elasticsearch cluster on the same network? If you
are using multicast discovery, try using unicast discovery to reduce
chatter between nodes that should not be forming a cluster.
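A sketch of the relevant elasticsearch.yml settings (the host list is an
assumption):

    # turn off multicast and enumerate the cluster's own nodes explicitly
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["10.20.64.135", "10.20.64.136"]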
--
Ivan
On Sat, May 25, 2013 at 9:03 AM, Ben Siemon ben.s...@opower.com wrote:

It might be that nodes from the production clusters we have running on
0.20 are somehow sending traffic to this node. If the prod 0.20 cluster is
still named 'elasticsearch' then those nodes might try to bring this node
into the cluster.
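Giving the 0.90 test node a distinct name in its elasticsearch.yml would
rule that out (the name itself is an assumption):

    cluster.name: ben-siemon-dev-090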