Any clues about transport connection issues on AWS HVM instances?

Hi Elasticsearch list :)

I'm having some trouble while running Elasticsearch on r3.large (HVM
virtualization) instances in AWS. The short story is that, as soon as I put
any significant load on them, some requests take a very long time (for
example, Indices Stats) and I see disconnected/timeout errors in the logs.
Has anyone else experienced similar things, or does anyone have ideas for a
solution other than avoiding HVM instances?

More detailed symptoms:

  • if there's very little load on them (say, 2GB of data on each node, a few
    queries and indexing operations), all is well
  • by "significant load", I mean some 10GB of data, a few queries per
    minute, 100 docs indexed per second (4K per doc, <10 fields). By no means
    "overload", CPU rarely tops 20%, no significant GC, nothing suspicious in
    any of the metrics SPM http://sematext.com/spm/ collects. The only clue
    is that, for the time the problem appears, we get heartbeat alerts because
    requests to the stats APIs take too long
  • by "some requests take very long time", I mean that some queries take
    miliseconds (as I would expect them), and some take 10 minutes or so.
    Eventually succeeding (at least this was the case for the manual requests
    I've sent)
  • sometimes, nodes get temporarily dropped from the cluster, but then
    things quickly come back to green. However, sometimes shards got stuck
    while relocating
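
For reference, this is roughly how I've been timing the manual requests (the
host name is just an example):

time curl -s 'http://es01:9200/_stats' > /dev/null
time curl -s 'http://es01:9200/_search?q=*:*' > /dev/null

Most runs come back in milliseconds; every now and then one hangs for about 10
minutes before eventually succeeding.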

Things I've tried:

  • different ES versions and machine sizes: the same problem appears on
    0.90.7 with r3.xlarge instances; I'm currently on 1.1.1 with r3.large
  • tore down all the machines, launched new ones, and redeployed. Same
    thing
  • different JVM (1.7) versions: Oracle u25, u45, u55, u60, and OpenJDK u51.
    Same thing everywhere
  • spawned the same number of m3.large machines (same specs as r3.large,
    except half the RAM, and paravirtual instead of HVM). The problem
    magically went away with the same data and load

Here are some Node Disconnected exceptions:
[2014-06-18 13:05:35,058][WARN ][search.action ] [es01] Failed to send release search context
org.elasticsearch.transport.NodeDisconnectedException: [es02][inet[/10.140.1.84:9300]][search/freeContext] disconnected
[2014-06-18 13:05:35,058][DEBUG][action.admin.indices.stats] [es01] [83f0223f-4222-4a57-a918-ff424924f002_2014-05-20][1], node[oOlO-iewR3qnAuQkT28vfw], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@3339f285]
org.elasticsearch.transport.NodeDisconnectedException: [es02][inet[/10.140.1.84:9300]][indices/stats/s] disconnected

I've enabled TRACE logging on both transport and discovery, and all I see are
connection timeouts and exceptions.
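
In case it matters how the levels were raised: if I remember correctly, logger
levels can be bumped at runtime through the cluster settings API, which is
roughly what I did (the same levels can also go into logging.yml on each node;
the host name is again just an example):

curl -XPUT 'http://es01:9200/_cluster/settings' -d '{
  "transient": {
    "logger.transport": "TRACE",
    "logger.discovery": "TRACE"
  }
}'

A typical example of what shows up: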

[2014-06-16 07:29:19,039][TRACE][transport.netty ] [es01] close connection exception caught on transport layer [[id: 0x190d8444]], disconnecting from relevant node

Or, more verbose:

[2014-06-16 07:29:19,060][TRACE][transport.netty ] [es01] connect exception caught on transport layer [[id: 0x6816c0fe]]
org.elasticsearch.common.netty.channel.ConnectTimeoutException: connection timed out: es03/10.171.39.244:9300
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
[2014-06-16 07:29:19,060][TRACE][discovery.zen.ping.unicast] [es01] [1] failed to connect to [#zen_unicast_7#][es01][inet[es04/10.79.155.249:9300]]
org.elasticsearch.transport.ConnectTransportException: [][inet[es04/10.79.155.249:9300]] connect_timeout[30s]
    at org.elasticsearch.transport.netty.NettyTransport.connectToChannelsLight(NettyTransport.java:683)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:643)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNodeLight(NettyTransport.java:610)
    at org.elasticsearch.transport.TransportService.connectToNodeLight(TransportService.java:133)
    at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:279)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.common.netty.channel.ConnectTimeoutException: connection timed out: es03/10.171.39.244:9300
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    ... 3 more
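
For what it's worth, one thing I still want to rule out is the network layer
itself: the connect_timeout[30s] above is the transport-level TCP connect
timing out, so next time it happens I plan to check whether a plain TCP
connection to the transport port also stalls, with something like (from es01):

time nc -zv 10.171.39.244 9300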

I'd appreciate any information, pointers, or intuition you may have!

Thanks and best regards,
Radu

Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
