Any clues about transport connection issues on AWS HVM instances?

Hi Elasticsearch list :)

I'm having some trouble while running Elasticsearch on r3.large (HVM
virtualization) instances in AWS. The short story is that, as soon as I put
any significant load on them, some requests take a very long time (for
example, Indices Stats) and I see disconnected/timeout errors in the logs.
Has anyone else experienced similar things, or does anyone have ideas for a
solution other than avoiding HVM instances?

More detailed symptoms:

  • if there's very little load on them (say, 2GB of data on each node, a few
    queries and indexing operations), all is well
  • by "significant load", I mean some 10GB of data, a few queries per
    minute, 100 docs indexed per second (4K per doc, <10 fields). By no means
    "overload", CPU rarely tops 20%, no significant GC, nothing suspicious in
    any of the metrics SPM http://sematext.com/spm/ collects. The only clue
    is that, for the time the problem appears, we get heartbeat alerts because
    requests to the stats APIs take too long
  • by "some requests take very long time", I mean that some queries take
    miliseconds (as I would expect them), and some take 10 minutes or so.
    Eventually succeeding (at least this was the case for the manual requests
    I've sent)
  • sometimes, nodes get temporarily dropped from the cluster, but then
    things quickly come back to green. However, sometimes shards got stuck
    while relocating
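
For reference, this is roughly how I've been timing the manual requests (the
host name is just an example):

time curl -s 'http://es01:9200/_stats' > /dev/null
time curl -s 'http://es01:9200/_search?q=*:*' > /dev/null

Most runs come back in milliseconds; every now and then one hangs for about 10
minutes before eventually succeeding.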

Things I've tried:

  • different ES versions and machine sizes: the same problem appears on
    0.90.7 with r3.xlarge instances; I'm currently on 1.1.1 with r3.large
  • tore down all the machines, launched new ones, and redeployed. Same
    thing
  • different JVM (1.7) versions: Oracle u25, u45, u55, u60, and OpenJDK u51.
    Same thing everywhere
  • spawned the same number of m3.large machines (same specs as r3.large,
    except half the RAM, and paravirtual instead of HVM). The problem
    magically went away with the same data and load

Here are some Node Disconnected exceptions:
[2014-06-18 13:05:35,058][WARN ][search.action ] [es01] Failed to send release search context
org.elasticsearch.transport.NodeDisconnectedException: [es02][inet[/10.140.1.84:9300]][search/freeContext] disconnected
[2014-06-18 13:05:35,058][DEBUG][action.admin.indices.stats] [es01] [83f0223f-4222-4a57-a918-ff424924f002_2014-05-20][1], node[oOlO-iewR3qnAuQkT28vfw], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@3339f285]
org.elasticsearch.transport.NodeDisconnectedException: [es02][inet[/10.140.1.84:9300]][indices/stats/s] disconnected

I've enabled TRACE logging on both transport and discovery, and all I see are
connection timeouts and exceptions.
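
In case it matters how the levels were raised: if I remember correctly, logger
levels can be bumped at runtime through the cluster settings API, which is
roughly what I did (the same levels can also go into logging.yml on each node;
the host name is again just an example):

curl -XPUT 'http://es01:9200/_cluster/settings' -d '{
  "transient": {
    "logger.transport": "TRACE",
    "logger.discovery": "TRACE"
  }
}'

A typical example of what shows up: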

[2014-06-16 07:29:19,039][TRACE][transport.netty ] [es01] close connection exception caught on transport layer [[id: 0x190d8444]], disconnecting from relevant node

Or, more verbose:

[2014-06-16 07:29:19,060][TRACE][transport.netty ] [es01] connect exception caught on transport layer [[id: 0x6816c0fe]]
org.elasticsearch.common.netty.channel.ConnectTimeoutException: connection timed out: es03/10.171.39.244:9300
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
[2014-06-16 07:29:19,060][TRACE][discovery.zen.ping.unicast] [es01] [1] failed to connect to [#zen_unicast_7#][es01][inet[es04/10.79.155.249:9300]]
org.elasticsearch.transport.ConnectTransportException: [][inet[es04/10.79.155.249:9300]] connect_timeout[30s]
    at org.elasticsearch.transport.netty.NettyTransport.connectToChannelsLight(NettyTransport.java:683)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:643)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNodeLight(NettyTransport.java:610)
    at org.elasticsearch.transport.TransportService.connectToNodeLight(TransportService.java:133)
    at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:279)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.common.netty.channel.ConnectTimeoutException: connection timed out: es03/10.171.39.244:9300
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    ... 3 more
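
For what it's worth, one thing I still want to rule out is the network layer
itself: the connect_timeout[30s] above is the transport-level TCP connect
timing out, so next time it happens I plan to check whether a plain TCP
connection to the transport port also stalls, with something like (from es01):

time nc -zv 10.171.39.244 9300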

I'd appreciate any information, pointers, or intuition you may have!

Thanks and best regards,
Radu

Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
