Seeing Frequent NodeNotConnectedException errors


(Ranga-2) #1

Hi,

I have a cluster of 3-node cluster in EC2 - and am seeing frequent
NodeNotConnectedException related errors which cause intermittent failures
during indexing. I'm hoping some one knows what this is able and can help.

Thanks in advance for your help - Here are the details -

There are 3 nodes (es1, es2 and es3 - all are defined to be
node.master=true, node.data=true - and es1 is the current master). All
three nodes are running ES 1.4.2, 15GB heap, r3.xlarge instances, JDK
1.7.0_72. We are using the AWS-Cloud plugin for ec2 discovery. The
discovery part works fine I think and we haven't had problems there.

What we are seeing is that the cluster is running fine for most of the
time, but periodically (say once every hour or two) we seem to see failures
in the logs on es1 (the master node) with both indexing and with the node
[indices:monitor/stats] apis (these are debug messages) - and they seem to
be happening because the connection between the master node (es1) and
either of the other nodes is lost.

I tried doing searches in this mailing list and then configured tcp keep
alive settings- I think it helped but not really sure since the "node not
connected" errors are still happening.

Here is a section of the master log that shows the exceptions:

[2015-01-08 14:02:52,203][DEBUG][action.admin.indices.stats] [es1]
[alert][0], node[jAhWlTiKTASdHDQaZGVncw], [P], s[STARTED]: failed to
execute
[org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@2a694684]
org.elasticsearch.transport.NodeDisconnectedException: [es2][inet[/10.109.172.201:9300]][indices:monitor/stats[s]]
disconnected

<....deleted for brevity - Bunch of these exceptions on index stats for
each of the indexes we have....>

[2015-01-08
14:02:52,205][WARN ][action.index ] [es1] Failed to perform
indices:data/write/index on remote replica
[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[
/10.109.172.201:9300]][config][3]
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected
[2015-01-08 14:02:52,206][WARN ][cluster.action.shard
] [es1] [config][3] sending failed shard for [config][3],
node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED], indexUUID
[xnxor01lSTC8dY-0wwPXlQ], reason [Failed to perform
[indices:data/write/index] on replica, message
[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
[2015-01-08 14:02:52,206][WARN
][cluster.action.shard ] [es1] [config][3] received shard failed for
[config][3], node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED], indexUUID
[xnxor01lSTC8dY-0wwPXlQ], reason [Failed to perform
[indices:data/write/index] on replica, message
[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
....
[2015-01-08 14:02:52,206][WARN
][action.index ] [es1] Failed to perform
indices:data/write/index on remote replica
[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[
/10.109.172.201:9300]][origin_v0101][0]
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected
[2015-01-08 14:02:52,206][WARN ][cluster.action.shard
] [es1] [origin_v0101][0] sending failed shard for [origin_v0101][0],
node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED], indexUUID
[_G8gVWViS6OoX59MHJtwhA], reason [Failed to perform
[indices:data/write/index] on replica, message
[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
[2015-01-08 14:02:52,206][WARN
][cluster.action.shard ] [es1] [origin_v0101][0] received shard
failed for [origin_v0101][0], node[jAhWlTiKTASdHDQaZGVncw], [R],
s[STARTED], indexUUID [_G8gVWViS6OoX59MHJtwhA], reason [Failed to
perform [indices:data/write/index] on replica, message
[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
[2015-01-08 14:02:52,206][WARN ][action.index
] [es1] Failed to perform indices:data/write/index on remote
replica
[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[
/10.109.172.201:9300]][origin_v0101][0]
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected
[2015-01-08 14:02:52,206][WARN ][cluster.action.shard
] [es1] [origin_v0101][0] sending failed shard for [origin_v0101][0],
node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED], indexUUID
[_G8gVWViS6OoX59MHJtwhA], reason [Failed to perform
[indices:data/write/index] on replica, message
[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
[2015-01-08 14:02:52,207][WARN
][cluster.action.shard ] [es1] [origin_v0101][0] received shard
failed for [origin_v0101][0], node[jAhWlTiKTASdHDQaZGVncw], [R],
s[STARTED], indexUUID [_G8gVWViS6OoX59MHJtwhA], reason [Failed to
perform [indices:data/write/index] on replica, message
[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
....
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected
[2015-01-08 14:02:52,230][WARN
][cluster.action.shard ] [es1] [origin_v0101][0] sending failed
shard for [origin_v0101][0], node[jAhWlTiKTASdHDQaZGVncw], [R],
s[STARTED], indexUUID [_G8gVWViS6OoX59MHJtwhA], reason [Failed to
perform [indices:data/write/index] on replica, message
[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
[2015-01-08 14:02:52,230][WARN
][cluster.action.shard ] [es1] [origin_v0101][0] received shard
failed for [origin_v0101][0], node[jAhWlTiKTASdHDQaZGVncw], [R],
s[STARTED], indexUUID [_G8gVWViS6OoX59MHJtwhA], reason [Failed to
perform [indices:data/write/index] on replica, message
[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
[2015-01-08
14:02:52,230][DEBUG][action.admin.indices.stats] [es1]
[event-v1-20141227][4], node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED]:
failed to execute
[org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@d1974d5]
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:monitor/stats[s]] disconnected
[2015-01-08
14:02:52,227][WARN ][action.index ] [es1] Failed to perform
indices:data/write/index on remote replica
[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[
/10.109.172.201:9300]][origin_v0101][0]
org.elasticsearch.transport.SendRequestTransportException:
[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
at
org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:213)

at
org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnReplica(TransportShardReplicationOperationAction.java:669)

at
org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performReplicas(TransportShardReplicationOperationAction.java:641)

at
org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:512)

at
org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:419)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.NodeNotConnectedException:
[es2][inet[/10.109.172.201:9300]] Node not connected
at
org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:946)
at
org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:640)
at
org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:199)
... 7 more
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:monitor/stats[s]] disconnected
[2015-01-08
14:02:52,232][WARN ][action.index ] [es1] Failed to perform
indices:data/write/index on remote replica
[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300
]][config][3]
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected
[2015-01-08 14:02:52,232][WARN ][search.action ] [es1] Failed to
send release search context
org.elasticsearch.transport.SendRequestTransportException:

[es2][inet[/10.109.172.201:9300]][indices:data/read/search[free_context]]
at org.elasticsearch.transport.TransportService.sendRequest(
TransportService.java:213)
at org.elasticsearch.transport.TransportService.sendRequest(
TransportService.java:183)
at org.elasticsearch.search.action.SearchServiceTransportAction.
sendFreeContext(SearchServiceTransportAction.java:143)

at
org.elasticsearch.action.search.type.
TransportSearchTypeAction$BaseAsyncAction.releaseIrrelevantSearchContexts(
TransportSearchTypeAction.java:341)

at
org.elasticsearch.action.search.type.
TransportSearchQueryThenFetchAction$AsyncAction$2.run(
TransportSearchQueryThenFetchAction.java:158)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [es2][inet
[/10.109.172.201:9300]] Node not connected
at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(
NettyTransport.java:946)
at org.elasticsearch.transport.netty.NettyTransport.sendRequest(
NettyTransport.java:640)
at org.elasticsearch.transport.TransportService.sendRequest(
TransportService.java:199)
..
<..... Bunch of these failures then we get the connection and things settle
down again.......>
[2015-01-08
14:02:54,165][INFO ][cluster.service ] [es1] removed
{[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300
]],},
reason:
zen-disco-node_failed([es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet
[/10.109.172.201:9300]]),
reason transport disconnected
[2015-01-08 14:03:27,330][INFO
][cluster.service ] [es1] added
{[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300
]],},
reason: zen-disco-receive(join from
node[[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:
9300]]])

At the same time on the disconnecting node - es2 - the logs are fairly
minimal/quiet:

[2015-01-08 14:02:55,431][INFO ][discovery.ec2 ] [es2]
master_left [[es1][MJ5njwAqQ_-10imG53WDGw][ip-10-152-16-37][inet[/10.152.
16.37:9300]]], reason [do not exists on master, act as master failure]
[2015-01-08 14:02:55,431][WARN ][discovery.ec2 ] [es2] master
left (reason = do not exists on master, act as master failure), current
nodes: {[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.
172.201:9300]],[es3][XVZWtpq7Sc28Cj6C2wd42A][ip-10-79-189-47][inet[/10.79.
189.47:9300]]{master=true},}
[2015-01-08 14:02:55,432][INFO ][cluster.service ] [es2] removed {[
es1][MJ5njwAqQ_-10imG53WDGw][ip-10-152-16-37][inet[/10.152.16.37:9300]],},
reason: zen-disco-master_failed ([es1][MJ5njwAqQ_-10imG53WDGw][ip-10-152-16-
37][inet[/10.152.16.37:9300]])
[2015-01-08 14:03:25,884][INFO ][cluster.service ] [es2]
detected_master [es1][MJ5njwAqQ_-10imG53WDGw][ip-10-152-16-37][inet[/10.152.
16.37:9300]], added {[es1][MJ5njwAqQ_-10imG53WDGw][ip-10-152-16-37][inet[/
10.152.16.37:9300]],}, reason: zen-disco-receive(from master [[es1][
MJ5njwAqQ_-10imG53WDGw][ip-10-152-16-37][inet[/10.152.16.37:9300]]])

sysctl.conf changes:
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_probes = 6
net.ipv4.tcp_keepalive_intvl = 10

Here are our elasticsearch.yml config parameters:
action.disable_delete_all_indices: true
node.name: [es1 OR es2 OR es3]
path.data:
gateway.type: local
gateway.recover_after_nodes: 2
gateway.recover_after_time: 10m
gateway.expected_nodes: 3
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.timeout: 30s
discovery.zen.ping.multicast.enabled: false
cloud:
aws:
access_key:
secret_key:
discovery.type: ec2
discovery.ec2.groups: <group_name>
discovery.ec2.tag.elasticsearch: true
repositories:
s3:
bucket:
region:
base_path:
index.search.slowlog.threshold.query.warn: 10

we plan to raise this but set currently lower than the RAM of 15GB would

allow

indices.fielddata.cache.size: 4.8GB
indices.fielddata.breaker.limit: 5.5GB
http.cors.enabled: true

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/5777e0f0-f68a-4c74-93b3-f8dcbdc5d677%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Ranga-2) #2

The issue I am seeing seems similar to what was reported
at https://groups.google.com/forum/#!searchin/elasticsearch/sporadic$20/elasticsearch/jUsoUV3_mbo/nM1OtJ9tmW0J

I enabled more logging on the transport layer and I see the following
exceptions when the disconnect/reconnect happens.

[2015-01-08 17:32:18,216][TRACE][transport.netty ] [es1] close
connection exception caught on transport layer [[id: 0xc4e4b9a1,
/10.152.16.37:59038 => /10.109.172.201:9300]], disconnecting from relevant
node
java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at
org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
at
org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at
org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at
org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at
org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at
org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at
org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-01-08 17:32:18,217][DEBUG][transport.netty ] [es1]
disconnecting from
[[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300]]],
channel closed event
[2015-01-08 17:32:18,217][TRACE][transport.netty ] [es1]
disconnected from
[[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300]]],
channel closed event

.......
[2015-01-08 17:32:23,400][DEBUG][transport.netty ] [es1] connected
to node
[[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300]]]
[2015-01-08 17:32:24,546][INFO ][cluster.service ] [es1] removed
{[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300]],},
reason:
zen-disco-node_failed([es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300]]),
reason transport disconnected
[2015-01-08 17:32:26,979][DEBUG][transport.netty ] [es1]
disconnecting from
[[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300]]]
due to explicit disconnect call
[2015-01-08 17:32:26,980][TRACE][transport.netty ] [es1]
disconnected from
[[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300]]]
due to explicit disconnect call
[2015-01-08 17:32:27,075][TRACE][transport.netty ] [es1] channel
closed: [id: 0xdbb6e4b0, /10.109.172.201:45127 => /10.152.16.37:9300]
[2015-01-08 17:32:27,076][TRACE][transport.netty ] [es1] channel
closed: [id: 0x211fc69e, /10.109.172.201:45128 => /10.152.16.37:9300]
[2015-01-08 17:32:27,076][TRACE][transport.netty ] [es1] channel
closed: [id: 0x8bba710f, /10.109.172.201:45130 => /10.152.16.37:9300]
[2015-01-08 17:32:27,076][TRACE][transport.netty ] [es1] channel
closed: [id: 0x79f540c2, /10.109.172.201:45129 => /10.152.16.37:9300]
[2015-01-08 17:32:27,076][TRACE][transport.netty ] [es1] channel
closed: [id: 0xc65857d4, /10.109.172.201:45133 => /10.152.16.37:9300]
[2015-01-08 17:32:27,076][TRACE][transport.netty ] [es1] channel
closed: [id: 0xbca56b4c, /10.109.172.201:45131 => /10.152.16.37:9300]
[2015-01-08 17:32:27,076][TRACE][transport.netty ] [es1] channel
closed: [id: 0xa3b3f93f, /10.109.172.201:45132 => /10.152.16.37:9300]
[2015-01-08 17:32:27,076][TRACE][transport.netty ] [es1] channel
closed: [id: 0x1400e583, /10.109.172.201:45134 => /10.152.16.37:9300]
[2015-01-08 17:32:27,076][TRACE][transport.netty ] [es1] channel
closed: [id: 0x72d6fccf, /10.109.172.201:45138 => /10.152.16.37:9300]
[2015-01-08 17:32:27,076][TRACE][transport.netty ] [es1] channel
closed: [id: 0xfb686698, /10.109.172.201:45126 => /10.152.16.37:9300]
[2015-01-08 17:32:27,076][TRACE][transport.netty ] [es1] channel
closed: [id: 0xb1633576, /10.109.172.201:45136 => /10.152.16.37:9300]
[2015-01-08 17:32:27,076][TRACE][transport.netty ] [es1] channel
closed: [id: 0x782b9080, /10.109.172.201:45137 => /10.152.16.37:9300]
[2015-01-08 17:32:27,077][TRACE][transport.netty ] [es1] channel
closed: [id: 0x7ded7f98, /10.109.172.201:45135 => /10.152.16.37:9300]
[2015-01-08 17:32:27,301][TRACE][transport.netty ] [es1] channel
opened: [id: 0x9662aeb5, /10.109.172.201:45156 => /10.152.16.37:9300]
[2015-01-08 17:32:27,301][TRACE][transport.netty ] [es1] channel
opened: [id: 0x62567a47, /10.109.172.201:45155 => /10.152.16.37:9300]
[2015-01-08 17:32:27,301][TRACE][transport.netty ] [es1] channel
opened: [id: 0x09ad440e, /10.109.172.201:45157 => /10.152.16.37:9300]
[2015-01-08 17:32:27,302][TRACE][transport.netty ] [es1] channel
opened: [id: 0x30d6d25d, /10.109.172.201:45158 => /10.152.16.37:9300]
[2015-01-08 17:32:27,302][TRACE][transport.netty ] [es1] channel
opened: [id: 0x421bb20b, /10.109.172.201:45159 => /10.152.16.37:9300]
[2015-01-08 17:32:27,302][TRACE][transport.netty ] [es1] channel
opened: [id: 0xd83d71b5, /10.109.172.201:45160 => /10.152.16.37:9300]
[2015-01-08 17:32:27,302][TRACE][transport.netty ] [es1] channel
opened: [id: 0xde607f1d, /10.109.172.201:45162 => /10.152.16.37:9300]
[2015-01-08 17:32:27,302][TRACE][transport.netty ] [es1] channel
opened: [id: 0x161c9ae0, /10.109.172.201:45163 => /10.152.16.37:9300]
[2015-01-08 17:32:27,302][TRACE][transport.netty ] [es1] channel
opened: [id: 0x960d6ce3, /10.109.172.201:45161 => /10.152.16.37:9300]
[2015-01-08 17:32:27,302][TRACE][transport.netty ] [es1] channel
opened: [id: 0x8277c29e, /10.109.172.201:45165 => /10.152.16.37:9300]
[2015-01-08 17:32:27,303][TRACE][transport.netty ] [es1] channel
opened: [id: 0xe046cbc3, /10.109.172.201:45164 => /10.152.16.37:9300]
[2015-01-08 17:32:27,303][TRACE][transport.netty ] [es1] channel
opened: [id: 0x22233acf, /10.109.172.201:45166 => /10.152.16.37:9300]
[2015-01-08 17:32:27,303][TRACE][transport.netty ] [es1] channel
opened: [id: 0x5880bff7, /10.109.172.201:45167 => /10.152.16.37:9300]
[2015-01-08 17:32:42,422][TRACE][transport.netty ] [es1] channel
opened: [id: 0xe02aca38, /10.109.172.201:45168 => /10.152.16.37:9300]
[2015-01-08 17:32:57,535][TRACE][transport.netty ] [es1] channel
opened: [id: 0x489f219c, /10.109.172.201:45169 => /10.152.16.37:9300]
[2015-01-08 17:32:57,536][TRACE][transport.netty ] [es1] channel
closed: [id: 0xe02aca38, /10.109.172.201:45168 => /10.152.16.37:9300]
[2015-01-08 17:32:57,537][TRACE][transport.netty ] [es1] channel
closed: [id: 0x489f219c, /10.109.172.201:45169 => /10.152.16.37:9300]
[2015-01-08 17:32:57,540][DEBUG][transport.netty ] [es1]
connected to node
[[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300]]]
[2015-01-08 17:32:57,541][INFO ][cluster.service ] [es1] added
{[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300]],},
reason: zen-disco-receive(join from
node[[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300]]])

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/ca1af751-4210-42c7-8c45-e9e2df3a1c15%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Matt Wise) #3

I believe that we're seeing the same issue. We're using Ubuntu 14.04, ES
1.4.4 and the AWS plugin. We get this random failures every few weeks and
have to restart our cluster:

Caused by: org.elasticsearch.transport.NodeNotConnectedException:
[prod-flume-vpc-es-useast1-58-i-1ff961e3-flume-elasticsearch-production_vpc-useast1][inet[/10.48.49.100:9300]]
Node not connected

Note, previously this cluster ran the 0,90 release with the Zookeeper
plugin and these failures would not happen. It is a pretty large cluster
(~30 days of logs * 10 shards * 2x redundancy, with ~200-250GB of logs per
day), but I don't think the machines are over taxed.

On Thursday, January 8, 2015 at 8:08:38 AM UTC-8, Ranga wrote:

Hi,

I have a cluster of 3-node cluster in EC2 - and am seeing frequent
NodeNotConnectedException related errors which cause intermittent failures
during indexing. I'm hoping some one knows what this is able and can help.

Thanks in advance for your help - Here are the details -

There are 3 nodes (es1, es2 and es3 - all are defined to be
node.master=true, node.data=true - and es1 is the current master). All
three nodes are running ES 1.4.2, 15GB heap, r3.xlarge instances, JDK
1.7.0_72. We are using the AWS-Cloud plugin for ec2 discovery. The
discovery part works fine I think and we haven't had problems there.

What we are seeing is that the cluster is running fine for most of the
time, but periodically (say once every hour or two) we seem to see failures
in the logs on es1 (the master node) with both indexing and with the node
[indices:monitor/stats] apis (these are debug messages) - and they seem to
be happening because the connection between the master node (es1) and
either of the other nodes is lost.

I tried doing searches in this mailing list and then configured tcp keep
alive settings- I think it helped but not really sure since the "node not
connected" errors are still happening.

Here is a section of the master log that shows the exceptions:

[2015-01-08 14:02:52,203][DEBUG][action.admin.indices.stats] [es1]
[alert][0], node[jAhWlTiKTASdHDQaZGVncw], [P], s[STARTED]: failed to
execute
[org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@2a694684
]
org.elasticsearch.transport.NodeDisconnectedException: [es2][inet[/10.109.172.201:9300]][indices:monitor/stats[s]]
disconnected

<....deleted for brevity - Bunch of these exceptions on index stats for
each of the indexes we have....>

[2015-01-08
14:02:52,205][WARN ][action.index ] [es1] Failed to perform
indices:data/write/index on remote replica
[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[
/10.109.172.201:9300]][config][3]
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected
[2015-01-08 14:02:52,206][WARN ][cluster.action.shard
] [es1] [config][3] sending failed shard for [config][3],
node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED], indexUUID
[xnxor01lSTC8dY-0wwPXlQ], reason [Failed to perform
[indices:data/write/index] on replica, message

[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
[2015-01-08 14:02:52,206][WARN
][cluster.action.shard ] [es1] [config][3] received shard failed for
[config][3], node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED], indexUUID
[xnxor01lSTC8dY-0wwPXlQ], reason [Failed to perform
[indices:data/write/index] on replica, message

[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
....
[2015-01-08 14:02:52,206][WARN
][action.index ] [es1] Failed to perform
indices:data/write/index on remote replica
[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[
/10.109.172.201:9300]][origin_v0101][0]
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected
[2015-01-08 14:02:52,206][WARN ][cluster.action.shard
] [es1] [origin_v0101][0] sending failed shard for [origin_v0101][0],
node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED], indexUUID
[_G8gVWViS6OoX59MHJtwhA], reason [Failed to perform
[indices:data/write/index] on replica, message

[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
[2015-01-08 14:02:52,206][WARN
][cluster.action.shard ] [es1] [origin_v0101][0] received shard
failed for [origin_v0101][0], node[jAhWlTiKTASdHDQaZGVncw], [R],
s[STARTED], indexUUID [_G8gVWViS6OoX59MHJtwhA], reason [Failed to
perform [indices:data/write/index] on replica, message

[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
[2015-01-08 14:02:52,206][WARN ][action.index
] [es1] Failed to perform indices:data/write/index on remote
replica
[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[
/10.109.172.201:9300]][origin_v0101][0]
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected
[2015-01-08 14:02:52,206][WARN ][cluster.action.shard
] [es1] [origin_v0101][0] sending failed shard for [origin_v0101][0],
node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED], indexUUID
[_G8gVWViS6OoX59MHJtwhA], reason [Failed to perform
[indices:data/write/index] on replica, message

[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
[2015-01-08 14:02:52,207][WARN
][cluster.action.shard ] [es1] [origin_v0101][0] received shard
failed for [origin_v0101][0], node[jAhWlTiKTASdHDQaZGVncw], [R],
s[STARTED], indexUUID [_G8gVWViS6OoX59MHJtwhA], reason [Failed to
perform [indices:data/write/index] on replica, message

[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
....
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected
[2015-01-08 14:02:52,230][WARN
][cluster.action.shard ] [es1] [origin_v0101][0] sending failed
shard for [origin_v0101][0], node[jAhWlTiKTASdHDQaZGVncw], [R],
s[STARTED], indexUUID [_G8gVWViS6OoX59MHJtwhA], reason [Failed to
perform [indices:data/write/index] on replica, message

[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
[2015-01-08 14:02:52,230][WARN
][cluster.action.shard ] [es1] [origin_v0101][0] received shard
failed for [origin_v0101][0], node[jAhWlTiKTASdHDQaZGVncw], [R],
s[STARTED], indexUUID [_G8gVWViS6OoX59MHJtwhA], reason [Failed to
perform [indices:data/write/index] on replica, message

[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
[2015-01-08
14:02:52,230][DEBUG][action.admin.indices.stats] [es1]
[event-v1-20141227][4], node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED]:
failed to execute
[org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@d1974d5]
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:monitor/stats[s]] disconnected
[2015-01-08
14:02:52,227][WARN ][action.index ] [es1] Failed to perform
indices:data/write/index on remote replica
[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[
/10.109.172.201:9300]][origin_v0101][0]
org.elasticsearch.transport.SendRequestTransportException:
[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
at
org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:213)

at

org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnReplica(TransportShardReplicationOperationAction.java:669)

at

org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performReplicas(TransportShardReplicationOperationAction.java:641)

at

org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:512)

at

org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:419)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.NodeNotConnectedException:
[es2][inet[/10.109.172.201:9300]] Node not connected
at
org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:946)
at
org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:640)
at
org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:199)
... 7 more
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:monitor/stats[s]] disconnected
[2015-01-08
14:02:52,232][WARN ][action.index ] [es1] Failed to perform
indices:data/write/index on remote replica
[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300
]][config][3]
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected
[2015-01-08 14:02:52,232][WARN ][search.action ] [es1] Failed
to send release search context
org.elasticsearch.transport.SendRequestTransportException:

[es2][inet[/10.109.172.201:9300]][indices:data/read/search[free_context]]
at org.elasticsearch.transport.TransportService.sendRequest(
TransportService.java:213)
at org.elasticsearch.transport.TransportService.sendRequest(
TransportService.java:183)
at org.elasticsearch.search.action.SearchServiceTransportAction.
sendFreeContext(SearchServiceTransportAction.java:143)

at
org.elasticsearch.action.search.type.
TransportSearchTypeAction$BaseAsyncAction.releaseIrrelevantSearchContexts(
TransportSearchTypeAction.java:341)

at
org.elasticsearch.action.search.type.
TransportSearchQueryThenFetchAction$AsyncAction$2.run(
TransportSearchQueryThenFetchAction.java:158)
at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [es2][
inet[/10.109.172.201:9300]] Node not connected
at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(
NettyTransport.java:946)
at org.elasticsearch.
...

--
Please update your bookmarks! We have moved to https://discuss.elastic.co/

You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/7ea5bc0a-7bf9-4842-935c-1fd7cf90bfc2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Vincent Biret) #4

Hello everyone,

Actually we had a lot of network issues between our nodes on AWS (nodes are on CentOS 7), looking into kernel messages I got several messages like :

xennet: skb rides the rocket: 19 slots
xennet: skb rides the rocket: 19 slots
xennet: skb rides the rocket: 19 slots

This bug is referred here :
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1317811

Fix: Disable scatter-gather on your network card

sudo ethtool -K eth0 sg off

Since, I don't have any other problem and our cluster is super healthy !


The Pipeline is Blocked, due to node_not_connected_exception with connection to Data Nodes
(system) #5