Hi,
I have a 3-node cluster in EC2 and am seeing frequent
NodeNotConnectedException-related errors which cause intermittent failures
during indexing. I'm hoping someone knows what this is about and can help.
Thanks in advance for your help. Here are the details:
There are 3 nodes (es1, es2 and es3 - all are defined to be
node.master=true, node.data=true - and es1 is the current master). All
three nodes are running ES 1.4.2, 15GB heap, r3.xlarge instances, JDK
1.7.0_72. We are using the AWS-Cloud plugin for ec2 discovery. The
discovery part works fine I think and we haven't had problems there.
What we are seeing is that the cluster runs fine most of the time, but
periodically (say once every hour or two) we see failures in the logs on
es1 (the master node), both for indexing and for the
[indices:monitor/stats] APIs (the latter are debug messages). They seem to
happen because the connection between the master node (es1) and one of the
other nodes is lost.
I searched this mailing list and then configured TCP keepalive settings.
I think it helped, but I'm not really sure, since the "node not
connected" errors are still happening.
Here is a section of the master log that shows the exceptions:
[2015-01-08 14:02:52,203][DEBUG][action.admin.indices.stats] [es1]
[alert][0], node[jAhWlTiKTASdHDQaZGVncw], [P], s[STARTED]: failed to
execute
[org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@2a694684]
org.elasticsearch.transport.NodeDisconnectedException: [es2][inet[/10.109.172.201:9300]][indices:monitor/stats[s]]
disconnected
<....deleted for brevity - Bunch of these exceptions on index stats for
each of the indexes we have....>
[2015-01-08
14:02:52,205][WARN ][action.index ] [es1] Failed to perform
indices:data/write/index on remote replica
[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[
/10.109.172.201:9300]][config][3]
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected
[2015-01-08 14:02:52,206][WARN ][cluster.action.shard
] [es1] [config][3] sending failed shard for [config][3],
node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED], indexUUID
[xnxor01lSTC8dY-0wwPXlQ], reason [Failed to perform
[indices:data/write/index] on replica, message
[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
[2015-01-08 14:02:52,206][WARN
][cluster.action.shard ] [es1] [config][3] received shard failed for
[config][3], node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED], indexUUID
[xnxor01lSTC8dY-0wwPXlQ], reason [Failed to perform
[indices:data/write/index] on replica, message
[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
....
[2015-01-08 14:02:52,206][WARN
][action.index ] [es1] Failed to perform
indices:data/write/index on remote replica
[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[
/10.109.172.201:9300]][origin_v0101][0]
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected
[2015-01-08 14:02:52,206][WARN ][cluster.action.shard
] [es1] [origin_v0101][0] sending failed shard for [origin_v0101][0],
node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED], indexUUID
[_G8gVWViS6OoX59MHJtwhA], reason [Failed to perform
[indices:data/write/index] on replica, message
[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
[2015-01-08 14:02:52,206][WARN
][cluster.action.shard ] [es1] [origin_v0101][0] received shard
failed for [origin_v0101][0], node[jAhWlTiKTASdHDQaZGVncw], [R],
s[STARTED], indexUUID [_G8gVWViS6OoX59MHJtwhA], reason [Failed to
perform [indices:data/write/index] on replica, message
[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
[2015-01-08 14:02:52,206][WARN ][action.index
] [es1] Failed to perform indices:data/write/index on remote
replica
[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[
/10.109.172.201:9300]][origin_v0101][0]
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected
[2015-01-08 14:02:52,206][WARN ][cluster.action.shard
] [es1] [origin_v0101][0] sending failed shard for [origin_v0101][0],
node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED], indexUUID
[_G8gVWViS6OoX59MHJtwhA], reason [Failed to perform
[indices:data/write/index] on replica, message
[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
[2015-01-08 14:02:52,207][WARN
][cluster.action.shard ] [es1] [origin_v0101][0] received shard
failed for [origin_v0101][0], node[jAhWlTiKTASdHDQaZGVncw], [R],
s[STARTED], indexUUID [_G8gVWViS6OoX59MHJtwhA], reason [Failed to
perform [indices:data/write/index] on replica, message
[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
....
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected
[2015-01-08 14:02:52,230][WARN
][cluster.action.shard ] [es1] [origin_v0101][0] sending failed
shard for [origin_v0101][0], node[jAhWlTiKTASdHDQaZGVncw], [R],
s[STARTED], indexUUID [_G8gVWViS6OoX59MHJtwhA], reason [Failed to
perform [indices:data/write/index] on replica, message
[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
[2015-01-08 14:02:52,230][WARN
][cluster.action.shard ] [es1] [origin_v0101][0] received shard
failed for [origin_v0101][0], node[jAhWlTiKTASdHDQaZGVncw], [R],
s[STARTED], indexUUID [_G8gVWViS6OoX59MHJtwhA], reason [Failed to
perform [indices:data/write/index] on replica, message
[NodeDisconnectedException[[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected]]]
[2015-01-08
14:02:52,230][DEBUG][action.admin.indices.stats] [es1]
[event-v1-20141227][4], node[jAhWlTiKTASdHDQaZGVncw], [R], s[STARTED]:
failed to execute
[org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@d1974d5]
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:monitor/stats[s]] disconnected
[2015-01-08
14:02:52,227][WARN ][action.index ] [es1] Failed to perform
indices:data/write/index on remote replica
[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[
/10.109.172.201:9300]][origin_v0101][0]
org.elasticsearch.transport.SendRequestTransportException:
[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
at
org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:213)
at
org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnReplica(TransportShardReplicationOperationAction.java:669)
at
org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performReplicas(TransportShardReplicationOperationAction.java:641)
at
org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:512)
at
org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:419)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.NodeNotConnectedException:
[es2][inet[/10.109.172.201:9300]] Node not connected
at
org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:946)
at
org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:640)
at
org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:199)
... 7 more
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:monitor/stats[s]] disconnected
[2015-01-08
14:02:52,232][WARN ][action.index ] [es1] Failed to perform
indices:data/write/index on remote replica
[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300
]][config][3]
org.elasticsearch.transport.NodeDisconnectedException:
[es2][inet[/10.109.172.201:9300]][indices:data/write/index[r]]
disconnected
[2015-01-08 14:02:52,232][WARN ][search.action ] [es1] Failed to
send release search context
org.elasticsearch.transport.SendRequestTransportException:
[es2][inet[/10.109.172.201:9300]][indices:data/read/search[free_context]]
at org.elasticsearch.transport.TransportService.sendRequest(
TransportService.java:213)
at org.elasticsearch.transport.TransportService.sendRequest(
TransportService.java:183)
at org.elasticsearch.search.action.SearchServiceTransportAction.
sendFreeContext(SearchServiceTransportAction.java:143)
at
org.elasticsearch.action.search.type.
TransportSearchTypeAction$BaseAsyncAction.releaseIrrelevantSearchContexts(
TransportSearchTypeAction.java:341)
at
org.elasticsearch.action.search.type.
TransportSearchQueryThenFetchAction$AsyncAction$2.run(
TransportSearchQueryThenFetchAction.java:158)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [es2][inet
[/10.109.172.201:9300]] Node not connected
at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(
NettyTransport.java:946)
at org.elasticsearch.transport.netty.NettyTransport.sendRequest(
NettyTransport.java:640)
at org.elasticsearch.transport.TransportService.sendRequest(
TransportService.java:199)
..
<..... Bunch of these failures then we get the connection and things settle
down again.......>
[2015-01-08
14:02:54,165][INFO ][cluster.service ] [es1] removed
{[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300
]],},
reason:
zen-disco-node_failed([es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet
[/10.109.172.201:9300]]),
reason transport disconnected
[2015-01-08 14:03:27,330][INFO
][cluster.service ] [es1] added
{[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:9300
]],},
reason: zen-disco-receive(join from
node[[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.172.201:
9300]]])
At the same time on the disconnecting node - es2 - the logs are fairly
minimal/quiet:
[2015-01-08 14:02:55,431][INFO ][discovery.ec2 ] [es2]
master_left [[es1][MJ5njwAqQ_-10imG53WDGw][ip-10-152-16-37][inet[/10.152.
16.37:9300]]], reason [do not exists on master, act as master failure]
[2015-01-08 14:02:55,431][WARN ][discovery.ec2 ] [es2] master
left (reason = do not exists on master, act as master failure), current
nodes: {[es2][jAhWlTiKTASdHDQaZGVncw][ip-10-109-172-201][inet[/10.109.
172.201:9300]],[es3][XVZWtpq7Sc28Cj6C2wd42A][ip-10-79-189-47][inet[/10.79.
189.47:9300]]{master=true},}
[2015-01-08 14:02:55,432][INFO ][cluster.service ] [es2] removed {[
es1][MJ5njwAqQ_-10imG53WDGw][ip-10-152-16-37][inet[/10.152.16.37:9300]],},
reason: zen-disco-master_failed ([es1][MJ5njwAqQ_-10imG53WDGw][ip-10-152-16-
37][inet[/10.152.16.37:9300]])
[2015-01-08 14:03:25,884][INFO ][cluster.service ] [es2]
detected_master [es1][MJ5njwAqQ_-10imG53WDGw][ip-10-152-16-37][inet[/10.152.
16.37:9300]], added {[es1][MJ5njwAqQ_-10imG53WDGw][ip-10-152-16-37][inet[/
10.152.16.37:9300]],}, reason: zen-disco-receive(from master [[es1][
MJ5njwAqQ_-10imG53WDGw][ip-10-152-16-37][inet[/10.152.16.37:9300]]])
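To put a number on how long es2 was out of the cluster, here is a small sketch that computes the gap between the "removed" and "added" timestamps from the es1 log, and between "master_left" and "detected_master" on es2. The timestamps are copied from the log lines above; the helper name gap_seconds is just for illustration.

```python
from datetime import datetime

# Timestamp layout used by the Elasticsearch log lines above,
# e.g. "2015-01-08 14:02:54,165" (comma before the milliseconds).
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def gap_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TS_FORMAT)
    t1 = datetime.strptime(end, LOG_TS_FORMAT)
    return (t1 - t0).total_seconds()


# es1 (master): es2 removed, then es2 added back.
master_gap = gap_seconds("2015-01-08 14:02:54,165", "2015-01-08 14:03:27,330")

# es2: master_left, then detected_master.
es2_gap = gap_seconds("2015-01-08 14:02:55,431", "2015-01-08 14:03:25,884")

print(f"es1 saw es2 missing for {master_gap:.3f} s")
print(f"es2 saw es1 missing for {es2_gap:.3f} s")
```

So each disconnect lasts roughly 30 seconds from either side. The es2 figure is close to the discovery.zen.ping.timeout of 30s in the config further down, though I have not confirmed whether the two are related.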
sysctl.conf changes:
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_probes = 6
net.ipv4.tcp_keepalive_intvl = 10
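For reference, a rough sketch of what these three values imply. It assumes the standard Linux keepalive behaviour (the first probe is sent after tcp_keepalive_time seconds of idle time, further probes follow every tcp_keepalive_intvl seconds, and the connection is dropped after tcp_keepalive_probes unanswered probes) and that keepalive is actually enabled on the sockets in question. The function name is made up for illustration.

```python
def keepalive_detection_seconds(idle_time: int, probes: int, interval: int) -> int:
    """Worst-case seconds before an idle, dead TCP peer is declared gone.

    idle_time -> net.ipv4.tcp_keepalive_time
    probes    -> net.ipv4.tcp_keepalive_probes
    interval  -> net.ipv4.tcp_keepalive_intvl
    """
    return idle_time + probes * interval


# Values from the sysctl.conf changes above.
detection = keepalive_detection_seconds(idle_time=60, probes=6, interval=10)
print(detection)  # 120
```

In other words, with these settings an idle connection starts being probed after one minute, and a dead peer would be noticed after about two minutes.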
Here are our elasticsearch.yml config parameters:
action.disable_delete_all_indices: true
node.name: [es1 OR es2 OR es3]
path.data:
gateway.type: local
gateway.recover_after_nodes: 2
gateway.recover_after_time: 10m
gateway.expected_nodes: 3
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.timeout: 30s
discovery.zen.ping.multicast.enabled: false
cloud:
  aws:
    access_key:
    secret_key:
discovery.type: ec2
discovery.ec2.groups: <group_name>
discovery.ec2.tag.elasticsearch: true
repositories:
  s3:
    bucket:
    region:
    base_path:
index.search.slowlog.threshold.query.warn: 10
(we plan to raise the fielddata settings below, but they are currently set
lower than the 15GB heap would allow)
indices.fielddata.cache.size: 4.8GB
indices.fielddata.breaker.limit: 5.5GB
http.cors.enabled: true
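As a sanity check on discovery.zen.minimum_master_nodes: 2, here is a small sketch of the usual majority rule for master-eligible nodes (n / 2 + 1, rounded down). The function name is hypothetical; the node count of 3 comes from the cluster description above, where all three nodes have node.master=true.

```python
def majority_quorum(master_eligible_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


quorum = majority_quorum(3)
print(quorum)  # 2
```

So the value of 2 in the config matches the majority for a 3-node cluster, which suggests the setting itself is not the issue.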