I have a 2-node cluster (configured for 1 replica) that has been running for a
couple of weeks. It looks like it started to get unhappy, logging lots of
the messages below and eventually throwing lots of the subsequent
exceptions. This all ended with the cluster in an unhappy state (red on
node1, yellow on node2). node2 was still listed in the cluster nodes when
querying node1, but two indexes showed recovering shards with no apparent
progress, and four more shards were unassigned.
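For reference, a minimal sketch of how I checked that state via the cluster health API (Python; the localhost:9200 endpoint is just an assumption, adjust for your setup):

import json
import urllib.request

def cluster_health(host="localhost", port=9200):
    # _cluster/health reports the overall status (green/yellow/red) along
    # with counts of initializing, relocating and unassigned shards.
    url = "http://%s:%d/_cluster/health" % (host, port)
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

health = cluster_health()
print(health["status"], "unassigned shards:", health["unassigned_shards"])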
I was able to track down that one of our crons had pruned an old index
(logstash-2012.05.29) from the second node in the cluster as part of
clean-up. I'm guessing that ES didn't like the index being deleted out
from under that second node and subsequently decided to ignore that
node's participation in the cluster? I ended up deleting that same index
from node1 and then restarting node2, and the cluster started recovering,
with node2 basically getting wiped completely and syncing everything from
node1 from scratch. Did node1 decide that node2 was in a bad state because
of the index deletion and choose to wipe it completely and re-copy all the
indexes/shards to it? That's rather expensive behavior, so my question is:
can it be avoided?
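As a guess at avoiding it in the first place, here's a minimal sketch of what I assume the clean-up cron ought to do instead: prune old indexes through the delete-index API, so the deletion goes through the master and is applied cluster-wide, rather than removing the index directory from a single node's disk (Python; host, port and the index name are illustrative):

import urllib.request

def delete_index(name, host="localhost", port=9200):
    # DELETE /<index> is handled by the master, which removes the index
    # from the cluster state and from every node holding its shards.
    url = "http://%s:%d/%s" % (host, port, name)
    req = urllib.request.Request(url, method="DELETE")
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

delete_index("logstash-2012.05.29")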
Thoughts? Log file entries are below.
==
The first node in the cluster had these logs (lots of
UnavailableShardsException entries, followed by the IndexMissingException):
[2012-06-06 20:01:00,041][WARN ][river.rabbitmq ] [sl02.mail01] [rabbitmq][logstash-n05_mail01] failed to execute failure in bulk execution:
[0]: index [logstash-2012.06.06], type [linux-syslog], id [s1aaRxwSTemVxq4aQT0yxw], message [UnavailableShardsException[[logstash-2012.06.06][3] [2] shardIt, [0] active : Timeout waiting for [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@41e32c8e]]
[3]: index [logstash-2012.06.06], type [linux-syslog], id [m0u7l0fSQbSKNXWh_JrV_w], message [UnavailableShardsException[[logstash-2012.06.06][2] [2] shardIt, [0] active : Timeout waiting for [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@2982406d]]
[8]: index [logstash-2012.06.06], type [linux-syslog], id [m2meqViiSkeY3sl-1QgCxw], message [UnavailableShardsException[[logstash-2012.06.06][3] [2] shardIt, [0] active : Timeout waiting for [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@41e32c8e]]
[2012-06-06 20:10:01,806][DEBUG][action.admin.indices.status] [sl02.mail01] [logstash-2012.05.29][4], node[aTLu9iSzQSG3rgKnp5wWRw], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@2f8f4c9f]
org.elasticsearch.transport.RemoteTransportException: [sl03.mail01][inet[/192.168.1.113:9300]][indices/status/s]
Caused by: org.elasticsearch.indices.IndexMissingException: [logstash-2012.05.29] missing
    at org.elasticsearch.indices.InternalIndicesService.indexServiceSafe(InternalIndicesService.java:243)
    at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:152)
    at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:59)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:398)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:384)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:374)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)
The second node in the cluster had the following logs:
[2012-06-06 20:00:00,111][INFO ][cluster.metadata ] [sl03.mail01] [logstash-2012.06.07] creating index, cause [auto(bulk api)], shards [5]/[1], mappings
[2012-06-06 20:00:00,211][WARN ][discovery.zen ] [sl03.mail01] master should not receive new cluster state from [[sl02.mail01][6ohKxf-IQj6lYxB8EaKxKg][inet[/192.168.1.112:9300]]]
[2012-06-06 20:00:00,298][WARN ][discovery.zen ] [sl03.mail01] master should not receive new cluster state from [[sl02.mail01][6ohKxf-IQj6lYxB8EaKxKg][inet[/192.168.1.112:9300]]]
[2012-06-06 20:00:00,688][WARN ][discovery.zen ] [sl03.mail01] master should not receive new cluster state from [[sl02.mail01][6ohKxf-IQj6lYxB8EaKxKg][inet[/192.168.1.112:9300]]]
[2012-06-06 20:00:00,691][WARN ][discovery.zen ] [sl03.mail01] master should not receive new cluster state from [[sl02.mail01][6ohKxf-IQj6lYxB8EaKxKg][inet[/192.168.1.112:9300]]]
[2012-06-06 20:00:00,905][INFO ][cluster.metadata ] [sl03.mail01] [logstash-2012.06.07] update_mapping [linux-syslog] (dynamic)
[2012-06-06 20:00:02,016][WARN ][discovery.zen ] [sl03.mail01] master should not receive new cluster state from [[sl02.mail01][6ohKxf-IQj6lYxB8EaKxKg][inet[/192.168.1.112:9300]]]
[2012-06-06 20:00:02,207][INFO ][cluster.metadata ] [sl03.mail01] [logstash-2012.05.31] deleting index
[2012-06-07 00:06:33,253][WARN ][monitor.jvm ] [sl03.mail01] [gc][ParNew][539986][123575] duration [1.5s], collections [1]/[1.6s], total [1.5s]/[1.2h], memory [11.9gb]->[11.7gb]/[15.9gb]
[2012-06-07 07:01:47,562][WARN ][monitor.jvm ] [sl03.mail01] [gc][ParNew][564889][127077] duration [3.6s], collections [1]/[4.1s], total [3.6s]/[1.2h], memory [11.3gb]->[11.1gb]/[15.9gb]
[2012-06-07 15:26:59,238][WARN ][monitor.jvm ] [sl03.mail01] [gc][ParNew][595186][133891] duration [1.2s], collections [1]/[1.4s], total [1.2s]/[1.3h], memory [10.3gb]->[10.1gb]/[15.9gb]
[2012-06-07 20:00:00,154][INFO ][cluster.metadata ] [sl03.mail01] [logstash-2012.06.08] creating index, cause [auto(bulk api)], shards [5]/[1], mappings
[2012-06-07 20:00:00,275][WARN ][discovery.zen ] [sl03.mail01] master should not receive new cluster state from [[sl02.mail01][6ohKxf-IQj6lYxB8EaKxKg][inet[/192.168.1.112:9300]]]
[2012-06-07 20:00:00,419][WARN ][discovery.zen ] [sl03.mail01] master should not receive new cluster state from [[sl02.mail01][6ohKxf-IQj6lYxB8EaKxKg][inet[/192.168.1.112:9300]]]
[2012-06-07 20:00:00,546][WARN ][discovery.zen ] [sl03.mail01] master should not receive new cluster state from [[sl02.mail01][6ohKxf-IQj6lYxB8EaKxKg][inet[/192.168.1.112:9300]]]
[2012-06-07 20:00:00,550][WARN ][discovery.zen ] [sl03.mail01] master should not receive new cluster state from [[sl02.mail01][6ohKxf-IQj6lYxB8EaKxKg][inet[/192.168.1.112:9300]]]
[2012-06-07 20:00:00,961][INFO ][cluster.metadata ] [sl03.mail01] [logstash-2012.06.08] update_mapping [linux-syslog] (dynamic)
[2012-06-07 20:00:01,580][INFO ][cluster.metadata ] [sl03.mail01] [logstash-2012.06.01] deleting index
[2012-06-07 20:10:01,384][WARN ][discovery.zen ] [sl03.mail01] master should not receive new cluster state from [[sl02.mail01][6ohKxf-IQj6lYxB8EaKxKg][inet[/192.168.1.112:9300]]]
[2012-06-08 08:52:42,269][WARN ][monitor.jvm ] [sl03.mail01] [gc][ParNew][657898][144133] duration [2.8s], collections [1]/[2.8s], total [2.8s]/[1.4h], memory [10.6gb]->[10.5gb]/[15.9gb]
[2012-06-08 09:01:21,987][WARN ][monitor.jvm ] [sl03.mail01] [gc][ParNew][658412][144249] duration [5s], collections [1]/[5.9s], total [5s]/[1.4h], memory [10.8gb]->[10.7gb]/[15.9gb]
[2012-06-08 09:01:36,775][WARN ][monitor.jvm ] [sl03.mail01] [gc][ParNew][658426][144250] duration [1s], collections [1]/[1.7s], total [1s]/[1.4h], memory [10.8gb]->[10.8gb]/[15.9gb]
[2012-06-08 09:16:13,234][WARN ][monitor.jvm ] [sl03.mail01] [gc][ParNew][659284][144484] duration [18.7s], collections [1]/[19.1s], total [18.7s]/[1.4h], memory [11.2gb]->[11.1gb]/[15.9gb]
[2012-06-08 09:26:07,903][WARN ][monitor.jvm ] [sl03.mail01] [gc][ParNew][659877][144662] duration [1s], collections [1]/[1.9s], total [1s]/[1.4h], memory [11.6gb]->[11.5gb]/[15.9gb]
[2012-06-08 09:35:05,191][WARN ][monitor.jvm ] [sl03.mail01] [gc][ParNew][660412][144806] duration [1.7s], collections [1]/[2.7s], total [1.7s]/[1.4h], memory [11.8gb]->[11.7gb]/[15.9gb]