Elasticsearch 1.4 node loss discovery

I'm currently testing how Elasticsearch 1.4.5 behaves when an entire routing allocation zone is lost. The zones map to AWS availability zones.

The cluster consists of 6 nodes, all master-eligible, spread across 3 zones with 2 nodes in each:

es01 - Zone A - master node
es02 - Zone B
es03 - Zone C
es04 - Zone A
es05 - Zone B
es06 - Zone C
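For completeness: allocation awareness is keyed on the node.zone attribute (full config below). Forced awareness could also be set so that replicas from a lost zone stay unassigned rather than being rebuilt in the surviving zones; a rough sketch, using zone_a/zone_b/zone_c as placeholder attribute values (my real values are in the config further down):

cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: zone_a,zone_b,zone_c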

The process is simple:

  • A stop is issued to es03 and es06 via the AWS console.
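To observe the failover I poll cluster health from one of the surviving nodes; a typical check (assuming the default HTTP port) is:

curl -s 'http://localhost:9200/_cluster/health?pretty'

The number_of_nodes and status fields are where the behaviour below shows up.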

The cluster then takes around 15 minutes to return to an operable state:

  • For around 15 minutes cluster health reports 6 nodes (it should be 4) and a green status.
  • After this it changes to 4 nodes, with the logs showing that es03 and es06 have been removed.
  • For nearly all of this time the logs show:

failed to list shard stores on node [gn0RQW3kQQCKQl6x1aL2pw]
org.elasticsearch.action.FailedNodeException: Failed node [gn0RQW3kQQCKQl6x1aL2pw]
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.onFailure(TransportNodesOperationAction.java:206)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$1000(TransportNodesOperationAction.java:97)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction$4.handleException(TransportNodesOperationAction.java:178)
at org.elasticsearch.transport.TransportService$3.run(TransportService.java:217)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.SendRequestTransportException: [pp-pod2-es03-apollo][inet[/10.244.43.10:9300]][internal:cluster/nodes/indices/shard/store[n]]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:213)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.start(TransportNodesOperationAction.java:165)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$300(TransportNodesOperationAction.java:97)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:70)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:43)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:55)
at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.list(TransportNodesListShardStoreMetaData.java:79)
at org.elasticsearch.gateway.local.LocalGatewayAllocator.buildShardStores(LocalGatewayAllocator.java:454)
at org.elasticsearch.gateway.local.LocalGatewayAllocator.allocateUnassigned(LocalGatewayAllocator.java:292)
at org.elasticsearch.cluster.routing.allocation.allocator.ShardsAllocators.allocateUnassigned(ShardsAllocators.java:74)
at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:217)
at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:160)
at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:146)
at org.elasticsearch.discovery.zen.ZenDiscovery$5.execute(ZenDiscovery.java:538)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:347)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:184)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:154)
... 3 more
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [pp-pod2-es03-apollo][inet[/10.244.43.10:9300]] Node not connected
at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:946)
at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:640)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:199)
... 20 more
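While these exceptions are being logged, the master is repeatedly rerouting (the LocalGatewayAllocator and AllocationService.reroute frames above come from that loop). The backlog of cluster state updates can be inspected with the pending tasks API (default port assumed):

curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'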

The elasticsearch.yml configuration (shown here for es01) is:

cluster.name: testcluster
cluster.routing.allocation.awareness.attributes: zone

discovery.zen.minimum_master_nodes: 4
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["es01", "es02", "es03", "es04", "es05", "es06"]

jmx.create_connector: true

node.max_local_storage_nodes: 1
node.name: es01
node.zone: Zone A

path.data: /tmp/es01

threadpool.bulk.queue_size: 100
threadpool.bulk.size: 30
threadpool.bulk.type: fixed

action.auto_create_index: false
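Node loss detection is governed by the zen fault-detection settings, none of which are set in my config, so the defaults apply. For reference, a sketch of the relevant knobs with the documented 1.x default values:

discovery.zen.fd.ping_interval: 1s   # how often each node is pinged
discovery.zen.fd.ping_timeout: 30s   # how long to wait for a ping response
discovery.zen.fd.ping_retries: 3     # consecutive failures before a node is declared lost

With those defaults I would expect a dead node to be detected within roughly 90 seconds, so 15 minutes seems far longer than fault detection alone should take.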

Is anyone able to provide any advice on why it takes so long for the cluster to detect that es03 and es06 have been lost?

Thanks,

Brent

I'd start by upgrading; there are known issues with older versions that can cause data corruption and other general resiliency problems. See https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html for more.
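As a quick sanity check after upgrading, the root endpoint reports the running version on each node (default port assumed):

curl -s 'http://localhost:9200/'

The version.number field in the response should match the release you upgraded to.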