I'm currently testing how resilient Elasticsearch (version 1.4.5) is when an entire allocation-awareness zone is lost. The zones map to AWS availability zones.
The cluster has 6 nodes, all master eligible, spread across 3 zones with 2 nodes in each (the per-node zone setting is shown just after this list):
es01 - Zone A (currently the elected master)
es02 - Zone B
es03 - Zone C
es04 - Zone A
es05 - Zone B
es06 - Zone C
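Allocation awareness is driven by a per-node zone attribute in elasticsearch.yml, roughly like this (es03 shown purely as an illustration; the full es01 config is further down):

node.name: es03
node.zone: Zone C
cluster.routing.allocation.awareness.attributes: zone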
The test process is simple:
- A stop is issued to es03 and es06 (the two Zone C nodes) via the AWS console.
The cluster then takes around 15 minutes to return to an operable state:
- For around 15 minutes the cluster health still reports 6 nodes (it should be 4) and a green status.
- After that the node count drops to 4, and the logs show that es03 and es06 have been removed.
- For nearly all of this time the logs show:
failed to list shard stores on node [gn0RQW3kQQCKQl6x1aL2pw]
org.elasticsearch.action.FailedNodeException: Failed node [gn0RQW3kQQCKQl6x1aL2pw]
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.onFailure(TransportNodesOperationAction.java:206)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$1000(TransportNodesOperationAction.java:97)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction$4.handleException(TransportNodesOperationAction.java:178)
at org.elasticsearch.transport.TransportService$3.run(TransportService.java:217)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.SendRequestTransportException: [pp-pod2-es03-apollo][inet[/10.244.43.10:9300]][internal:cluster/nodes/indices/shard/store[n]]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:213)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.start(TransportNodesOperationAction.java:165)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$300(TransportNodesOperationAction.java:97)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:70)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:43)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:55)
at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.list(TransportNodesListShardStoreMetaData.java:79)
at org.elasticsearch.gateway.local.LocalGatewayAllocator.buildShardStores(LocalGatewayAllocator.java:454)
at org.elasticsearch.gateway.local.LocalGatewayAllocator.allocateUnassigned(LocalGatewayAllocator.java:292)
at org.elasticsearch.cluster.routing.allocation.allocator.ShardsAllocators.allocateUnassigned(ShardsAllocators.java:74)
at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:217)
at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:160)
at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:146)
at org.elasticsearch.discovery.zen.ZenDiscovery$5.execute(ZenDiscovery.java:538)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:347)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:184)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:154)
... 3 more
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [pp-pod2-es03-apollo][inet[/10.244.43.10:9300]] Node not connected
at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:946)
at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:640)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:199)
... 20 more
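For reference, I'm checking the node count and health from one of the surviving nodes with the standard APIs, along these lines (default HTTP port 9200 assumed):

curl -XGET 'http://es01:9200/_cluster/health?pretty'
curl -XGET 'http://es01:9200/_cat/nodes?v'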
The elasticsearch.yml configuration for es01 is:
cluster.name: testcluster
cluster.routing.allocation.awareness.attributes: zone
discovery.zen.minimum_master_nodes: 4
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["es01", "es02", "es03", "es04", "es05", "es06"]
jmx.create_connector: true
node.max_local_storage_nodes: 1
node.name: es01
node.zone: Zone A
path.data: /tmp/es01
threadpool.bulk.queue_size: 100
threadpool.bulk.size: 30
threadpool.bulk.type: fixed
action.auto_create_index: false
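None of the zen fault detection settings are overridden above, so (assuming I'm reading the 1.4 documentation correctly) the defaults should be in effect:

discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 3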
Can anyone offer any advice on why it takes so long for the cluster to detect that es03 and es06 have been lost?
Thanks,
Brent