While doing some large scale indexing, I run across this problem
fairly often, especially when having indexed > 200m records. This is a
2-node cluster with 1 index and 50 shards 0 replicas. Index size is
~600GB per node with 390m records currently. Each node has 4-250GB
EBS volumes with the ES data spread across all 4 volumes. Each volume
is 62% full. Each bulk indexing request has 10k records of ~3k each.
[1]: index [documents], type [records], id [43-116530000], message
[UnavailableShardsException[[documents][21] [1] shardIt, [0] active :
Timeout waiting for [1m], request:
org.elasticsearch.action.bulk.BulkShardRequest@2c761b38]]
It looks like after this error occurs, that the master will disconnect
and restart, but elects itself as master instead of reconnecting with
the other node. I have to stop both nodes, and then restart to get the
cluster back. I have noted similar behavior with a 4-node cluster.
My basic question is: is the timeout being caused by delays in AWS,
EBS, or is it more likely a result of the large index size and how
Lucene handles disk I/O?
Secondarily, can this situation be avoided? If disk latency is too
high, can we back off/recover more gracefully?
Thanks!
- Craig
MASTER NODE
[2012-04-27 15:06:59,308][INFO ][node ] [Courier]
{0.19.2}[7517]: started
[2012-04-27 15:07:12,156][INFO ][gateway ] [Courier]
recovered [1] indices into cluster_state
[2012-04-27 15:07:28,506][INFO ][cluster.service ] [Courier]
added {[Styx and Stone][1p004qNdQ8iq2cL3otF1Jg][inet[/
10.33.15.1:9300]],}, reason: zen-disco-receive(join from node[[Styx
and Stone][1p004qNdQ8iq2cL3otF1Jg][inet[/10.33.15.1:9300]]])
[2012-04-27 15:11:21,474][INFO ][cluster.service ] [Courier]
added {[Summers, Gabriel][QzF5z1yGQ2K-kLESKRspwQ][inet[/
10.38.249.226:9301]]{client=true, data=false},}, reason: zen-disco-
receive(join from node[[Summers, Gabriel][QzF5z1yGQ2K-kLESKRspwQ]
[inet[/10.38.249.226:9301]]{client=true, data=false}])
[2012-04-27 16:12:51,368][INFO ][cluster.service ] [Courier]
removed {[Summers, Gabriel][QzF5z1yGQ2K-kLESKRspwQ][inet[/
10.38.249.226:9301]]{client=true, data=false},}, reason: zen-disco-
node_left([Summers, Gabriel][QzF5z1yGQ2K-kLESKRspwQ][inet[/
10.38.249.226:9301]]{client=true, data=false})
[2012-04-27 16:15:16,046][INFO ][cluster.service ] [Courier]
added {[Infant Terrible][IyaA9PpDSWKJ8H-7KjlScg][inet[/
10.38.249.226:9301]]{client=true, data=false},}, reason: zen-disco-
receive(join from node[[Infant Terrible][IyaA9PpDSWKJ8H-7KjlScg][inet[/
10.38.249.226:9301]]{client=true, data=false}])
[2012-04-27 17:09:28,431][INFO ][cluster.service ] [Courier]
removed {[Infant Terrible][IyaA9PpDSWKJ8H-7KjlScg][inet[/
10.38.249.226:9301]]{client=true, data=false},}, reason: zen-disco-
node_left([Infant Terrible][IyaA9PpDSWKJ8H-7KjlScg][inet[/
10.38.249.226:9301]]{client=true, data=false})
[2012-04-27 17:11:53,985][INFO ][cluster.service ] [Courier]
added {[Glamor][TZwHuJlzRE-cDB-MLKA-tw][inet[/10.38.249.226:9301]]
{client=true, data=false},}, reason: zen-disco-receive(join from
node[[Glamor][TZwHuJlzRE-cDB-MLKA-tw][inet[/10.38.249.226:9301]]
{client=true, data=false}])
[2012-04-27 17:59:17,030][INFO ][node ] [Multiple
Man] {0.19.2}[10023]: initializing ...
[2012-04-27 17:59:17,048][INFO ][plugins ] [Multiple
Man] loaded [], sites [bigdesk]
[2012-04-27 17:59:20,068][INFO ][node ] [Multiple
Man] {0.19.2}[10023]: initialized
[2012-04-27 17:59:20,069][INFO ][node ] [Multiple
Man] {0.19.2}[10023]: starting ...
[2012-04-27 17:59:20,173][INFO ][transport ] [Multiple
Man] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address
{inet[/10.38.249.226:9300]}
[2012-04-27 17:59:24,694][INFO ][cluster.service ] [Multiple
Man] new_master [Multiple Man][zQvB_5w3Tee3rW-EJ6Iahw][inet[/
10.38.249.226:9300]], reason: zen-disco-join (elected_as_master)
[2012-04-27 17:59:24,727][INFO ][discovery ] [Multiple
Man] youwhodoc/zQvB_5w3Tee3rW-EJ6Iahw
[2012-04-27 17:59:24,742][INFO ][http ] [Multiple
Man] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address
{inet[/10.38.249.226:9200]}
[2012-04-27 17:59:24,742][INFO ][node ] [Multiple
Man] {0.19.2}[10023]: started
[2012-04-27 17:59:44,560][INFO ][gateway ] [Multiple
Man] recovered [1] indices into cluster_state
[2012-04-27 15:07:05,713][INFO ][transport ] [Styx and
Stone] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address
{inet[/10.33.15.1:9300]}
[2012-04-27 15:07:28,517][INFO ][cluster.service ] [Styx and
Stone] detected_master [Courier][UmhafZyUQB-r3weDCPftIQ][inet[/
10.38.249.226:9300]], added {[Courier][UmhafZyUQB-r3weDCPftIQ][inet[/
10.38.249.226:9300]],}, reason: zen-disco-receive(from master
[[Courier][UmhafZyUQB-r3weDCPftIQ][inet[/10.38.249.226:9300]]])
[2012-04-27 15:07:28,527][INFO ][discovery ] [Styx and
Stone] youwhodoc/1p004qNdQ8iq2cL3otF1Jg
[2012-04-27 15:07:28,532][INFO ][http ] [Styx and
Stone] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address
{inet[/10.33.15.1:9200]}
[2012-04-27 15:07:28,532][INFO ][node ] [Styx and
Stone] {0.19.2}[8991]: started
[2012-04-27 15:11:21,476][INFO ][cluster.service ] [Styx and
Stone] added {[Summers, Gabriel][QzF5z1yGQ2K-kLESKRspwQ][inet[/
10.38.249.226:9301]]{client=true, data=false},}, reason: zen-disco-
receive(from master [[Courier][UmhafZyUQB-r3weDCPftIQ][inet[/
10.38.249.226:9300]]])
[2012-04-27 16:12:51,386][INFO ][cluster.service ] [Styx and
Stone] removed {[Summers, Gabriel][QzF5z1yGQ2K-kLESKRspwQ][inet[/
10.38.249.226:9301]]{client=true, data=false},}, reason: zen-disco-
receive(from master [[Courier][UmhafZyUQB-r3weDCPftIQ][inet[/
10.38.249.226:9300]]])
[2012-04-27 16:15:16,050][INFO ][cluster.service ] [Styx and
Stone] added {[Infant Terrible][IyaA9PpDSWKJ8H-7KjlScg][inet[/
10.38.249.226:9301]]{client=true, data=false},}, reason: zen-disco-
receive(from master [[Courier][UmhafZyUQB-r3weDCPftIQ][inet[/
10.38.249.226:9300]]])
[2012-04-27 17:09:28,717][INFO ][cluster.service ] [Styx and
Stone] removed {[Infant Terrible][IyaA9PpDSWKJ8H-7KjlScg][inet[/
10.38.249.226:9301]]{client=true, data=false},}, reason: zen-disco-
receive(from master [[Courier][UmhafZyUQB-r3weDCPftIQ][inet[/
10.38.249.226:9300]]])
[2012-04-27 17:11:53,989][INFO ][cluster.service ] [Styx and
Stone] added {[Glamor][TZwHuJlzRE-cDB-MLKA-tw][inet[/
10.38.249.226:9301]]{client=true, data=false},}, reason: zen-disco-
receive(from master [[Courier][UmhafZyUQB-r3weDCPftIQ][inet[/
10.38.249.226:9300]]])
[2012-04-27 17:59:10,337][INFO ][discovery.zen ] [Styx and
Stone] master_left [[Courier][UmhafZyUQB-r3weDCPftIQ][inet[/
10.38.249.226:9300]]], reason [transport disconnected (with verified
connect)]
[2012-04-27 17:59:10,386][INFO ][cluster.service ] [Styx and
Stone] master {new [Styx and Stone][1p004qNdQ8iq2cL3otF1Jg][inet[/
10.33.15.1:9300]], previous [Courier][UmhafZyUQB-r3weDCPftIQ][inet[/
10.38.249.226:9300]]}, removed {[Courier][UmhafZyUQB-r3weDCPftIQ]
[inet[/10.38.249.226:9300]],}, reason: zen-disco-master_failed
([Courier][UmhafZyUQB-r3weDCPftIQ][inet[/10.38.249.226:9300]])