Some quick stats:
12 nodes running Elasticsearch 0.19.2
All indices have 4 shards
"Live" index has 2 replicas
"New" index (the one being re-indexed) has 0 replicas
Old (backup) indices have 0 replicas
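The index layout above and the re-indexing workflow (write with 0 replicas, raise the replica count, move the alias) come down to three REST requests. The sketch below only builds their JSON bodies so the steps are explicit. It assumes the standard create-index, `_settings` and `_aliases` endpoints. The index and alias names are hypothetical placeholders and are not taken from our setup.

```java
// Sketch only: builds the request bodies for the three re-indexing steps.
// Index and alias names are hypothetical; names are assumed to need no JSON escaping.
public class ReindexWorkflow {

    /** Step 1 (PUT /{index}): create the new index with N shards and 0 replicas. */
    static String createIndexBody(int shards) {
        return "{\"settings\":{\"number_of_shards\":" + shards
                + ",\"number_of_replicas\":0}}";
    }

    /** Step 2 (PUT /{index}/_settings): raise the replica count after bulk indexing. */
    static String raiseReplicasBody(int replicas) {
        return "{\"index\":{\"number_of_replicas\":" + replicas + "}}";
    }

    /** Step 3 (POST /_aliases): move the alias from the old index to the new one in one call. */
    static String swapAliasBody(String alias, String oldIndex, String newIndex) {
        return "{\"actions\":["
                + "{\"remove\":{\"index\":\"" + oldIndex + "\",\"alias\":\"" + alias + "\"}},"
                + "{\"add\":{\"index\":\"" + newIndex + "\",\"alias\":\"" + alias + "\"}}"
                + "]}";
    }

    public static void main(String[] args) {
        String alias = "products";          // hypothetical alias name
        String oldIndex = "products-old";   // hypothetical
        String newIndex = "products-new";   // hypothetical

        System.out.println("PUT /" + newIndex + " " + createIndexBody(4));
        // ... bulk indexing into newIndex happens here ...
        System.out.println("PUT /" + newIndex + "/_settings " + raiseReplicasBody(1));
        // Only move the alias once the cluster reports the new index as green.
        System.out.println("POST /_aliases " + swapAliasBody(alias, oldIndex, newIndex));
    }
}
```

The remove and add actions go in a single `_aliases` call, so searchers never see the alias missing. Step 2 is where our SAN trouble hit. Because step 3 never ran, searches kept going to the old index.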
Our current re-indexing workflow is: write to a new index with 0 replicas,
increase the replica count once indexing finishes, then move the alias.
During the second step (raising the replicas from 0 to 1), our SAN was
having issues due to an ongoing upgrade. Thankfully, the alias was never
moved, so the searchers kept working. After that, however, our "small" delta
bulk updates, which write to the currently aliased index, started to fail:
[0]: index [products-20121021-172240], type [product], id [564888740],
message [UnavailableShardsException[[products-20121021-172240][3] [3]
shardIt, [0] active : Timeout waiting for [1m], request:
org.elasticsearch.action.bulk.BulkShardRequest@22a75994]]
[1]: index [products-20121021-172240], type [product], id [564888241],
message [UnavailableShardsException[[products-20121021-172240][3] [3]
shardIt, [0] active : Timeout waiting for [1m], request:
org.elasticsearch.action.bulk.BulkShardRequest@22a75994]]
[2]: index [products-20121021-172240], type [product], id [564888235],
message [UnavailableShardsException[[products-20121021-172240][2] [3]
shardIt, [1] active : Timeout waiting for [1m], request:
org.elasticsearch.action.bulk.BulkShardRequest@7dae06ac]]
...
This index is the current live index, not the one that failed to replicate.
Not every bulk request failed, only some of them. Our batch size is 2500,
and a failing request's BulkResponse failure message contained 2500
UnavailableShardsExceptions, i.e. one for every document in the batch. I
then deleted the problematic index (the one that failed to replicate), which
had 6 good shards and 2 bad ones (the two copies of shard [2] were
INITIALIZING and UNASSIGNED). The cluster returned to a green state, but
indexing into the good (live) index still produced
UnavailableShardsExceptions. I removed all replicas from the live index
(replicas=0) and then re-added them (cluster back to green). Indexing still
fails.
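With 2500 failures per batch, it helps to tally them by shard to see which shards have no active copy. The sketch below assumes the message format shown in the logs above, `[index][shard] [N] shardIt, [M] active`. It reads N as the number of copies the request was routed to and M as how many of them were active. That reading is my interpretation, not something the logs state outright. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: count UnavailableShardsException failures per (index, shard, active/copies).
// Assumes the "[index][shard] [N] shardIt, [M] active" format from the failure messages.
public class ShardFailureTally {

    private static final Pattern FAILURE = Pattern.compile(
            "UnavailableShardsException\\[\\[([^\\]]+)\\]\\[(\\d+)\\] \\[(\\d+)\\] shardIt, \\[(\\d+)\\] active");

    /** Returns keys like "index[3] 0/3 active" mapped to the number of failed items. */
    static Map<String, Integer> tally(List<String> messages) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String msg : messages) {
            Matcher m = FAILURE.matcher(msg);
            if (!m.find()) {
                continue; // some other failure type; not counted here
            }
            String key = m.group(1) + "[" + m.group(2) + "] "
                    + m.group(4) + "/" + m.group(3) + " active";
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two of the messages from the first log excerpt, each joined onto one line.
        List<String> sample = List.of(
                "UnavailableShardsException[[products-20121021-172240][3] [3] shardIt, [0] active : Timeout waiting for [1m]",
                "UnavailableShardsException[[products-20121021-172240][2] [3] shardIt, [1] active : Timeout waiting for [1m]");
        tally(sample).forEach((k, v) -> System.out.println(k + " -> " + v + " failed item(s)"));
    }
}
```

Under this reading, "[1] shardIt, [0] active" on the new 0-replica index would mean that the primary itself was not active.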
The next step was to re-index everything into a new index. The exceptions
still occur, even though that index has no replicas:
failure in bulk execution:
...
[2499]: index [products-20121023-120728], type [product], id [621248845],
message [UnavailableShardsException[[products-20121023-120728][1] [1]
shardIt, [0] active : Timeout waiting for [1m], request:
org.elasticsearch.action.bulk.BulkShardRequest@3d1c53df]]
I'm at a loss as to how to proceed. If even a brand-new index fails, what
are my options?
Cheers,
Ivan