Recovering from a potentially corrupt cluster state: UnavailableShardsException

Some quick stats:
12 nodes running 0.19.2
All indices have 4 shards.
"Live" index has 2 replicas
"New" index which is getting re-indexed has 0 replicas
Old indices (backup) have 0 replicas

Our current re-indexing workflow is to write to an index with 0 replicas,
increase the replica count after indexing, then move the alias (roughly as
sketched after the errors below). During the second part, increasing the
number of replicas from 0 to 1, our SAN was experiencing issues due to an
ongoing upgrade. Thankfully, the alias was never moved and the searchers
were still able to function. After that, our "small" delta bulk updates,
which use the currently aliased index, started to fail:

[0]: index [products-20121021-172240], type [product], id [564888740],
message [UnavailableShardsException[[products-20121021-172240][3] [3]
shardIt, [0] active : Timeout waiting for [1m], request:
org.elasticsearch.action.bulk.BulkShardRequest@22a75994]]
[1]: index [products-20121021-172240], type [product], id [564888241],
message [UnavailableShardsException[[products-20121021-172240][3] [3]
shardIt, [0] active : Timeout waiting for [1m], request:
org.elasticsearch.action.bulk.BulkShardRequest@22a75994]]
[2]: index [products-20121021-172240], type [product], id [564888235],
message [UnavailableShardsException[[products-20121021-172240][2] [3]
shardIt, [1] active : Timeout waiting for [1m], request:
org.elasticsearch.action.bulk.BulkShardRequest@7dae06ac]]
...
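For reference, the "increase replicas, move alias" part of that workflow,
sketched with plain HTTP; the host, index names, and alias below are
placeholders rather than our actual setup, and we drive this from our own
indexing tool rather than these exact calls:

# Sketch of the "bump replicas, then move the alias" steps, using the
# update-settings, cluster-health and aliases REST endpoints.
import json
import requests

ES = "http://localhost:9200"            # placeholder host
NEW_INDEX = "products-20121023-120728"  # freshly built index (placeholder)
OLD_INDEX = "products-20121021-172240"  # currently aliased index (placeholder)
ALIAS = "products"                      # placeholder alias name

# 1. Raise the replica count on the newly built index from 0 to 2.
requests.put("%s/%s/_settings" % (ES, NEW_INDEX),
             data=json.dumps({"index": {"number_of_replicas": 2}}))

# 2. Wait until the replicas have been allocated (cluster back to green).
requests.get("%s/_cluster/health" % ES,
             params={"wait_for_status": "green", "timeout": "10m"})

# 3. Atomically point the alias at the new index.
actions = {"actions": [
    {"remove": {"index": OLD_INDEX, "alias": ALIAS}},
    {"add":    {"index": NEW_INDEX, "alias": ALIAS}},
]}
requests.post("%s/_aliases" % ES, data=json.dumps(actions))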

This index is the current live index, not the one that failed to replicate.
Not every bulk request was failing, just some of them. Our batch size is
2500, and the BulkResponse failure message contained
2500 UnavailableShardsExceptions. I then deleted the problematic index,
which had 6 good shards and 2 bad shards (the two copies of shard [2] were
stuck in the INITIALIZING and UNASSIGNED states). The cluster returned to a
green state. Indexing to the good index still produced
UnavailableShardsExceptions. I removed all replicas for the live index
(replicas=0) and then re-added them (cluster back to green). Indexing still
fails.
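We use the Java client's BulkResponse, but over plain HTTP the per-item
check looks roughly like this; the host, index name, and the stand-in batch
are placeholders:

# Sketch of inspecting per-item failures in a bulk response. Each item in
# the response carries its own "error" string when it failed, which is
# where the UnavailableShardsExceptions show up.
import json
import requests

ES = "http://localhost:9200"         # placeholder host
INDEX = "products-20121021-172240"   # placeholder index

docs = [{"_id": "564888740", "name": "example product"}]  # stand-in batch

# Build the newline-delimited bulk body: action line + source line per doc.
lines = []
for doc in docs:
    lines.append(json.dumps({"index": {"_index": INDEX, "_type": "product",
                                       "_id": doc.pop("_id")}}))
    lines.append(json.dumps(doc))
body = "\n".join(lines) + "\n"

resp = requests.post("%s/_bulk" % ES, data=body).json()

# Collect the per-item errors instead of only looking at the top-level flag.
failures = []
for item in resp.get("items", []):
    result = list(item.values())[0]   # keyed by action type, e.g. "index"
    if "error" in result:
        failures.append((result.get("_id"), result["error"]))

for doc_id, error in failures:
    print("%s: %s" % (doc_id, error))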

The next step was to reindex completely into a new index. Exceptions still
occur even though the new index has no replicas:

failure in bulk execution:
...
[2499]: index [products-20121023-120728], type [product], id [621248845],
message [UnavailableShardsException[[products-20121023-120728][1] [1]
shardIt, [0] active : Timeout waiting for [1m], request:
org.elasticsearch.action.bulk.BulkShardRequest@3d1c53df]]

I am at a loss as to how to proceed at this point. If even a new index is
failing, what are my options?

Cheers,

Ivan

--

If the index is in a green state, then it's strange that you would get unavailable shards… Try restarting the cluster? Note, I highly recommend upgrading to the latest 0.19.x; so many bugs have been fixed, important ones.

On Oct 23, 2012, at 9:40 PM, Ivan Brusic ivan@brusic.com wrote:


--

The issue might be a read-only filesystem on one of the machines. Will that
cause an UnavailableShardsException? I will update the thread once I have
more data.
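As a quick local check on each node, I am thinking of something along these
lines; the data path below is just our own layout, not anything standard:

# Sketch of a write probe run locally on each node to catch a filesystem
# that has been remounted read-only. The data path is an assumption about
# our own layout, not an Elasticsearch default.
import os
import tempfile

DATA_PATH = "/var/data/elasticsearch"   # placeholder for our data directory

def is_writable(path):
    """Try to create and remove a small temp file under the given path."""
    try:
        fd, name = tempfile.mkstemp(dir=path, prefix=".write-probe-")
        os.close(fd)
        os.remove(name)
        return True
    except OSError:
        return False

if __name__ == "__main__":
    if is_writable(DATA_PATH):
        print("%s is writable" % DATA_PATH)
    else:
        print("%s appears to be read-only (or otherwise not writable)" % DATA_PATH)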

The plan is to upgrade to 0.20.0 once it comes out. We do not want to
upgrade twice in a short time span.

Thanks,

Ivan

On Tue, Oct 23, 2012 at 4:06 PM, kimchy@gmail.com wrote:


--

Here is a recap:

All our nodes are VMs, so not only was the data directory affected by the
SAN failures, but the VMs themselves were as well. At one point, the nodes
started adding/removing each other as the VMs bounced around.

I finally discovered the issue once I checked (for the hundredth time) the
cluster health and noticed that number_of_nodes was 11 instead of 12. One
machine no longer formed part of the cluster, even though the service was
running and was visible over the network. Unfortunately, when several of
the VMs came back up, they had one or more filesystems marked as read-only.
The logs are not available on these nodes since they had no write access.
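A trivial poll along these lines, with a placeholder host and our expected
count of 12, would have flagged this much sooner:

# Sketch of a cluster-health poll that warns when the node count drops
# below what we expect. Host and expected count are placeholders.
import requests

ES = "http://localhost:9200"   # placeholder host
EXPECTED_NODES = 12            # what our cluster should report

health = requests.get("%s/_cluster/health" % ES).json()

print("status=%s nodes=%d" % (health["status"], health["number_of_nodes"]))

if health["number_of_nodes"] < EXPECTED_NODES:
    # A node can drop out while the cluster still reports green, as long as
    # all shards are fully allocated on the remaining nodes.
    print("WARNING: only %d of %d nodes are in the cluster"
          % (health["number_of_nodes"], EXPECTED_NODES))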

After shutting down the service on the offending node, indexing worked
fine. I am not sure at what point that server was removed from the cluster,
but it did contain data from all the indices that had issues, even the new
one. One weird issue was that the node contained data for shard numbers
that should not have existed. The cluster is still in development mode and
we are tweaking the number of shards. The index template defines
index.number_of_shards as 6, but the create index request defines the
number of shards as 4. Our indices have been working without issues using
the value (4) defined in the create index request. The node that was not
part of the cluster, however, had shards numbered 4 and 5, which would only
exist if the index had 6 shards, not 4. The other nodes correctly
identified the index as having 4 shards; only the disconnected node held
shards 4 and 5.
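To illustrate the mismatch, this is roughly what our template and create
request look like (names are placeholders); the explicit value in the
create request (4) is the one that should win over the template's 6:

# Sketch of the conflicting shard settings: an index template that says 6
# shards versus a create-index request that says 4.
import json
import requests

ES = "http://localhost:9200"   # placeholder host

# Index template applied to matching index names, with number_of_shards: 6.
template = {
    "template": "products-*",
    "settings": {"index.number_of_shards": 6},
}
requests.put("%s/_template/products_template" % ES, data=json.dumps(template))

# Create-index request that explicitly asks for 4 shards; this explicit
# setting is the one our healthy nodes honoured.
create = {
    "settings": {"index": {"number_of_shards": 4, "number_of_replicas": 0}},
}
requests.put("%s/products-20121023-120728" % ES, data=json.dumps(create))

# Verify which value the cluster actually recorded for the new index.
settings = requests.get("%s/products-20121023-120728/_settings" % ES).json()
print(json.dumps(settings, indent=2))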

Ultimately, the issues with the cluster were our own fault, stemming from
the read-only filesystem, but I would like to build a system that is more
fail-safe. Would a read-only filesystem result in an
UnavailableShardsException? Would a write failure appear in any of the
logs? Why did one node have a different number of shards? Do the state
files on disk contain any useful information?

Unfortunately, without complete logs, it is difficult to pinpoint the exact
state of the system. One of my tasks was to add a JMS appender to the
logging configuration, so I guess I should finally do it!

Cheers,

Ivan

On Tue, Oct 23, 2012 at 4:26 PM, Ivan Brusic ivan@brusic.com wrote:


--