Replicas won't allocate after master change (0.20.6)

I just filed https://github.com/elasticsearch/elasticsearch/issues/3017, where a 0.20.6 client connecting to a 0.20.6 master resulted in a java.io.StreamCorruptedException on the master, which in turn drove a re-election (that master had to be restarted to rejoin the cluster).

A side effect of the election seems to be that a large number of replicas became unallocated and have remained so for about 8 hours. This is an example of what showed up in the logs just after the re-election (it stopped shortly thereafter) -

[2013-05-09 11:34:52,232][WARN ][indices.cluster ] [ip-10-239-70-202] [profiles_0001][26] master [[ip-10-34-144-149][IsP0kjtRS6KJ-9R3hZehwQ][inet[/10.34.144.149:9300]]{data=false, master=true, zone=eu-west-1c}] marked shard as started, but shard have not been created, mark shard as failed
[2013-05-09 11:34:52,232][WARN ][cluster.action.shard ] [ip-10-239-70-202] sending failed shard for [profiles_0001][26], node[nUOPQBwwTdihgBPosOdbxA], [P], s[STARTED], reason [master [ip-10-34-144-149][IsP0kjtRS6KJ-9R3hZehwQ][inet[/10.34.144.149:9300]]{data=false, master=true, zone=eu-west-1c} marked shard as started, but shard have not been created, mark shard as failed]

Looking through the list suggests the message above has been associated with a split cluster in the past, but in this case the cluster didn't split (apart from the failing master having to be bounced to rejoin).

Some details -

  • Cluster is running with ec2 discovery
  • 6 data nodes, 3 master nodes
  • The masters are dedicated (node.master: true; node.data: false)
  • The data nodes are dedicated (node.master: false; node.data: true)
  • discovery.zen.minimum_master_nodes: 2

The cluster state is yellow, so the primaries are all placed and serviceable, but I would like to get the cluster back to green and am not sure how to proceed. What would be the right course of action to get the replicas allocated?
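
For reference, this is roughly how the yellow state and the unallocated replicas are being tracked (a sketch; localhost stands in for one of the nodes):

# Overall health: yellow means every primary is allocated but some replicas are not.
curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'

# Per-index and per-shard breakdown of active, initializing and unassigned copies.
curl -XGET 'http://localhost:9200/_cluster/health?level=shards&pretty=true'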

Bill

I took a look at one of the initializing_shards the cluster is reporting, which hasn't made any progress.

The one marked as recovering has the following _state -

{
"version" : 4,
"primary" : true
}

The one marked as the primary has the following _state -

{
"version" : 8,
"primary" : true
}

Is it expected that the recovering replica is marked as a primary? If not, is there a way to fix up the state declarations?

Bill

Hi Bill

Where do you see { "version": 8, "primary": true }? What API call?

clint

Hi Clinton,

It was in the _state file on disk for the shard in question. The other data, also marked as primary, was in the recovering replica's _state file.
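
For what it's worth, the same primary flag and per-copy state is also visible through the cluster state API, which might be the easier thing to cross-check; a rough sketch, with localhost standing in for a node:

# The routing_table section of the response lists every shard copy with the
# node it is assigned to, its "primary" flag, and its state
# (STARTED / INITIALIZING / UNASSIGNED).
curl -XGET 'http://localhost:9200/_cluster/state?pretty=true'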

The solution to unstick this was to drop the replica count to 0 for an index we can rebuild easily if needed (geonames); it had 2 replicas in recovery, along with another index (objects) that had one replica in recovery. After that, the 73 pending replicas were allocated in a few minutes. There wasn't much information I could discern in the logs to indicate whether the geonames allocations were stuck or were actually going to complete at some point (they are a few GB each but had been allocating for hours). A colleague observed that it would be useful to be able to prioritise allocations by index, or to kill allocations that are not making progress (I'm not sure if there's an admin command for either).
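
For the record, the replica count was dropped with the index update-settings API, roughly like this (the second call is just a sketch of putting the count back later; 2 is assumed to be the original setting):

# Drop replicas for the index stuck in recovery; the remaining pending
# replicas allocated a few minutes after this was applied.
curl -XPUT 'http://localhost:9200/geonames/_settings' -d '{
  "index" : { "number_of_replicas" : 0 }
}'

# Raise the count again later to rebuild the replicas.
curl -XPUT 'http://localhost:9200/geonames/_settings' -d '{
  "index" : { "number_of_replicas" : 2 }
}'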

The bug that drove all this (https://github.com/elasticsearch/elasticsearch/issues/3017, "Invalid internal transport message format between 0.20.6 master and client caused re-election") is still a mystery to us; we're going to have to see if we can replicate the behaviour by sending truncated join messages to a master. A zombie master caused by a codec failure is one thing, but the cascading re-allocation, and its failure to complete after the election, was unexpected (operationally it means a fraction of the index becomes unavailable for writes unless quorum is avoided).
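
To spell out the quorum point: with two replicas configured, the default quorum write consistency needs two active copies of a shard before a write is accepted, so shards left with only their primary start rejecting writes. The per-request consistency parameter is the escape hatch; a sketch, with the type and document values purely illustrative:

# Accept the write as long as the primary alone is active.
curl -XPUT 'http://localhost:9200/geonames/place/1?consistency=one' -d '{
  "name" : "example"
}'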

Bill
