Issues upgrading cluster from 0.19.8 to 0.19.10/11

This weekend I attempted to update my cluster from 0.19.8 to 0.19.11 but
ran into issues at the very end when updating the master server.

The issue I am seeing is that any new index created will not be assigned
to any servers. Any degraded index will not attempt to recover and will
stay in that state even once the restarted server comes back up.
I attempted a full cluster restart and flush, but it did not seem to help.
I do have some custom allocations in place, but I would not expect them to
cause problems.

I was able to temporarily resolve the issue by downgrading the master
servers to 0.19.8 (later upgraded to 0.19.9 successfully) while keeping the
storage nodes as 0.19.11.

A bit of additional information: I have nodes broken up with three
different tags: frontend, storage, and backup. By default, new indexes
are assigned to the frontend nodes.

_template/default:
{
  "default" : {
    "template" : "*",
    "order" : 0,
    "settings" : {
      "index.compress" : "true",
      "index.routing.allocation.exclude.tag" : "backup,storage",
      "index.number_of_shards" : "4",
      "index.routing.allocation.total_shards_per_node" : "4"
    },
    "mappings" : { }
  }
}

_cluster/settings:
{
  "persistent" : {
    "indices.recovery.max_size_per_sec" : "50mb",
    "cluster.routing.allocation.exclude.tag" : ""
  },
  "transient" : { }
}

When I upgrade the master servers I do not see anything strange in the
logs; everything comes up properly.
As soon as I restart one of the data nodes, the shards remain
unassigned, and there are no messages in the cluster log file showing
anything going on other than nodes joining and leaving.
When a master running 0.19.9 takes over, the shards immediately start
allocating properly.

I looked at the changelog for 0.19.10 but I do not see anything that jumps
out as a probable cause. Does anyone have any ideas about what may be
happening?

Also, I have attempted to force the index onto the frontend servers using
index.routing.allocation.include.tag, but it did not make a difference.
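For reference, the include attempt was along these lines (a sketch; the index name and localhost:9200 endpoint are placeholders for my setup):

```shell
# Try to pin an index to frontend-tagged nodes via the update settings API
# (index name is an example from my daily indexes)
curl -XPUT 'http://localhost:9200/infuseds-2012.08.24/_settings' -d '{
  "index.routing.allocation.include.tag" : "frontend"
}'
```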

--

I did a little more testing today since I did not get any responses on this.

DEBUG is showing messages like this:

[2012-11-15 10:51:24,806][DEBUG][cluster.routing.allocation] [Frontend 1]
[infuseds-2012.08.24][3] allocated on [[Storage
3][0gmkNk6HReiXV8udG7HBpQ][inet[/10.38.16.136:9300]]{tag=storage,
master=false}], but can no longer be allocated on it, moving...
[2012-11-15 10:51:24,807][DEBUG][cluster.routing.allocation] [Frontend 1]
[infuseds-2012.08.24][3] can't move

These messages only show up when the master node is running 0.19.10 or
later. As soon as I downgrade to 0.19.9, everything allocates properly. I
set up a test cluster and started adding the settings from the live cluster
one at a time until I was able to reproduce the error.

I was able to reproduce the error by setting
cluster.routing.allocation.exclude.tag to "" in either the persistent or
transient cluster config. Once that is set, I am unable to allocate any
indexes on the entire cluster. As soon as I change the exclude value to
something other than "", everything starts working again.
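The minimal reproduction on the test cluster looks like this (a sketch, assuming a local node on port 9200; on a 0.19.10+ master this is enough to block all allocation):

```shell
# Setting the exclude tag to an empty string triggers the "can't move" state
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "cluster.routing.allocation.exclude.tag" : ""
  }
}'
```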

Is there a way to completely remove the
cluster.routing.allocation.exclude.tag value from _cluster/settings? I see
that it is being held in the nodes/0/_state/global file but I would rather
not mess with the file directly.

--

Hi James,

this is unfortunate. To me, it seems you are running into an undefined
edge case where an exclude tag of "" can be interpreted either as a "null"
tag or as a defined tag of length 0, a semantic which may have been changed
recently. I wonder: does the cluster settings update API allow you to
set cluster.routing.allocation.exclude.tag to null (the JSON null) so it
vanishes?
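Something along these lines, perhaps (just a sketch; whether 0.19.x actually drops the key on a JSON null rather than persisting it is exactly what I am unsure about):

```shell
# Attempt to clear the setting by sending JSON null instead of ""
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "persistent" : {
    "cluster.routing.allocation.exclude.tag" : null
  }
}'
```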

Best regards,

Jörg

--

Is this addressed to me?

Sent from my iPhone

On Nov 18, 2012, at 6:51 PM, Jörg Prante joergprante@gmail.com wrote:


--