Disk Watermark issues with 1.4.0

Hi all,

I'm running 1.4.0 and using the default settings for:

cluster.routing.allocation.disk.watermark.low
and
cluster.routing.allocation.disk.watermark.high
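
For anyone following along, here's roughly how the current disk picture and those settings can be checked and adjusted at runtime; the host/port and the example values below are placeholders, not my actual setup:

```shell
# Per-node disk usage and shard counts, as the disk-based allocator sees them
curl -s 'http://localhost:9200/_cat/allocation?v'

# Temporarily raise both watermarks via the cluster settings API
# ("transient" means the override is lost on a full cluster restart)
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}'
```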

I hit an OOME which caused me to need to cycle a node, and then all shards
that should live on that node stayed unallocated once I brought it back up.

There was no notification anywhere that I had hit any disk space limits, at
least that I could find. I tried cycling again, nothing. It wasn't until
I tried to manually reroute one of the shards that I got an indication of
what was going on:

root@ip-10-0-0-45:bddevw07[1038]:~> ./reroute
{"error":"RemoteTransportException[[elasticsearch-ip-10-0-0-12][inet[/10.0.0.12:9300]][cluster:admin/reroute]];
nested: ElasticsearchIllegalArgumentException[[allocate] allocation of
[derbysoft-20141130][0] on node
[elasticsearch-ip-10-0-0-45][Li1yyXUHR8qQn6QHCSahCg][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}
is not allowed, reason: [YES(shard is not allocated to same node or
host)][YES(node passes include/exclude/require filters)][YES(primary is
already active)][YES(below shard recovery limit of [2])][YES(allocation
disabling is ignored)][YES(allocation disabling is ignored)][YES(no
allocation awareness enabled)][YES(total shard limit disabled: [-1] <=
0)][YES(target node version [1.4.0] is same or newer than source node
version [1.4.0])][NO(less than required [15.0%] free disk on node, free:
[15.0%])][YES(shard not primary or relocation disabled)]]; ","status":400}
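
In case it helps anyone else hitting this, the stuck shards can also be spotted without attempting a reroute, via the cat shards API (host/port assumed):

```shell
# List every shard with its state; unassigned shards show no node column
curl -s 'http://localhost:9200/_cat/shards?v' | grep -i unassigned
```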

Then I cleaned up some disk space, but there was no auto re-allocation
afterwards. When I again tried to manually re-route a shard, ALL of
them began rerouting.

My questions are:

  • Is there a notification log message somewhere that I missed that would
    have let me know what was going on? If not, there sure should be!
  • Should the shard allocation process have started automatically once I
    got the disk space issue resolved?

Thanks!
Chris

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAND3DphhmWn-amiDBrmYi4rB_tYZa7%3Dn2M9PF5jVY%3DfhPTqMpg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

On Mon, Dec 1, 2014 at 11:28 AM, Chris Neal chris.neal@derbysoft.net
wrote:

My questions are:

  • Is there a notification log message somewhere that I missed that
    would have let me know what was going on? If not, there sure should be!

A WARN log message, emitted every 30 seconds, was added in the very last release.

  • Should the shard allocation process have started automatically once
    I got the disk space issue resolved?

If you have unallocated shards it should kick in after a few seconds. It
takes a few seconds for the cluster to notice the change in disk free. If
there aren't unallocated shards I've sometimes found that I need to manually
shift a shard around to prime the pump. I'm not sure if that has been
fixed recently though.
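
By "prime the pump" I mean that, if I remember right, even an empty reroute request is enough: it makes the master re-run the allocation deciders without you having to name a shard or actually move anything (host/port assumed):

```shell
# Empty body: no explicit move/allocate commands, just forces the master
# to re-evaluate allocation across the cluster
curl -XPOST 'http://localhost:9200/_cluster/reroute'
```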

I don't think that disk space should prevent a shard from coming up on a
node that already has it though. I imagine that depends on how much data
has to be copied to that node but I'm not sure.

Nik


Thanks for the quick reply Nik :)

I've got updating to 1.4.1 on my TODO list for today, as I see there were
some updates in the Release notes pertaining to this as well. I might let
things fill up again in Dev and see what happens.

Maybe I wasn't patient enough for the rerouting to start on its own. It
seems like I waited several minutes before I did it manually, but I'll pay
more attention the next time.

Thanks again for the input.
Chris

