I'm running 1.4.0 and using the default settings for:
cluster.routing.allocation.disk.watermark.low
and
cluster.routing.allocation.disk.watermark.high
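For reference, the 1.4 defaults are low = 85% of disk used and high = 90%.
They can be inspected or overridden at runtime through the cluster settings
API; a minimal sketch, assuming Elasticsearch is listening on localhost:9200:

# Show any explicitly-set cluster settings (defaults are not echoed back):
curl -s 'localhost:9200/_cluster/settings?pretty'

# Temporarily relax the low watermark, e.g. while cleaning up disk; a
# transient setting does not survive a full cluster restart:
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%"
  }
}'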
I hit an OOME that forced me to cycle a node, and all shards that should
live on that node stayed unallocated once I brought it back up. There was
no notification anywhere that I had hit any disk space limits, at least
none that I could find. I tried cycling again; nothing. It wasn't until I
tried to manually reroute one of the shards that I got an indication of
what was going on:
root@ip-10-0-0-45:bddevw07[1038]:~> ./reroute
{"error":"RemoteTransportException[[elasticsearch-ip-10-0-0-12][inet[/10.0.0.12:9300]][cluster:admin/reroute]];
nested: ElasticsearchIllegalArgumentException[[allocate] allocation of
[derbysoft-20141130][0] on node
[elasticsearch-ip-10-0-0-45][Li1yyXUHR8qQn6QHCSahCg][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}
is not allowed, reason: [YES(shard is not allocated to same node or
host)][YES(node passes include/exclude/require filters)][YES(primary is
already active)][YES(below shard recovery limit of [2])][YES(allocation
disabling is ignored)][YES(allocation disabling is ignored)][YES(no
allocation awareness enabled)][YES(total shard limit disabled: [-1] <=
0)][YES(target node version [1.4.0] is same or newer than source node
version [1.4.0])][NO(less than required [15.0%] free disk on node, free:
[15.0%])][YES(shard not primary or relocation disabled)]]; ","status":400}
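The ./reroute script is a local wrapper whose contents aren't shown here; a
minimal equivalent of the allocate attempt above, using the index and node
names from the error message, would be something like:

# Sketch only; the actual script's contents aren't shown above.
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [
    {
      "allocate": {
        "index": "derbysoft-20141130",
        "shard": 0,
        "node": "elasticsearch-ip-10-0-0-45"
      }
    }
  ]
}'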
Then I cleaned up some disk space, but there was no automatic re-allocation
afterwards. Once I tried to manually re-route a shard again, ALL of them
began rerouting.
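As far as I can tell, the manual call is what forced the issue: a reroute
request, even one with an empty body, makes the master run a fresh
allocation pass. A minimal sketch, assuming ES on localhost:9200:

# No commands needed; an empty reroute just triggers an allocation pass.
curl -XPOST 'localhost:9200/_cluster/reroute'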
My questions are:
1. Is there a notification log message somewhere that I missed that would
   have let me know what was going on? If not, there sure should be!
2. Should the shard allocation process have started automatically once I
   got the disk space issue resolved?
On Mon, Dec 1, 2014 at 10:35 AM, Nikolas Everett nik9000@gmail.com wrote:
> Is there a notification log message somewhere that I missed that would
> have let me know what was going on? If not, there sure should be!
A WARN-level log message, repeated every 30 seconds, was added in the very
last release.
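On 1.4.0 itself, the closest workaround may be turning up allocation-decider
logging so the disk-threshold decisions land in the node logs. A sketch
only: it assumes the dynamic "logger." prefix accepted by the cluster
settings API in 1.x, and that cluster.routing.allocation.decider is the
logger covering DiskThresholdDecider:

# Assumptions: "logger." settings are dynamically updatable, and this
# package logger covers DiskThresholdDecider. Set back to INFO when done.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "logger.cluster.routing.allocation.decider": "DEBUG"
  }
}'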
> Should the shard allocation process have started automatically once I
> got the disk space issue resolved?
If you have unallocated shards it should kick in after a few seconds; it
takes that long for the cluster to notice the change in free disk. If
there aren't unallocated shards, I've sometimes found that I need to
manually shift a shard around to prime the pump (see the sketch below).
I'm not sure if that has been fixed recently though.
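"Shifting a shard around" means a move command against the reroute API; a
sketch with placeholder names (substitute a real index, shard number, and
node names):

# All names below are hypothetical placeholders.
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [
    {
      "move": {
        "index": "some-index",
        "shard": 0,
        "from_node": "node-a",
        "to_node": "node-b"
      }
    }
  ]
}'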
I don't think that disk space should prevent a shard from coming back up
on a node that already has a copy of it, though. I imagine that depends on
how much data has to be copied to that node, but I'm not sure.
Updating to 1.4.1 is on my TODO list for today; I see the release notes
mention some changes pertaining to this as well. I might let things fill
up again in Dev and see what happens.
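While it fills up, the per-node disk numbers are easy to watch with the cat
API; a minimal sketch, assuming ES on localhost:9200:

# Shows shard counts plus disk used/available per node, which is roughly
# what the watermark checks look at.
curl -s 'localhost:9200/_cat/allocation?v'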
Maybe I wasn't patient enough for the rerouting to start on its own. It
seems like I waited several minutes before I did it manually, but I'll pay
more attention the next time.
Thanks again for the input.
Chris