High disk watermark exceeded on one or more nodes, rerouting shards

thyfere · January 6, 2016, 12:36pm

I am running Elasticsearch and Kibana on Windows, using a Synology NAS as storage for Elasticsearch. For the past few days Elasticsearch has been behaving strangely, so I checked elasticsearch.log and found the following messages:

[WARN ][cluster.routing.allocation.decider] [Desmond Pitt] high disk watermark [0b] exceeded on [O2-Ef7fET9S_MJNAL-q_yA][Desmond Pitt] free: -1b[100%], shards will be relocated away from this node
[WARN ][cluster.routing.allocation.decider] [Desmond Pitt] high disk watermark [0b] exceeded on [O2-Ef7fET9S_MJNAL-q_yA][Desmond Pitt] free: -1b[100%], shards will be relocated away from this node

[INFO ][cluster.routing.allocation.decider] [Desmond Pitt] high disk watermark exceeded on one or more nodes, rerouting shards
[DEBUG][action.bulk ] [Desmond Pitt] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[DEBUG][action.bulk ] [Desmond Pitt] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]

What could be the issue?

bleskes · January 6, 2016, 8:08pm

The values of 0 bytes and -1b look very suspicious. Maybe the NAS file system is reporting the wrong disk space? When this happens, can you check what GET _nodes/stats returns in terms of disk space?
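
For anyone following along, here is a minimal sketch of requesting those figures from Python rather than from a shell. It assumes the node listens on http://localhost:9200 (an assumption, adjust as needed) and uses the `fs` metric filter of the node stats API. The function names are made up for illustration.

```python
import json
import urllib.request


def nodes_fs_stats_url(base_url="http://localhost:9200"):
    """Build the node stats URL restricted to the file system (fs) metric."""
    return base_url.rstrip("/") + "/_nodes/stats/fs"


def fetch_fs_stats(base_url="http://localhost:9200", timeout=10):
    """Send the GET request and return the decoded JSON response."""
    with urllib.request.urlopen(nodes_fs_stats_url(base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage, against a running node:
#   stats = fetch_fs_stats()
#   for node_id, node in stats["nodes"].items():
#       print(node_id, node.get("name"), node.get("fs", {}).get("total"))
```

On Windows the same URL can also simply be opened in a browser, since it is a plain GET request.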

thyfere · January 10, 2016, 4:58am

Hi Bleskes,

I am running Elasticsearch on a Windows box. Where can I run this command?

thyfere · January 10, 2016, 5:04am

All right, I managed to run the command. Here is the result:

thyfere · January 10, 2016, 5:04am

Cont.....

:{"threads":0,"queue":0,"active":0,"rejected":0,"largest":0,"completed":0},
"management":{"threads":5,"queue":4,"active":5,"rejected":0,"largest":5,"completed":4141058},
"get":{"threads":2,"queue":0,"active":0,"rejected":0,"largest":2,"completed":2},
"merge":{"threads":1,"queue":0,"active":0,"rejected":0,"largest":1,"completed":17427},
"bulk":{"threads":2,"queue":0,"active":0,"rejected":0,"largest":2,"completed":432749},
"snapshot":{"threads":0,"queue":0,"active":0,"rejected":0,"largest":0,"completed":0}},
"network":{"tcp":{"active_opens":7278,"passive_opens":7572,"curr_estab":69,"in_segs":291127065,"out_segs":282045676,"retrans_segs":1269920,"estab_resets":2582,"attempt_fails":123,"in_errs":1,"out_rsts":3383}},
"fs":{"timestamp":1452402078885,"total":{},"data":[{"path":"\\tamuq-synology1.qatar.tamu.edu\syslog\elasticsearch\nodes\0"}]},
"transport":{"server_open":13,"rx_count":6,"rx_size_in_bytes":1464,"tx_count":6,"tx_size_in_bytes":1464},
"http":{"current_open":2,"total_opened":11},
"breakers":{"fielddata":{"limit_size_in_bytes":1922275737,"limit_size":"1.7gb","estimated_size_in_bytes":449197812,"estimated_size":"428.3mb","overhead":1.03,"tripped":0},
"request":{"limit_size_in_bytes":1281517158,"limit_size":"1.1gb","estimated_size_in_bytes":16440,"estimated_size":"16kb","overhead":1.0,"tripped":0},
"parent":{"limit_size_in_bytes":2242655027,"limit_size":"2gb","estimated_size_in_bytes":449214252,"estimated_size":"428.4mb","overhead":1.0,"tripped":0}}}}}

bleskes · January 11, 2016, 9:26am

This is indeed what I suspected: the file system fails to report disk usage (the "total" object under "fs" is empty), which confuses the high watermark check. Can you open an issue about this on GitHub? We should not reroute but rather just log a warning, IMO.
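
As a rough illustration of that diagnosis, the sketch below checks the "fs" section of a node stats response for the byte counters that are normally present under "total". The helper name and the placeholder path are hypothetical. The three field names are the ones the node stats API usually reports, so treat the list as an assumption for other versions.

```python
# Byte counters normally found under fs.total in a node stats response.
EXPECTED_FS_FIELDS = ("total_in_bytes", "free_in_bytes", "available_in_bytes")


def missing_fs_fields(fs_section):
    """Return the expected fs.total fields that are absent from the response."""
    total = fs_section.get("total") or {}
    return [field for field in EXPECTED_FS_FIELDS if field not in total]


# The fs section as posted in this thread; the UNC path is replaced by a placeholder.
sample_fs = {
    "timestamp": 1452402078885,
    "total": {},
    "data": [{"path": "<UNC path omitted>"}],
}

print(missing_fs_fields(sample_fs))
# prints ['total_in_bytes', 'free_in_bytes', 'available_in_bytes']
```

An empty result would mean the file system reported its sizes. Here all three counters are missing, which matches the 0b and -1b values in the log.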

thyfere · January 11, 2016, 11:04am

Thanks Bleskes,

Just to clarify: the NAS failed to report disk usage? I have already opened an issue on GitHub, but in your experience, what could be the cause?

thyfere · January 11, 2016, 2:58pm

So, I opened an issue there and here is their reply:

They say it's an Elasticsearch issue.

spalger · January 11, 2016, 3:52pm

I think @bleskes meant to ask that you file an issue on the Elasticsearch issue tracker.

bleskes · January 11, 2016, 6:21pm

Sorry. I don’t know what the NAS fails…

thyfere · January 13, 2016, 6:41am

Hi Bleskes,

What is your experience with attaching a NAS to Elasticsearch? Does it work normally, or are there any hiccups compared to a SAN LUN?

bleskes · January 14, 2016, 5:14pm

I cannot help with comparing one NAS to another. I can say that using a NAS with ES at all typically leads to poor performance and problems. Remember that ES already keeps two copies of your data (assuming the default of one replica per shard), so NAS-based redundancy is typically not needed.

thyfere · January 20, 2016, 1:00pm

Hi Again,

As per https://github.com/elastic/elasticsearch/issues/16082, I am going to ask all relevant questions here from now on.

How can I turn off the cluster routing allocation disk threshold? Second, if it is enabled and I am getting the "high disk watermark exceeded on one or more nodes, rerouting shards" warning, will it stop logs from being written to the shared drive?

I have also started getting "observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]". What is this?

bleskes · January 20, 2016, 1:55pm

How can I turn off cluster routing allocation disk threshold?

Like so:

curl -XPUT "http://localhost:9200/_cluster/settings" -d'
{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation.disk.threshold_enabled": false
      }
    }
  }
}'

Regarding your second question: I am not sure what you mean, to be honest. Which logs do you mean?

Regarding the timeout message: I need to know where it comes from to say. What is the first part of that line?

thyfere · January 21, 2016, 6:54am

When I run it, it throws the following error:

{"error":"JsonParseException[Unexpected character (''' (code 39)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')\n at [Source: [B@70a96f8c; line: 1, column: 2]]","status":500}
curl: (6) Could not resolve host: persistent
curl: (3) [globbing] unmatched brace in column 1
curl: (6) Could not resolve host: cluster
curl: (3) [globbing] unmatched brace in column 1
curl: (6) Could not resolve host: routing
curl: (3) [globbing] unmatched brace in column 1
curl: (6) Could not resolve host: allocation.disk.threshold_enabled
curl: (6) Could not resolve host: false
curl: (3) [globbing] unmatched close brace/bracket in column 1
curl: (3) [globbing] unmatched close brace/bracket in column 1
curl: (3) [globbing] unmatched close brace/bracket in column 1

bleskes · January 21, 2016, 7:40am

I think something went wrong with the copy-paste from here. Also, I forgot that you use Windows; the command I gave you is for Linux. You will have to send a PUT request to the URL I specified (replacing the hostname and port if needed). The body of the request should be the JSON object, from the opening curly brace to the closing one.
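
One way to send that PUT request without fighting shell quoting on Windows is a short Python script. This is only a sketch under stated assumptions: the node is taken to be reachable at http://localhost:9200, the function name is hypothetical, and the body uses the flat form of the setting named earlier in the thread.

```python
import json
import urllib.request


def build_disable_threshold_request(base_url="http://localhost:9200"):
    """Build a PUT request that turns off the disk allocation threshold."""
    body = {
        "persistent": {
            "cluster.routing.allocation.disk.threshold_enabled": False
        }
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),  # JSON body, no shell quoting involved
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# To actually send it against a running node:
#   with urllib.request.urlopen(build_disable_threshold_request()) as resp:
#       print(resp.read().decode("utf-8"))
```

Because the JSON is built inside the script, the single-quote and brace handling that differs between Linux shells and the Windows command prompt never comes into play.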

thyfere · January 21, 2016, 8:41am

So now I ran the following command:

curl -put localhost:9200/_cluster/settings -d '{"persistent" : {"cluster.routing.allocation.disk.threshold_enabled" : false}}'

I still got this error:

{"error":"InvalidIndexNameException[[_cluster] Invalid index name [_cluster], must not start with '_']","status":400}

Not Found

HTTP Error 404. The requested resource is not found.

curl: (3) [globbing] unmatched brace in column 1
curl: (6) Could not resolve host: cluster.routing.allocation.disk.threshold_enabled

Not Found

Not Found

HTTP Error 404. The requested resource is not found.

curl: (6) Could not resolve host: false
curl: (3) [globbing] unmatched close brace/bracket in column 1
curl: (3) [globbing] unmatched close brace/bracket in column 1
