High disk watermark exceeded on one or more nodes, rerouting shards


(Thy Fere) #1

I am running Elasticsearch and Kibana on Windows, using a Synology NAS as storage for Elasticsearch. For the past few days Elasticsearch has been behaving oddly, so I checked elasticsearch.log and found the following errors:

[WARN ][cluster.routing.allocation.decider] [Desmond Pitt] high disk watermark [0b] exceeded on [O2-Ef7fET9S_MJNAL-q_yA][Desmond Pitt] free: -1b[100%], shards will be relocated away from this node
[WARN ][cluster.routing.allocation.decider] [Desmond Pitt] high disk watermark [0b] exceeded on [O2-Ef7fET9S_MJNAL-q_yA][Desmond Pitt] free: -1b[100%], shards will be relocated away from this node

[INFO ][cluster.routing.allocation.decider] [Desmond Pitt] high disk watermark exceeded on one or more nodes, rerouting shards
[DEBUG][action.bulk ] [Desmond Pitt] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[DEBUG][action.bulk ] [Desmond Pitt] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]

What could be the issue?


(Boaz Leskes) #2

The values of 0 bytes and -1b look very suspicious. Maybe the NAS file system is reporting the wrong disk space? When this happens, can you check what GET _nodes/stats returns in terms of disk space?
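For anyone without curl at hand (on Windows, for example), here is a minimal sketch of one way to pull only the file system section of the node stats. It assumes Python 3 is installed and that the node answers on http://localhost:9200; both are assumptions, and the function names are made up for illustration.

```python
import json
from urllib.request import urlopen


def fs_stats_url(base_url="http://localhost:9200"):
    """Build the URL for the fs-only node stats (base_url is an assumption)."""
    return base_url.rstrip("/") + "/_nodes/stats/fs"


def fetch_fs_stats(base_url="http://localhost:9200", timeout=10):
    """Return {node_name: fs_section} for every node in the response."""
    with urlopen(fs_stats_url(base_url), timeout=timeout) as resp:
        stats = json.loads(resp.read().decode("utf-8"))
    # Each entry under "nodes" is keyed by node id and carries "name" and "fs".
    return {node["name"]: node.get("fs", {}) for node in stats["nodes"].values()}
```

Calling `fetch_fs_stats()` against a running node should return one entry per node; the part of interest is whether each `fs` section contains any byte counts at all.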


(Thy Fere) #3

Hi Bleskes,

I am running Elasticsearch on a Windows box. Where can I run this command?


(Thy Fere) #4

All right, I managed to run the command. Here is the result:

{"cluster_name":"elasticsearch","nodes":{"GSqQreOURqmt7tlHf-S2GA":{"timestamp":1
452402067987,"name":"Elektra","transport_address":"inet[/192.195.88.229:9300]","
host":"tamuq-syslog","ip":["inet[/192.195.88.229:9300]","NONE"],"indices":{"docs
":{"count":191625251,"deleted":0},"store":{"size_in_bytes":191255919588,"throttl
e_time_in_millis":1055954},"indexing":{"index_total":5854665,"index_time_in_mill
is":5035658,"index_current":4,"delete_total":0,"delete_time_in_millis":0,"delete_
current":0,"noop_update_total":0,"is_throttled":false,"throttle_time_in_millis"
:0},"get":{"total":2,"time_in_millis":38,"exists_total":2,"exists_time_in_millis
":38,"missing_total":0,"missing_time_in_millis":0,"current":0},"search":{"open_c
ontexts":0,"query_total":435,"query_time_in_millis":191327,"query_current":0,"fe
tch_total":6,"fetch_time_in_millis":8589,"fetch_current":0},"merges":{"current":
1,"current_docs":4505,"current_size_in_bytes":5058938,"total":17410,"total_time_
in_millis":16608728,"total_docs":51877270,"total_size_in_bytes":58559988274},"re
fresh":{"total":159830,"total_time_in_millis":25448949},"flush":{"total":1548,"t
otal_time_in_millis":353974},"warmer":{"current":0,"total":334721,"total_time_in_
millis":1725614},"filter_cache":{"memory_size_in_bytes":36040,"evictions":0},"i
d_cache":{"memory_size_in_bytes":0},"fielddata":{"memory_size_in_bytes":44919781
2,"evictions":0},"percolate":{"total":0,"time_in_millis":0,"current":0,"memory_s
ize_in_bytes":-1,"memory_size":"-1b","queries":0},"completion":{"size_in_bytes":
0},"segments":{"count":6836,"memory_in_bytes":950875584,"index_writer_memory_in_
bytes":580836,"index_writer_max_memory_in_bytes":829966741,"version_map_memory_i
n_bytes":12800,"fixed_bit_set_memory_in_bytes":0},"translog":{"operations":332,"
size_in_bytes":17},"suggest":{"total":0,"time_in_millis":0,"current":0},"query_c
ache":{"memory_size_in_bytes":0,"evictions":0,"hit_count":0,"miss_count":0},"rec
overy":{"current_as_source":0,"current_as_target":0,"throttle_time_in_millis":0}
},"os":{"timestamp":1452402078885,"uptime_in_millis":341884,"cpu":{"sys":19,"use
r":28,"idle":51,"usage":47,"stolen":0},"mem":{"free_in_bytes":10495774720,"used_
in_bytes":6683623424,"free_percent":62,"used_percent":37,"actual_free_in_bytes":
10709585920,"actual_used_in_bytes":6469812224},"swap":{"used_in_bytes":615192985
6,"free_in_bytes":13577605120}},"process":{"timestamp":1452402078885,"open_file_
descriptors":15814,"cpu":{"percent":0,"sys_in_millis":141210,"user_in_millis":29
5573,"total_in_millis":436783},"mem":{"resident_in_bytes":270995456,"share_in_by
tes":-1,"total_virtual_in_bytes":2431987712}},"jvm":{"timestamp":1452402078885,"
uptime_in_millis":262283078,"mem":{"heap_used_in_bytes":3067520784,"heap_used_pe
rcent":95,"heap_committed_in_bytes":3203792896,"heap_max_in_bytes":3203792896,"n
on_heap_used_in_bytes":91917160,"non_heap_committed_in_bytes":93749248,"pools":{
"young":{"used_in_bytes":54359600,"max_in_bytes":139591680,"peak_used_in_bytes":
139591680,"peak_max_in_bytes":139591680},"survivor":{"used_in_bytes":8907216,"ma
x_in_bytes":17432576,"peak_used_in_bytes":17432576,"peak_max_in_bytes":17432576}
,"old":{"used_in_bytes":3004253968,"max_in_bytes":3046768640,"peak_used_in_bytes
":3013743144,"peak_max_in_bytes":3046768640}}},"threads":{"count":49,"peak_count
":52},"gc":{"collectors":{"young":{"collection_count":19821,"collection_time_in_
millis":400711},"old":{"collection_count":26661,"collection_time_in_millis":1280
905}}},"buffer_pools":{"direct":{"count":61,"used_in_bytes":6511873,"total_capac
ity_in_bytes":6511873},"mapped":{"count":14055,"used_in_bytes":190789203170,"tot
al_capacity_in_bytes":190789203170}}},"thread_pool":{"percolate":{"threads":0,"q
ueue":0,"active":0,"rejected":0,"largest":0,"completed":0},"fetch_shard_started"
:{"threads":1,"queue":0,"active":0,"rejected":0,"largest":4,"completed":685},"li
stener":{"threads":1,"queue":0,"active":0,"rejected":0,"largest":1,"completed":3
570},"index":{"threads":0,"queue":0,"active":0,"rejected":0,"largest":0,"complet
ed":0},"refresh":{"threads":1,"queue":0,"active":1,"rejected":0,"largest":1,"com
pleted":159332},"suggest":{"threads":0,"queue":0,"active":0,"rejected":0,"larges
t":0,"completed":0},"generic":{"threads":1,"queue":0,"active":0,"rejected":0,"la
rgest":6,"completed":32671},"warmer":{"threads":1,"queue":0,"active":0,"rejected
":0,"largest":1,"completed":174966},"search":{"threads":4,"queue":0,"active":0,"
rejected":0,"largest":4,"completed":443},"flush":{"threads":1,"queue":0,"active"
:0,"rejected":0,"largest":1,"completed":161214},"optimize":{"threads":0,"queue":
0,"active":0,"rejected":0,"largest":0,"completed":0},"fetch_shard_store"


(Thy Fere) #5

Cont.....

:{"threa
ds":0,"queue":0,"active":0,"rejected":0,"largest":0,"completed":0},"management":
{"threads":5,"queue":4,"active":5,"rejected":0,"largest":5,"completed":4141058},
"get":{"threads":2,"queue":0,"active":0,"rejected":0,"largest":2,"completed":2},
"merge":{"threads":1,"queue":0,"active":0,"rejected":0,"largest":1,"completed":1
7427},"bulk":{"threads":2,"queue":0,"active":0,"rejected":0,"largest":2,"complet
ed":432749},"snapshot":{"threads":0,"queue":0,"active":0,"rejected":0,"largest":
0,"completed":0}},"network":{"tcp":{"active_opens":7278,"passive_opens":7572,"cu
rr_estab":69,"in_segs":291127065,"out_segs":282045676,"retrans_segs":1269920,"es
tab_resets":2582,"attempt_fails":123,"in_errs":1,"out_rsts":3383}},"fs":{"timest
amp":1452402078885,"total":{},"data":[{"path":"\\\\tamuq-synology1.qatar.tamu.ed
u\\syslog\\elasticsearch\\nodes\\0"}]},"transport":{"server_open":13,"rx_count":
6,"rx_size_in_bytes":1464,"tx_count":6,"tx_size_in_bytes":1464},"http":{"current
_open":2,"total_opened":11},"breakers":{"fielddata":{"limit_size_in_bytes":19222
75737,"limit_size":"1.7gb","estimated_size_in_bytes":449197812,"estimated_size":
"428.3mb","overhead":1.03,"tripped":0},"request":{"limit_size_in_bytes":12815171
58,"limit_size":"1.1gb","estimated_size_in_bytes":16440,"estimated_size":"16kb",
"overhead":1.0,"tripped":0},"parent":{"limit_size_in_bytes":2242655027,"limit_si
ze":"2gb","estimated_size_in_bytes":449214252,"estimated_size":"428.4mb","overhe
ad":1.0,"tripped":0}}}}}


(Boaz Leskes) #6

This is indeed what I suspected: the file system fails to report disk usage (the fs section above has an empty "total" and no size figures for the data path), which confuses the high watermark check. Can you open an issue about this on GitHub? We should not reroute but rather just log a warning, IMO.
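To make the symptom concrete, here is a minimal sketch of a check for an fs section that carries no size figures. The function name is hypothetical, and the byte counts in the second example are made up purely to show the shape of a section that does report sizes; the first example is the fs section as posted in this thread.

```python
def disk_space_reported(fs):
    """True if the fs section carries total and available byte counts."""
    total = fs.get("total", {})
    return "total_in_bytes" in total and "available_in_bytes" in total


# fs section as posted in this thread: "total" is empty, the path has no sizes.
fs_from_thread = {
    "timestamp": 1452402078885,
    "total": {},
    "data": [{"path": r"\\tamuq-synology1.qatar.tamu.edu\syslog\elasticsearch\nodes\0"}],
}

# Shape of a section that does report sizes (numbers are invented).
fs_with_sizes = {
    "total": {"total_in_bytes": 1000, "free_in_bytes": 400, "available_in_bytes": 400},
}

print(disk_space_reported(fs_from_thread))  # False
print(disk_space_reported(fs_with_sizes))   # True
```

With no byte counts to work from, a free-space comparison has nothing meaningful to compare, which would be consistent with the 0b and -1b values in the log.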


(Thy Fere) #7

Thanks Bleskes,

Just to clarify: the NAS failed to report disk usage? I have already opened an issue on GitHub, but in your experience, what could be the cause?


(Thy Fere) #8

So, I opened an issue there and here is their reply:

They say it's an Elasticsearch issue.


(Spencer Alger) #9

I think @bleskes meant to ask that you file an issue on the Elasticsearch issue tracker.


(Boaz Leskes) #10

Sorry. I don’t know what the NAS fails…


(Thy Fere) #11

Hi Bleskes,

What's your experience with attaching a NAS to Elasticsearch? Does it work normally, or are there any hiccups compared to a SAN LUN?


(Boaz Leskes) #12

I cannot help with comparing one NAS to another. I can say that using a NAS with ES at all typically leads to poor performance and problems. Remember that ES already keeps two copies of your data, so NAS-based redundancy is typically not needed.


(Thy Fere) #13

Hi Again,

As per https://github.com/elastic/elasticsearch/issues/16082, I am going to ask all relevant questions here from now on.

How can I turn off the cluster routing allocation disk threshold? Second, if it is enabled and I am getting the "high disk watermark exceeded on one or more nodes, rerouting shards" warning, will it stop logs from being written to the shared drive?

I am now also starting to get "observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]". What is this?


(Boaz Leskes) #14

How can I turn off cluster routing allocation disk threshold?

Like so:

curl -XPUT "http://localhost:9200/_cluster/settings" -d'
{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation.disk.threshold_enabled": false
      }
    }
  }
}'

As for your second question: I'm not sure what you mean, to be honest. Which logs do you mean?

As for the timeout notification: I need to know where it comes from to say. What is the first part of that line?


(Thy Fere) #15

When I run it, it throws the following error:

{"error":"JsonParseException[Unexpected character (''' (code 39)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')\n at [Source: [B@70a96f8c; line: 1, column: 2]]","status":500}
curl: (6) Could not resolve host: persistent
curl: (3) [globbing] unmatched brace in column 1
curl: (6) Could not resolve host: cluster
curl: (3) [globbing] unmatched brace in column 1
curl: (6) Could not resolve host: routing
curl: (3) [globbing] unmatched brace in column 1
curl: (6) Could not resolve host: allocation.disk.threshold_enabled
curl: (6) Could not resolve host: false
curl: (3) [globbing] unmatched close brace/bracket in column 1
curl: (3) [globbing] unmatched close brace/bracket in column 1
curl: (3) [globbing] unmatched close brace/bracket in column 1


(Boaz Leskes) #16

I think something went wrong with the copy-paste from here. Also, I forgot that you use Windows; the command I gave you is for Linux. You will have to send a PUT request to the URL I specified (replacing the hostname and port if needed). The body of the request should be the JSON object, outer curly braces included.
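One shell-independent way to send such a request is sketched below. It is only a sketch under assumptions: Python 3 is available, the node answers on http://localhost:9200, and the function names are invented for illustration. Building the body with `json.dumps` sidesteps the shell quoting differences between Linux and Windows.

```python
import json
from urllib.request import Request, urlopen


def build_disable_threshold_request(base_url="http://localhost:9200"):
    """Build (but do not send) the PUT that turns the disk threshold off."""
    body = {"persistent": {"cluster.routing.allocation.disk.threshold_enabled": False}}
    return Request(
        base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def disable_disk_threshold(base_url="http://localhost:9200", timeout=10):
    """Send the request and return the cluster's JSON reply as a dict."""
    with urlopen(build_disable_threshold_request(base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `disable_disk_threshold()` sends the request; the reply can then be inspected to confirm the setting was accepted.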


(Thy Fere) #17

So now I ran the following command:

curl -put localhost:9200/_cluster/settings -d '{"persistent" : {"cluster.routing.allocation.disk.threshold_enabled" : false}}'

I still got this error:

{"error":"InvalidIndexNameException[[_cluster] Invalid index name [_cluster], must not start with '_']","status":400}

Not Found
HTTP Error 404. The requested resource is not found.
curl: (3) [globbing] unmatched brace in column 1
curl: (6) Could not resolve host: cluster.routing.allocation.disk.threshold_enabled
Not Found
HTTP Error 404. The requested resource is not found.
curl: (6) Could not resolve host: false
curl: (3) [globbing] unmatched close brace/bracket in column 1
curl: (3) [globbing] unmatched close brace/bracket in column 1
