Low disk watermark [15%] exceeded on


(Sunil Chaudhari) #1

Hi,
I have a 3-node cluster in our environment. Each node has a data.path with 70GB of space available.
Still, ES is showing "low disk watermark [15%] exceeded on".
Can anybody explain why that is?

br,
Sunil Chaudhari.


(David Pilato) #2

You have less than 15% of the total disk space remaining free.

You can change this setting to an absolute value or change the percentage.

Look at https://www.elastic.co/guide/en/elasticsearch/reference/current/disk.html
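
As a rough illustration of the percentage check described above, the Python sketch below computes how much of a disk is still free and compares it with the 15% figure from the log message. The function names and the sample disk sizes are made up for illustration; this is not Elasticsearch's actual code.

```python
GB = 1024 ** 3


def free_percent(total_bytes: int, available_bytes: int) -> float:
    """Return the share of the disk that is still free, in percent."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * available_bytes / total_bytes


def below_low_watermark(total_bytes: int, available_bytes: int,
                        min_free_percent: float = 15.0) -> bool:
    """True when less than min_free_percent of the disk is free."""
    return free_percent(total_bytes, available_bytes) < min_free_percent


# Hypothetical disks, not taken from this thread.
print(below_low_watermark(100 * GB, 10 * GB))  # only 10% of the disk is free
print(below_low_watermark(100 * GB, 40 * GB))  # 40% of the disk is free
```

The check works on the filesystem that Elasticsearch reports for the node, so the numbers to plug in are the ones the master collected, which may differ from what you expect from df.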


(Sunil Chaudhari) #3

Hi,
But look at the output below: /elasticsearch is the partition where the data files are located, and it is only 20% used.

# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_main-lv_root
                      9.6G  3.7G  5.5G  41% /
tmpfs                 3.9G     0  3.9G   0% /dev/shm
/dev/mapper/vg_main-lv_backup
                      4.8G  9.9M  4.5G   1% /backup
/dev/sda1             488M   61M  403M  14% /boot
/dev/mapper/vg_main-lv_home
                      7.6G  733M  6.5G  10% /home
/dev/mapper/vg_main-lv_log
                      9.6G   23M  9.1G   1% /log
/dev/mapper/vg_main-lv_tmp
                      4.8G   11M  4.5G   1% /tmp
/dev/mapper/vg_main-lv_var
                      4.8G  346M  4.2G   8% /var
/dev/mapper/vg_main-lv_varlog
                      4.8G  142M  4.4G   4% /var/log
/dev/mapper/vg_main-lv_varlogaudit
                      4.8G   36M  4.5G   1% /var/log/audit
/dev/mapper/vg_data-lv_elasticsearch
                       79G   15G   61G  20% /elasticsearch

(David Pilato) #4

Interesting. @dakrone do you have an idea?


(Lee Hinman) #5

Can you enable TRACE logging for the cluster package on the master node for a little bit? It will log all of the collected disk stats about each of the nodes.

You should be able to with:

PUT /_cluster/settings
{
  "transient": {
    "logger.cluster": "TRACE"
  }
}
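
If you would rather send that request from a script, a minimal Python sketch could look like the following. The address http://localhost:9200 and the helper name build_settings_request are assumptions for illustration; the sketch only builds the request and leaves sending it to you.

```python
import json
import urllib.request

# Assumed address of a node's HTTP port; adjust for your cluster.
ES_URL = "http://localhost:9200"


def build_settings_request(settings: dict,
                           base_url: str = ES_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT /_cluster/settings request
    that wraps the given settings in a "transient" block."""
    body = json.dumps({"transient": settings}).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/_cluster/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = build_settings_request({"logger.cluster": "TRACE"})
# To apply it against a running cluster: urllib.request.urlopen(req)
```

Transient settings do not survive a full cluster restart, which suits a temporary logging change like this one.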

(Lee Hinman) #6

Also, can you collect the output of df -h on all of the data nodes so I can correlate the reported vs actual disk?


(Mark Walkom) #7

Also, what version are you on?


(Sunil Chaudhari) #8

Hi @dakrone,
Do I need to restart ES after enabling TRACE log via PUT command?


(Mark Walkom) #9

You do not.


(Sunil Chaudhari) #10

Hi @warkolm, @dakrone,
Below is the consolidated information from my cluster.
ES version: 1.5.2
3 nodes on multiple hosts, as listed below.

  1. "sit-0" master-true data-true --> index.number of shards 3 and replicas -1
  2. "sit-1" master-false data-true --> index.number of shards 3 and replicas -1
  3. "sit-2" master-false data-true --> index.number of shards 3 and replicas -1

df -h on sit-0

Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_main-lv_root
                      9.6G  3.7G  5.5G  41% /
tmpfs                 3.9G     0  3.9G   0% /dev/shm
/dev/mapper/vg_main-lv_backup
                      4.8G  9.9M  4.5G   1% /backup
/dev/sda1             488M   61M  403M  14% /boot
/dev/mapper/vg_main-lv_home
                      7.6G  733M  6.5G  10% /home
/dev/mapper/vg_main-lv_log
                      9.6G   23M  9.1G   1% /log
/dev/mapper/vg_main-lv_tmp
                      4.8G   11M  4.5G   1% /tmp
/dev/mapper/vg_main-lv_var
                      4.8G  346M  4.2G   8% /var
/dev/mapper/vg_main-lv_varlog
                      4.8G  623M  3.9G  14% /var/log
/dev/mapper/vg_main-lv_varlogaudit
                      4.8G   36M  4.5G   1% /var/log/audit
/dev/mapper/vg_data-lv_elasticsearch
                       79G   15G   61G  20% /elasticsearch

df -h on sit-1

Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_main-lv_root
                       99G   32G   62G  34% /
tmpfs                 3.9G     0  3.9G   0% /dev/shm
/dev/mapper/vg_main-lv_backup
                      4.8G  9.9M  4.5G   1% /backup
/dev/sda1             488M   61M  402M  14% /boot
/dev/mapper/vg_main-lv_home
                      7.6G  488M  6.8G   7% /home
/dev/mapper/vg_main-lv_log
                      9.6G   23M  9.1G   1% /log
/dev/mapper/vg_main-lv_tmp
                      4.8G  9.9M  4.5G   1% /tmp
/dev/mapper/vg_main-lv_var
                      4.8G  343M  4.2G   8% /var
/dev/mapper/vg_main-lv_varlog
                      4.8G   40M  4.5G   1% /var/log
/dev/mapper/vg_main-lv_varlogaudit
                      4.8G   39M  4.5G   1% /var/log/audit
/dev/mapper/vg_data-lv_elasticsearch
                       79G   56M   75G   1% /elasticsearch

df -h on sit-2

Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_main-lv_root
                       99G   20G   74G  22% /
tmpfs                 3.9G     0  3.9G   0% /dev/shm
/dev/mapper/vg_main-lv_backup
                      4.8G  9.9M  4.5G   1% /backup
/dev/sda1             488M   61M  403M  14% /boot
/dev/mapper/vg_main-lv_home
                      7.6G  255M  7.0G   4% /home
/dev/mapper/vg_main-lv_log
                      9.6G   23M  9.1G   1% /log
/dev/mapper/vg_main-lv_tmp
                      4.8G   11M  4.5G   1% /tmp
/dev/mapper/vg_main-lv_var
                      4.8G  344M  4.2G   8% /var
/dev/mapper/vg_main-lv_varlog
                      4.8G   40M  4.5G   1% /var/log
/dev/mapper/vg_main-lv_varlogaudit
                      4.8G   39M  4.5G   1% /var/log/audit
/dev/mapper/vg_data-lv_elasticsearch
                       79G  8.9G   66G  12% /elasticsearch

A few TRACE logs:

[WARN ][cluster.routing.allocation.decider] [sit-master-data-node-0] After allocating, node [fmJY4Z4ISjmSEX8jbdsJ7A] would have less than the required 5gb free bytes threshold (4428105937 bytes free), preventing allocation
[INFO ][cluster] [sit-master-data-node-0] updating [cluster.info.update.interval] from [1m] to [1m]
[INFO ][cluster.routing.allocation.decider] [sit-master-data-node-0] updating [cluster.routing.allocation.disk.watermark.low] to [80%]
[INFO ][cluster.routing.allocation.decider] [sit-master-data-node-0] updating [cluster.routing.allocation.disk.watermark.high] to [5gb]
[TRACE][cluster.service] ack received from node [[sit-master-data-node-0][oL29yf7LQI2pxFJy09sYhg][hostname.xyz.fi][inet[/xx.xxx.xx.xx:9300]]{master=true}], cluster_state update (version: 1695)
[TRACE][cluster.service  ] all expected nodes acknowledged cluster_state update (version: 1695)
[DEBUG][cluster.service  ] [sit-master-data-node-0] processing [cluster_update_settings]: done applying updated cluster_state (version: 1695)
[DEBUG][cluster.service ] [sit-master-data-node-0] processing [reroute_after_cluster_update_settings]: execute
[TRACE][cluster.routing.allocation.decider] [sit-master-data-node-0] Can not allocate [[servicepoint-2015-10-13][3], node[null], [R], s[UNASSIGNED]] on node [fmJY4Z4ISjmSEX8jbdsJ7A] due to [ReplicaAfterPrimaryActiveAllocationDecider]

I hope I have given full information.


(Lee Hinman) #11

This is missing the logging from the master node. You should see messages produced by this logging statement:

logger.trace("node: [{}], most available: total disk: {}, available disk: {} / least available: total disk: {}, available disk: {}", nodeId, mostAvailablePath.getTotal(), leastAvailablePath.getAvailable(), leastAvailablePath.getTotal(), leastAvailablePath.getAvailable());

Do you have those logs on the master node?


(Sunil Chaudhari) #12

Hi,
I have given a few logs below.

[INFO ][cluster.service          ] [sit-master-data-node-0] added {[sit-data-node-1][1WGmqNYBS4SJZUatz-3HTg][lus00080.lij.fi][inet[/xx.xxx.x.xx:9300]]{master=false},}, reason: zen-disco-receive(join from node[[sit-data-node-1][1WGmqNYBS4SJZUatz-3HTg][lus00080.lij.fi][inet[/xx.xxx.x.xx::9300]]{master=false}])
[DEBUG][cluster.service          ] [sit-master-data-node-0] publishing cluster state version 3167
[DEBUG][cluster.service          ] [sit-master-data-node-0] set local cluster state to version 3167
[DEBUG][cluster                  ] [sit-master-data-node-0] data node was added, retrieving new cluster info
[TRACE][cluster                  ] [sit-master-data-node-0] Performing ClusterInfoUpdateJob
[DEBUG][cluster.service          ] [sit-master-data-node-0] processing [zen-disco-receive(join from node[[sit-data-node-1][1WGmqNYBS4SJZUatz-3HTg][lus00080.lij.fi][inet[/xx.xxx.x.xx::9300]]{master=false}])]: done applying updated cluster_state (version: 3167)
[TRACE][cluster                  ] [sit-master-data-node-0] node: [1WGmqNYBS4SJZUatz-3HTg], total disk: 5051023360, available disk: 4428247040
[TRACE][cluster                  ] [sit-master-data-node-0] node: [oL29yf7LQI2pxFJy09sYhg], total disk: 84413169664, available disk: 64780533760
[TRACE][cluster                  ] [sit-master-data-node-0] shard: [.kibana][0][p] size: 15846
[TRACE][cluster                  ] [sit-master-data-node-0] shard: [ces-2015-10-14][0][p] size: 103966
[TRACE][cluster                  ] [sit-master-data-node-0] shard: [ces-2015-10-15][0][p] size: 58566
[TRACE][cluster                  ] [sit-master-data-node-0] shard: [ces-2015-10-17][0][p] size: 29547

[TRACE][cluster.routing.allocation.allocator] [sit-master-data-node-0] Try relocating shard for index index [sales-2015-10-29] from node [oL29yf7LQI2pxFJy09sYhg] to node [EDtnrBZGROiV8TJ00I4wwA]
[TRACE][cluster.routing.allocation.decider] [sit-master-data-node-0] usage without relocations: [EDtnrBZGROiV8TJ00I4wwA][sit-data-node-1] free: 4.1gb[87.6%]
[TRACE][cluster.routing.allocation.decider] [sit-master-data-node-0] usage with relocations: [0 bytes] [EDtnrBZGROiV8TJ00I4wwA][sit-data-node-1] free: 4.1gb[87.6%]
[TRACE][cluster.routing.allocation.decider] [sit-master-data-node-0] Node [EDtnrBZGROiV8TJ00I4wwA] has 87.67037529598754% free disk
[WARN ][cluster.routing.allocation.decider] [sit-master-data-node-0] After allocating, node [EDtnrBZGROiV8TJ00I4wwA] would have less than the required 5gb free bytes threshold (4426862320 bytes free), preventing allocation
[TRACE][cluster.routing.allocation.decider] [sit-master-data-node-0] Can not allocate [[sales-2015-10-29][0], node[oL29yf7LQI2pxFJy09sYhg], [R], s[STARTED]] on node [EDtnrBZGROiV8TJ00I4wwA] due to [DiskThresholdDecider]
[TRACE][cluster.routing.allocation.decider] [sit-master-data-node-0] usage without relocations: [EDtnrBZGROiV8TJ00I4wwA][sit-data-node-1] free: 4.1gb[87.6%]
[TRACE][cluster.routing.allocation.decider] [sit-master-data-node-0] usage with relocations: [0 bytes] [EDtnrBZGROiV8TJ00I4wwA][sit-data-node-1] free: 4.1gb[87.6%]
[TRACE][cluster.routing.allocation.decider] [sit-master-data-node-0] Node [EDtnrBZGROiV8TJ00I4wwA] has 87.67037529598754% free disk
[WARN ][cluster.routing.allocation.decider] [sit-master-data-node-0] After allocating, node [EDtnrBZGROiV8TJ00I4wwA] would have less than the required 5gb free bytes threshold (4426932652 bytes free), preventing allocation
[TRACE][cluster.routing.allocation.decider] [sit-master-data-node-0] Can not allocate [[sales-2015-10-29][2], node[oL29yf7LQI2pxFJy09sYhg], [R], s[STARTED]] on node [EDtnrBZGROiV8TJ00I4wwA] due to [DiskThresholdDecider]
[TRACE][cluster.routing.allocation.decider] [sit-master-data-node-0] usage without relocations: [EDtnrBZGROiV8TJ00I4wwA][sit-data-node-1] free: 4.1gb[87.6%]
[TRACE][cluster.routing.allocation.decider] [sit-master-data-node-0] usage with relocations: [0 bytes] [EDtnrBZGROiV8TJ00I4wwA][sit-data-node-1] free: 4.1gb[87.6%]
[TRACE][cluster.routing.allocation.decider] [sit-master-data-node-0] Node [EDtnrBZGROiV8TJ00I4wwA] has 87.67037529598754% free disk


(Lee Hinman) #13

Okay, it looks like it collected information about 2 of the nodes:

[TRACE][cluster                  ] [sit-master-data-node-0] node: [1WGmqNYBS4SJZUatz-3HTg], total disk: 5051023360, available disk: 4428247040
[TRACE][cluster                  ] [sit-master-data-node-0] node: [oL29yf7LQI2pxFJy09sYhg], total disk: 84413169664, available disk: 64780533760

However, the EDtnrBZGROiV8TJ00I4wwA node is the one actually having an allocation problem. See:

[TRACE][cluster.routing.allocation.allocator] [sit-master-data-node-0] Try relocating shard for index index [sales-2015-10-29] from node [oL29yf7LQI2pxFJy09sYhg] to node [EDtnrBZGROiV8TJ00I4wwA]
[TRACE][cluster.routing.allocation.decider] [sit-master-data-node-0] usage without relocations: [EDtnrBZGROiV8TJ00I4wwA][sit-data-node-1] free: 4.1gb[87.6%]
[TRACE][cluster.routing.allocation.decider] [sit-master-data-node-0] usage with relocations: [0 bytes] [EDtnrBZGROiV8TJ00I4wwA][sit-data-node-1] free: 4.1gb[87.6%]
[TRACE][cluster.routing.allocation.decider] [sit-master-data-node-0] Node [EDtnrBZGROiV8TJ00I4wwA] has 87.67037529598754% free disk
[WARN ][cluster.routing.allocation.decider] [sit-master-data-node-0] After allocating, node [EDtnrBZGROiV8TJ00I4wwA] would have less than the required 5gb free bytes threshold (4426862320 bytes free), preventing allocation

EDtnrBZGROiV8TJ00I4wwA has 4.1gb of free disk and the limit has been set to 5gb, so it cannot allocate the shard there.

It should have calculated the amount of space for this node as well. Do you have a log line that looks like:

[TRACE][cluster ] [sit-master-data-node-0] node: [EDtnrBZGROiV8TJ00I4wwA], total disk: NNNNNNN, available disk: MMMMMMM

Where NNNNNNN and MMMMMMM are numbers?
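
To make the arithmetic behind these log lines concrete, here is a small Python sketch that repeats the byte-threshold comparison using the total and available figures from the TRACE line for node 1WGmqNYBS4SJZUatz-3HTg. It assumes that "5gb" means 5 × 1024³ bytes, and the check_disk helper is hypothetical; it only mirrors the comparison described above, not Elasticsearch's real decider.

```python
GIB = 1024 ** 3


def check_disk(total_bytes: int, available_bytes: int, min_free_bytes: int) -> dict:
    """Summarise one node's disk stats against an absolute free-bytes threshold."""
    return {
        "free_gib": available_bytes / GIB,
        "free_percent": 100.0 * available_bytes / total_bytes,
        # Allocation is refused when free space is under the threshold.
        "blocked": available_bytes < min_free_bytes,
        # True when the disk is too small to ever satisfy the threshold.
        "threshold_exceeds_disk": min_free_bytes > total_bytes,
    }


# Figures from the TRACE line for node 1WGmqNYBS4SJZUatz-3HTg;
# the 5 GiB threshold is an assumed reading of the "5gb" setting.
result = check_disk(5051023360, 4428247040, 5 * GIB)
print(result)
```

With these inputs the disk is almost 88% free yet still under the 5gb threshold, because the reported total is smaller than the threshold itself. That reported total is also far smaller than the 79G /elasticsearch volume shown in the df output, so it is worth checking which path the node is actually reporting.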

