Cluster down on 2-node server: high disk watermark

Hi All,
I have a two-node cluster that suddenly went down. The ES version is 5.6; each node has 16 GB of RAM with an 8 GB heap. I am sharing my node1 log, please help me find a solution.

[2020-08-30T00:30:09,917][WARN ][o.e.c.r.a.DiskThresholdMonitor] [my_prodnode1] high disk watermark [90%] exceeded on [JCHa7NT_TBuEGi-5Sy7cpQ][my_prodnode2][/var/lib/elasticsearch/nodes/0] free: 204kb[0%], shards will be relocated away from this node

[2020-08-29T10:00:05,125][INFO ][o.e.m.j.JvmGcMonitorService] [my_prodnode1] [gc][3106046] overhead, spent [268ms] collecting in the last [1s]
[2020-08-29T10:00:06,126][INFO ][o.e.m.j.JvmGcMonitorService] [my_prodnode1] [gc][3106047] overhead, spent [274ms] collecting in the last [1s]

[2020-08-29T20:29:01,104][DEBUG][o.e.a.b.TransportShardBulkAction] [my_prodnode1] [myregindex1][0] failed to execute bulk item (update) BulkShardRequest [[myregindex1][0]] containing [org.elasticsearch.action.update.UpdateRequest@5f12f396]
org.elasticsearch.index.engine.DocumentMissingException: [znl][E692DFD2-6CB1-4DF6-91E0-82E50325B31B]: document missing

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_241]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_241]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_241]


[2020-08-29T23:50:05,006][WARN ][o.e.c.a.s.ShardStateAction] [my_prodnode1] [esendzlist][1] received shard failed for shard id [[esendzlist][1]], allocation id [DJWCJzJZQ6esChwDe6zZFA], primary term [0], message [shard failure, reason [refresh failed]], failure [IOException[No space left on device]]


[2020-08-29T23:50:12,282][DEBUG][o.e.a.b.TransportShardBulkAction] [my_prodnode1] [esendzlist][1] failed to execute bulk item (index) BulkShardRequest [[esendzlist][1]] containing [index {[esendzlist][ezlist][80_ESEND], source[n/a, actual length: [2.3mb], max length: 2kb]}]
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [zlist]
        at org.elasticsearch.index.mapper.FieldMapper.parse(FieldMapper.java:298) ~[elasticsearch-5.6.16.jar:5.6.16]


[2020-08-30T01:03:18,216][WARN ][o.e.c.r.a.DiskThresholdMonitor] [my_prodnode1] high disk watermark [90%] exceeded on [JCHa7NT_TBuEGi-5Sy7cpQ][my_prodnode2][/var/lib/elasticsearch/nodes/0] free: 192kb[0%], shards will be relocated away from this node

[2020-08-30T01:04:01,405][WARN ][o.e.i.e.Engine           ] [my_prodnode1] [sdkversion][1] failed engine [merge failed]
org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device



[2020-08-30T02:11:40,451][INFO ][o.e.m.j.JvmGcMonitorService] [my_prodnode1] [gc][22] overhead, spent [325ms] collecting in the last [1s]
[2020-08-30T02:11:50,232][DEBUG][o.e.a.s.TransportSearchAction] [my_prodnode1] All shards failed for phase: [query]
[2020-08-30T02:11:50,233][WARN ][r.suppressed             ] path: /myregindex1/znl/_search, params: {index=myregindex1, type=znl}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed

Sure looks like you are out of disk space ... only 192KB.

Thanks Steve, but the hard disk has enough space. Are you talking about RAM? Currently RAM is 16 GB and the heap is 8 GB, so do I need to increase RAM?
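For what it's worth, the high disk watermark is about free disk space on the data path, not RAM or heap. A quick way to compare the two on each node (the path is the 5.x DEB/RPM default; adjust if yours differs):

```shell
# Default path.data for DEB/RPM installs of Elasticsearch 5.x; adjust if
# your install differs.
DATA_PATH=/var/lib/elasticsearch

# Disk space on the data path is what the watermark monitors
# (fall back to / if that path does not exist on this machine).
df -h "$DATA_PATH" 2>/dev/null || df -h /

# RAM and heap are a separate resource entirely; the disk watermark
# never looks at them.
free -h
```

So a 16 GB RAM node can still trip the watermark if the filesystem holding the data path is full.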

Elasticsearch is definitely seeing that there's not much space left.

What does df -h show?

Hi warkolm, please find the `df -h` output for both nodes below.

**Node 1:**
Filesystem                 Size  Used Avail Use% Mounted on
udev                       7.9G     0  7.9G   0% /dev
tmpfs                      1.6G  169M  1.4G  11% /run
/dev/sda4                   52G  2.0G   48G   4% /
tmpfs                      7.9G     0  7.9G   0% /dev/shm
tmpfs                      5.0M     0  5.0M   0% /run/lock
tmpfs                      7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/sda3                  465M   50M  387M  12% /boot
/dev/sda1                   19G  783M   17G   5% /var
/dev/mapper/vg_u01-lv_u01  492G  8.7G  458G   2% /u01
/dev/sdc                   2.0T   18G  1.9T   1% /nfs
tmpfs                      1.6G     0  1.6G   0% /run/user/1002
tmpfs                      1.6G     0  1.6G   0% /run/user/1003

**Node 2:**
Filesystem                 Size  Used Avail Use% Mounted on
udev                       7.9G     0  7.9G   0% /dev
tmpfs                      1.6G  169M  1.4G  11% /run
/dev/sda4                   52G  2.0G   48G   4% /
tmpfs                      7.9G     0  7.9G   0% /dev/shm
tmpfs                      5.0M     0  5.0M   0% /run/lock
tmpfs                      7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/sda1                   19G  724M   17G   5% /var
/dev/sda3                  465M   50M  387M  12% /boot
/dev/mapper/vg_u01-lv_u01  492G  8.8G  458G   2% /u01
10.201.201.63:/nfs         2.0T   18G  1.9T   1% /nfs
tmpfs                      1.6G     0  1.6G   0% /run/user/1002
tmpfs                      1.6G     0  1.6G   0% /run/user/1003


Did you change the default path.data?
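If `path.data` is unchanged, it may be worth asking the cluster what *it* sees rather than the OS. A sketch, assuming a node still answers on `localhost:9200` (adjust host/port to your setup):

```shell
# Which data paths each node is actually using, and the free space
# Elasticsearch itself reports on them.
curl -s 'localhost:9200/_nodes/stats/fs?pretty'

# Per-node disk usage as the shard allocator sees it.
curl -s 'localhost:9200/_cat/allocation?v' || true
```

If the paths or free-space figures here disagree with `df -h`, that points at a quota, mount, or permission problem rather than a genuinely full disk.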

I am also getting:

[2020-08-31T05:02:26,732][INFO ][o.e.m.j.JvmGcMonitorService] [my_prodnode2] [gc][97025] overhead, spent [323ms] collecting in the last [1s]
[2020-08-31T05:02:27,743][INFO ][o.e.m.j.JvmGcMonitorService] [my_prodnode2] [gc][97026] overhead, spent [260ms] collecting in the last [1s]
[2020-08-31T05:02:29,756][INFO ][o.e.m.j.JvmGcMonitorService] [my_prodnode2] [gc][97028] overhead, spent [322ms] collecting in the last [1s]
[2020-08-31T05:02:30,756][INFO ][o.e.m.j.JvmGcMonitorService] [my_prodnode2] [gc][97029] overhead, spent [331ms] collecting in the last [1s]
[2020-08-31T05:02:31,759][INFO ][o.e.m.j.JvmGcMonitorService] [my_prodnode2] [gc][97030] overhead, spent [329ms] collecting in the last [1s]
[2020-08-31T05:02:32,769][INFO ][o.e.m.j.JvmGcMonitorService] [my_prodnode2] [gc][97031] overhead, spent [320ms] collecting in the last [1s]

That is fine, as it says it's an info, not a warn.

No, it's the default setting. The NFS mount is used for taking snapshots. Is the server reporting about RAM or about hard disk space?

Hi,

You are running out of disk space. Fix this issue first.
No space left on device

Regards

Dominique
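If the data path really is full, the usual first aid is to free space, or to temporarily relax the watermarks while you investigate. A sketch, assuming the cluster still responds on `localhost:9200`; revert the transient settings once the disk problem is fixed:

```shell
# Temporarily raise the disk watermarks (5.x setting names) so allocation
# can recover while space is freed on the data path.
curl -s -XPUT 'localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "95%",
    "cluster.routing.allocation.disk.watermark.high": "97%"
  }
}' || true
```

This only buys time; with 204 KB free the node will hit "No space left on device" again almost immediately unless something is deleted or moved.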

Yeah, but his `df -h` shows space, which is VERY weird.

It could be some temp issue, but your /tmp has space; in fact you have space all over.

Suggest sudo-ing to the ES user and seeing what it can see. It could be a quota or some other weird permission issue, or maybe you are on Docker or have an unusual disk setup, even NFS; but you are mounted from /dev/sda3, so it's very odd. I wonder if that's a SAN device or something.

Suggest making SURE you know your data path and that it has space from that user's perspective.
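As a concrete sketch of that sudo check (user and path are the 5.x package defaults; `-n` just makes sudo fail instead of prompting):

```shell
# View the data path as the elasticsearch user -- a quota or permission
# problem can hide space that root sees.
sudo -n -u elasticsearch df -h /var/lib/elasticsearch 2>/dev/null || true
sudo -n -u elasticsearch ls -ld /var/lib/elasticsearch/nodes/0 2>/dev/null || true

# If quota tools are installed, check for a per-user quota on the mount.
quota -u elasticsearch 2>/dev/null || true
```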

Thanks Steve_Mushero for your kind attention.

I am not using Docker, and the space is already there. I think this space information is coming from RAM. Am I right?

No, it's disk, not RAM. Suggest trying some of the suggestions above, like sudo-ing to the ES user and carefully finding the paths it's using, etc. All weird.