Of late, I've started seeing these errors in my ES 6.8.6 cluster. I've had this cluster since March, but these errors started appearing only recently:
[2020-12-14T00:00:04,122][WARN ][o.e.c.r.a.AllocationService] [node1.foo.bar.com] failing shard [failed shard, shard [.kibana_7], node[9b3APiVrTliXlxUA4RR3Rg], [R], s[STARTED], a[id=EwFxAm38RkiMwEgbV7bGaA], message [failed to perform indices:data/write/bulk[s] on replica [.kibana_7], node[9b3APiVrTliXlxUA4RR3Rg], [R], s[STARTED], a[id=EwFxAm38RkiMwEgbV7bGaA]], failure [RemoteTransportException[[node1.foo.bar.com][22.214.171.124:9300][indices:data/write/bulk[s][r]]]; nested: AlreadyClosedException[Underlying file changed by an external force at 2020-12-10T05:58:11Z, (lock=NativeFSLock(path=/data/disk1/data/nodes/0/indices/46Z0vkIURCCtFLIKx3aHow/0/index/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2020-12-10T05:58:11.385213Z))]; ], markAsStale [true]] org.elasticsearch.transport.RemoteTransportException: [node1.foo.bar.com][126.96.36.199:9300][indices:data/write/bulk[s][r]]
This happens about once a week, sometimes more often. I initially suspected the Qualys scan agent and had it disabled, but the error still appears. The cluster goes into yellow state and usually recovers on its own without me having to restart it. Sometimes, though, I do have to restart the cluster, or else close and re-open the affected index.
Can anyone shed light on what could be wrong? Is there a way to find out which process is modifying the files? I suspect it could be antivirus software, since the UUID-named folders might look suspicious to it. But how can I identify the process that modifies them?
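One thing I'm considering to answer that question myself: putting a Linux auditd file watch on the lock file, so the next time something touches it the kernel records which process did it. A minimal sketch of the rule, assuming auditd is installed; the key name es-writelock is just a label I picked, and the path is the one from the log above:

```
# /etc/audit/rules.d/es-writelock.rules
# Record every write (w) and attribute change (a) to the Lucene lock file,
# tagged with the key "es-writelock" so it can be searched later.
-w /data/disk1/data/nodes/0/indices/46Z0vkIURCCtFLIKx3aHow/0/index/write.lock -p wa -k es-writelock
```

After loading the rules (augenrules --load, or a restart of auditd), running ausearch -k es-writelock -i when the error next fires should show the exe and pid of whatever modified the file.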
[firstname.lastname@example.org]# ll -lrt
total 212
-rw-r--r--. 1 elasticsearch elasticsearch 0 Dec 14 00:00 write.lock
ES cluster: 6.8.6, self-managed.
Data nodes: 55 GB RAM, 8 TB SSDs, 16 cores.
Total: 10 data nodes and 3 master nodes.