Underlying file changed by an external force

Hi Team,

Lately, I've started seeing these errors in my ES 6.8.6 cluster. I've had this cluster since March, but these errors started appearing only recently:

[2020-12-14T00:00:04,122][WARN ][o.e.c.r.a.AllocationService] [node1.foo.bar.com] 
failing shard [failed shard, shard [.kibana_7][0], node[9b3APiVrTliXlxUA4RR3Rg], [R],
s[STARTED], a[id=EwFxAm38RkiMwEgbV7bGaA], message [failed to perform
indices:data/write/bulk[s] on replica [.kibana_7][0], node[9b3APiVrTliXlxUA4RR3Rg], [R],
s[STARTED], a[id=EwFxAm38RkiMwEgbV7bGaA]], failure 
nested: AlreadyClosedException[Underlying file changed by an external force at 2020-12-
exclusive valid],creationTime=2020-12-10T05:58:11.385213Z))]; ], markAsStale [true]]
org.elasticsearch.transport.RemoteTransportException: [node1.foo.bar.com][][indices:data/write/bulk[s][r]]

This happens about once a week, sometimes more often. I initially suspected the Qualys scan agent and disabled it, but the error still appears. The cluster goes into a yellow state but usually recovers on its own without me having to restart it. Sometimes, though, I do have to restart the cluster, or else close and re-open the affected index.

Can anyone shed light on what could be wrong? Is there a way to know which process is modifying the files? I suspect it could be an anti-virus, since folders named with UUIDs might look suspicious to it. But how can I find out which process modifies them?

[root@node1.foo.bar.com]# ll -lrt
total 212
-rw-r--r--. 1 elasticsearch elasticsearch     0 Dec 14 00:00 write.lock

ES cluster: 6.8.6, self-managed.
Data nodes: 55 GB RAM, 8 TB SSDs, 16 cores each.

10 data nodes and 3 master nodes in total.

It's definitely something other than Elasticsearch meddling with Elasticsearch's data. It could well be something like an antivirus program. Pinning down the specific process that's causing your problems is tricky, however, particularly if you've tried disabling the suspects without success. You could try running lsof in a loop in the hope of catching another process looking at Elasticsearch's files.
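A minimal sketch of that lsof loop might look like the following. The data path, log file, and polling interval are assumptions for illustration; adjust them to your environment. The idea is to record any process other than the elasticsearch user that has files open under the data directory:

```shell
#!/bin/sh
# Poll for non-Elasticsearch processes holding files under the ES data path.
# DATA_DIR and LOG are assumptions -- adjust for your setup.
DATA_DIR=/data/disk1/data/nodes/0/indices
LOG=/var/log/es-file-watch.log

while true; do
  # lsof +D recurses into the directory; column 3 of its output is the
  # owning user, so drop the header row and anything owned by elasticsearch.
  hits=$(lsof +D "$DATA_DIR" 2>/dev/null | awk 'NR > 1 && $3 != "elasticsearch"')
  if [ -n "$hits" ]; then
    { date; printf '%s\n' "$hits"; } >> "$LOG"
  fi
  sleep 5
done
```

Note that `lsof +D` on a large shard directory can be slow, and a 5-second poll can still miss a short-lived scanner process, which is why an auditd rule (as in the follow-up below) is the more reliable net.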


Thanks David. I set up the following audit rule:

auditctl -a always,exit -F dir=/data/disk1/data/nodes/0/indices -F perm=wa -F uid!=elasticsearch -k mykey

This monitors any file changes (perm=wa: writes and attribute changes) in the directory /data/disk1/data/nodes/0/indices and its sub-directories, excluding changes made by the elasticsearch user.

I then set up a cron job that checks whether /sbin/ausearch -i --input-logs -k mykey has any output and, if so, triggers an email alert.
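For reference, a sketch of that cron helper, assuming the job runs every 10 minutes so `-ts recent` (ausearch's built-in "last 10 minutes" window) covers the gap between runs. The alert address is a placeholder, and `mail` is assumed to be configured on the host:

```shell
#!/bin/sh
# Cron job: alert if the audit key recorded any non-elasticsearch writes.
# ALERT_TO is a placeholder address -- change it for your environment.
KEY=mykey
ALERT_TO=ops@example.com

# -ts recent limits the search to the last 10 minutes, matching a
# */10 cron schedule so events are reported at most once.
hits=$(/sbin/ausearch -i --input-logs -k "$KEY" -ts recent 2>/dev/null)
if [ -n "$hits" ]; then
  printf '%s\n' "$hits" \
    | mail -s "Non-elasticsearch write under ES indices dir" "$ALERT_TO"
fi
```

With a crontab entry such as `*/10 * * * * /usr/local/sbin/es-audit-alert.sh`, any hit on the audit key produces an email naming the offending syscall, process, and uid.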

Let me know if you have any other suggestions.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.