Underlying file changed by an external force

Hello,

For some time now, the following warning has been appearing at certain intervals in the Elasticsearch logs in our environments. Usually it has no effect, but sometimes not all shards are assigned and a retry is needed.

[2022-11-24T00:04:38,264][WARN ][o.e.i.e.Engine           ] [server-md] [logs-2022-11-23][0]failed engine [lucene commit failed]
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2022-11-22T23:00:13Z, (lock=NativeFSLock(path=/path/elastic/var/data/storage/md/nodes/0/indices/Zz9LjRRcRhqSBFv7Cc2k2Q/0/index/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2022-11-22T23:00:13.384888Z))
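
For context: this message comes from Lucene's NativeFSLockFactory, which records the lock file's creation time when the lock is taken and re-reads the file attributes before every commit, requiring an exact match. Below is only a simplified sketch of that check, not the actual Lucene source (class name and path are made up). Note the "Birth: -" in the stat output further down: without birth-time support, Java's creationTime() appears to fall back to the modification time, so a rewritten mtime is enough to trip the check.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;

public class WriteLockCheckSketch {
    public static void main(String[] args) throws Exception {
        Path lock = Paths.get("write.lock"); // hypothetical scratch file, not a live index
        if (Files.notExists(lock)) {
            Files.createFile(lock);
        }

        // recorded once, when the lock is acquired
        FileTime recorded = Files.readAttributes(lock, BasicFileAttributes.class).creationTime();

        // re-read before every commit; FileTime.equals is exact to the nanosecond
        FileTime current = Files.readAttributes(lock, BasicFileAttributes.class).creationTime();

        if (!current.equals(recorded)) {
            // corresponds to the AlreadyClosedException in the log above
            throw new IllegalStateException("Underlying file changed by an external force at " + current);
        }
        System.out.println("lock unchanged: " + current);
    }
}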

At first we thought that another process was touching the Elasticsearch indices. We checked with auditctl whether other users were accessing the indices, but that was not the case.
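
For reference, the audit watch was along these lines (the long index path is shortened here, as in the stat output below); ausearch then names every process that writes to the lock file or changes its attributes:

auditctl -w /path/elastic/var/data/storage/md/nodes/0/indices/…/0/index/write.lock -p wa -k es_write_lock
ausearch -k es_write_lock -i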
Afterwards we noticed that the sub-second part of the timestamps (atime and mtime) is always exactly zero.

stat  /path/elastic/var/data/storage/md/nodes/0/indices/…/0/index/write.lock
  File: '/path/elastic/var/data/storage/md/nodes/0/indices/…/0/index/write.lock'
  Size: 0              Blocks: 0          IO Block: 4096   regular empty file
Device: fe05h/65029d     Inode: 71303504    Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1216/    elastic)   Gid: ( 1216/    elastic)
Access: 2022-11-30 00:00:09.000000000 +0100
Modify: 2022-11-30 00:00:09.000000000 +0100
Change: 2022-11-30 22:01:07.559373558 +0100
 Birth: -

Even a single case would be strange, but this occurs with 3202 files.

ls -aRl --time-style=full-iso | grep -c 00000
3202
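
Grepping for 00000 is admittedly loose (the pattern could in principle match elsewhere in the ls output), so here is a stricter count that only looks at the fractional part of the mtime, assuming GNU find:

find /path/elastic/var/data/storage -type f -printf '%T@\n' | grep -c '\.0*$'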

However, if you run a touch command on the same system, the timestamps are set correctly, nanoseconds included.

stat touch_file
  File: 'touch_file' 
  Size: 0              Blocks: 0          IO Block: 4096   regular empty file
Device: fe04h/65028d     Inode: 1460        Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 8057/   user)   Gid: (  100/   users)
Access: 2022-12-01 08:07:31.168801039 +0100
Modify: 2022-12-01 08:07:31.168801039 +0100
Change: 2022-12-01 08:07:31.168801039 +0100
 Birth: -

So the question is what causes the missing sub-second precision. This is most likely the cause of the error, since the comparison can no longer succeed: it must fail because the creationTime appears to lie in the future, as the fractional seconds are still present in the lock's creationTime but no longer on the file.

creationTime=2022-11-29T23:00:09.595968Z > Underlying file changed by an external force at 2022-11-29T23:00:09Z
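
For illustration, the same truncation can be reproduced on a scratch file (GNU coreutils assumed; the file name is arbitrary), since stat -c %Y yields whole epoch seconds only:

touch scratch_file
stat -c %y scratch_file                                # full nanosecond mtime
touch -d @"$(stat -c %Y scratch_file)" scratch_file    # rewrite mtime from whole seconds
stat -c %y scratch_file                                # fractional part now all zeros

Presumably whatever external tool touches the index files rewrites the timestamps through a one-second-resolution interface in just this way.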

We are running SLES12-SP5/20220823 with XFS as the file system, and the following Elasticsearch version:

   "version": {
        "number": "7.10.2",
         .....
        "lucene_version": "8.7.0",
        "minimum_wire_compatibility_version": "6.8.0",
        "minimum_index_compatibility_version": "6.0.0-beta1"
}

The directory is mounted with the following parameters:

/dev/mapper/vg00-path on /path type xfs (rw,noatime,attr2,inode64,noquota)

Do you have any idea what could be responsible for this error?

Thanks in advance!

We've seen this a bunch of times before and it's always some other process manipulating the file metadata and truncating the time to the nearest second (and maybe modifying the file in other ways too, which is why Elasticsearch shuts the shard down).

If you watch the write.lock files over time you will see that they start out with some fractional part in their timestamps, and at some point it becomes all zeroes. It's not Elasticsearch doing that, but you'll need to work with your sysadmin folks to track down the actual culprit.
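
A low-effort way to do that watching, assuming inotify-tools is installed (path shortened as in the stat output above): attribute changes, including timestamp rewrites, show up as ATTRIB events, which at least tells you when it happens; an auditd watch on the same path can then tell you who.

inotifywait -m -e attrib --timefmt '%F %T' --format '%T %w%f %e' \
    /path/elastic/var/data/storage/md/nodes/0/indices/…/0/index/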


Thanks for the answer.

What were the typical processes that manipulated the files in the other cases?

It varies quite widely and could be anything, but often it's something like a backup tool or a security scanner. There is no point in backing up the data paths, nor is there any point in scanning them for security purposes. See these docs for more info:

WARNING: Don’t modify anything within the data directory or run processes that might interfere with its contents. If something other than Elasticsearch modifies the contents of the data directory, then Elasticsearch may fail, reporting corruption or other data inconsistencies, or may appear to work correctly having silently lost some of your data. Don’t attempt to take filesystem backups of the data directory; there is no supported way to restore such a backup. Instead, use Snapshot and restore to take backups safely. Don’t run virus scanners on the data directory. A virus scanner can prevent Elasticsearch from working correctly and may modify the contents of the data directory. The data directory contains no executables so a virus scan will only find false positives.

