Elasticsearch snapshot repository fails validation

I have a three-node Elasticsearch cluster with a shared NFS volume. I have set a path.repo value in elasticsearch.yml and restarted Elasticsearch (one node at a time, since the cluster needs to stay up). I have confirmed that all three nodes can read from and write to this path, like so:

Node1:

root@facility-es01:~# su -s /bin/bash - elasticsearch
su: warning: cannot change directory to /nonexistent: No such file or directory
elasticsearch@facility-es01:/root$ grep repo /etc/elasticsearch/elasticsearch.yml
path.repo: "/tmp/elastic"
elasticsearch@facility-es01:/root$ echo 'test1' >> /tmp/elastic/test
elasticsearch@facility-es01:/root$ ls -lha /tmp/elastic/
total 5.0K
drwxrwxrwx  2 elasticsearch elasticsearch    3 Jul 26 10:12 .
drwxrwxrwt 11 root          root          4.0K Jul 26 10:13 ..
-rw-rw-rw-  1 elasticsearch elasticsearch    6 Jul 26 10:12 test

Node2:

root@facility-es02:~# su -s /bin/bash - elasticsearch
su: warning: cannot change directory to /nonexistent: No such file or directory
elasticsearch@facility-es02:/root$ grep repo /etc/elasticsearch/elasticsearch.yml
path.repo: "/tmp/elastic"
elasticsearch@facility-es02:/root$ ls -lha /tmp/elastic/
total 5.0K
drwxrwxrwx  2 elasticsearch elasticsearch    3 Jul 26 10:12 .
drwxrwxrwt 11 root          root          4.0K Jul 26 10:15 ..
-rw-rw-rw-  1 elasticsearch elasticsearch    6 Jul 26 10:12 test
elasticsearch@facility-es02:/root$ cat /tmp/elastic/test
test1
elasticsearch@facility-es02:/root$ echo 'test2' >> /tmp/elastic/test
elasticsearch@facility-es02:/root$

Node3:

root@facility-es03:~# su -s /bin/bash - elasticsearch
su: warning: cannot change directory to /nonexistent: No such file or directory
elasticsearch@facility-es03:/root$ grep repo /etc/elasticsearch/elasticsearch.yml
path.repo: "/tmp/elastic"
elasticsearch@facility-es03:/root$ ls -lha /tmp/elastic/
total 5.0K
drwxrwxrwx  2 elasticsearch elasticsearch    3 Jul 26 10:12 .
drwxrwxrwt 11 root          root          4.0K Jul 26 10:16 ..
-rw-rw-rw-  1 elasticsearch elasticsearch   12 Jul 26 10:15 test
elasticsearch@facility-es03:/root$ echo 'test3' >> /tmp/elastic/test
elasticsearch@facility-es03:/root$ cat /tmp/elastic/test
test1
test2
test3

However, when I attempt to verify the repository, I still get an error as if the nodes cannot access files written to that path:

root@facility-es03:~# curl -s -XPOST 'localhost:9200/_snapshot/backup1/_verify'| jq
{
  "error": {
    "root_cause": [
      {
        "type": "repository_verification_exception",
        "reason": "[backup1] [[_AAzUpbdSLmH8Eu2fATC8w, 'RemoteTransportException[[facility-es03.ftc][10.40.8.53:9300][internal:admin/repository/verify]]; nested: RepositoryVerificationException[[backup1] a file written by master to the store [/tmp/elastic] cannot be accessed on the node [{facility-es03.ftc}{_AAzUpbdSLmH8Eu2fATC8w}{ZHNN9EcmTjWTzmcyDt-h_g}{10.40.8.53}{10.40.8.53:9300}{cdhilmrstw}{ml.machine_memory=12256530432, xpack.installed=true, transform.node=true, ml.max_open_jobs=20, ml.max_jvm_size=8589934592}]. This might indicate that the store [/tmp/elastic] is not shared between this node and the master node or that permissions on the store don't allow reading files written by the master node]; nested: NoSuchFileException[/tmp/elastic/tests-NDjRPrGHQD6AHCURtQV5Hg/master.dat];'], [PFSVaur8STWZap533rLEXg, 'RemoteTransportException[[facility-es01.ftc][10.40.8.51:9300][internal:admin/repository/verify]]; nested: RepositoryVerificationException[[backup1] a file written by master to the store [/tmp/elastic] cannot be accessed on the node [{facility-es01.ftc}{PFSVaur8STWZap533rLEXg}{_WH1JcQzQySVGxICnW8O7w}{10.40.8.51}{10.40.8.51:9300}{cdhilmrstw}{ml.machine_memory=12594253824, xpack.installed=true, transform.node=true, ml.max_open_jobs=20, ml.max_jvm_size=8589934592}]. This might indicate that the store [/tmp/elastic] is not shared between this node and the master node or that permissions on the store don't allow reading files written by the master node]; nested: NoSuchFileException[/tmp/elastic/tests-NDjRPrGHQD6AHCURtQV5Hg/master.dat];']]"
      }
    ],
    "type": "repository_verification_exception",
    "reason": "[backup1] [[_AAzUpbdSLmH8Eu2fATC8w, 'RemoteTransportException[[facility-es03.ftc][10.40.8.53:9300][internal:admin/repository/verify]]; nested: RepositoryVerificationException[[backup1] a file written by master to the store [/tmp/elastic] cannot be accessed on the node [{facility-es03.ftc}{_AAzUpbdSLmH8Eu2fATC8w}{ZHNN9EcmTjWTzmcyDt-h_g}{10.40.8.53}{10.40.8.53:9300}{cdhilmrstw}{ml.machine_memory=12256530432, xpack.installed=true, transform.node=true, ml.max_open_jobs=20, ml.max_jvm_size=8589934592}]. This might indicate that the store [/tmp/elastic] is not shared between this node and the master node or that permissions on the store don't allow reading files written by the master node]; nested: NoSuchFileException[/tmp/elastic/tests-NDjRPrGHQD6AHCURtQV5Hg/master.dat];'], [PFSVaur8STWZap533rLEXg, 'RemoteTransportException[[facility-es01.ftc][10.40.8.51:9300][internal:admin/repository/verify]]; nested: RepositoryVerificationException[[backup1] a file written by master to the store [/tmp/elastic] cannot be accessed on the node [{facility-es01.ftc}{PFSVaur8STWZap533rLEXg}{_WH1JcQzQySVGxICnW8O7w}{10.40.8.51}{10.40.8.51:9300}{cdhilmrstw}{ml.machine_memory=12594253824, xpack.installed=true, transform.node=true, ml.max_open_jobs=20, ml.max_jvm_size=8589934592}]. This might indicate that the store [/tmp/elastic] is not shared between this node and the master node or that permissions on the store don't allow reading files written by the master node]; nested: NoSuchFileException[/tmp/elastic/tests-NDjRPrGHQD6AHCURtQV5Hg/master.dat];']]"
  },
  "status": 500
}

What am I missing?

One big difference between the test you ran and what Elasticsearch does is timing: the node reads the master.dat file very soon after it has been written, maybe just a few milliseconds later. It's not doing anything magical, however; it just creates the file on one node and then tries to open it on the other one, so if this is the issue then it's something that needs fixing in your NFS installation.

Thanks for taking a look. Do you have any recommendations on how I'd even begin troubleshooting that? It's a pretty bog-standard Debian server running the nfs-kernel-server service. I've even added the sync option to my export to ensure the NFS server does not report the write as "finished" before it actually has the data (and, presumably, before it can serve that data out to other nodes). It seems to me that, as long as Elasticsearch isn't checking for the existence of the files on the other nodes before the write has finished on the master (which would be wrong behavior imo), this shouldn't be an issue.

This probably isn't the right place to get in-depth help with troubleshooting NFS consistency issues; at least, the main help I myself can offer is really just explaining what Elasticsearch is doing in terms that an NFS expert might understand. In detail, Elasticsearch does these steps before broadcasting the verification request to the other nodes:

  • open, write, and close a small temp file
  • fsync the file
  • rename the file to master.dat
  • fsync the containing directory

You could try reproducing the problem outside Elasticsearch with a little network service that does the same thing.
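For example, here is a rough, untested Python sketch of those four steps plus the read from a second node; the repository path and hostname are just placeholders for your environment. If the final read fails here too, you've reproduced the problem without Elasticsearch involved at all.

import os
import subprocess

REPO = "/tmp/elastic/repro"               # placeholder: a directory on the shared NFS mount
TMP = os.path.join(REPO, "pending.dat")   # small temp file written first
FINAL = os.path.join(REPO, "master.dat")  # name the other node will try to open

os.makedirs(REPO, exist_ok=True)

# 1. open, write, and close a small temp file
with open(TMP, "wb") as f:
    f.write(b"repro")
    # 2. fsync the file
    f.flush()
    os.fsync(f.fileno())

# 3. rename the file to master.dat
os.rename(TMP, FINAL)

# 4. fsync the containing directory
dir_fd = os.open(REPO, os.O_RDONLY)
try:
    os.fsync(dir_fd)
finally:
    os.close(dir_fd)

# Then immediately ask another node to open the file, much as the verify step does.
subprocess.run(["ssh", "facility-es01", "cat " + FINAL], check=True)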

I've run a tcpdump on the NFS server to see whether there are timing issues, and it doesn't appear that's the problem here. I started the packet trace, then tried a verify from facility-es02 (which is currently my master) again:

root@backup0:/pool0# tcpdump -vvv -i bond0 '(host facility-es01 or host facility-es02 or host facility-es03)'
tcpdump: listening on bond0, link-type EN10MB (Ethernet), capture size 262144 bytes
10:52:43.128092 IP (tos 0x0, ttl 64, id 4917, offset 0, flags [DF], proto TCP (6), length 52)
    facility-es02.ftc.700 > backup0.ftc.nfs: Flags [.], cksum 0x7896 (correct), seq 1387082109, ack 1903143428, win 1183, options [nop,nop,TS val 1944639737 ecr 2979736607], length 0
10:52:43.128122 IP (tos 0x0, ttl 64, id 45593, offset 0, flags [DF], proto TCP (6), length 52)
    backup0.ftc.nfs > facility-es02.ftc.700: Flags [.], cksum 0x24b6 (incorrect -> 0xef87), seq 1, ack 1, win 1452, options [nop,nop,TS val 2979767327 ecr 1944578298], length 0
10:52:43.128142 IP (tos 0x0, ttl 64, id 4918, offset 0, flags [DF], proto TCP (6), length 180)
    facility-es02.ftc.700 > backup0.ftc.nfs: Flags [P.], cksum 0x97c3 (correct), seq 1:129, ack 1, win 1183, options [nop,nop,TS val 1944639737 ecr 2979736607], length 128: NFS request xid 2310890554 124 getattr fh 0,2/53
10:52:43.128264 IP (tos 0x0, ttl 64, id 45594, offset 0, flags [DF], proto TCP (6), length 136)
    backup0.ftc.nfs > facility-es02.ftc.700: Flags [P.], cksum 0x250a (incorrect -> 0xb600), seq 1:85, ack 129, win 1452, options [nop,nop,TS val 2979767328 ecr 1944639737], length 84: NFS reply xid 2310890554 reply ok 80 getattr NON 1 ids 0/549634141 sz -663226682
10:52:43.128356 IP (tos 0x0, ttl 64, id 4919, offset 0, flags [DF], proto TCP (6), length 52)
    facility-es02.ftc.700 > backup0.ftc.nfs: Flags [.], cksum 0xffbe (correct), seq 129, ack 85, win 1183, options [nop,nop,TS val 1944639738 ecr 2979767328], length 0
...<several seconds elapse with no more packets. The test has already failed>...

As you can see, the first few packets show the master talking to the NFS server and the NFS server responding, but none of the other hosts ever send any packets to the NFS server. It appears that Elasticsearch isn't actually checking whether the test files exist at all before "failing" the test. Is this sufficient to prove that the issue is with Elasticsearch and not the NFS server?

I don't think so. Elasticsearch doesn't actually care what the underlying filesystem is; it's just calling open() to read the master.dat file, which you should be able to verify with strace. All the network traffic and so on is the responsibility of the NFS configuration.

strace is a good idea, and it actually proves my point that Elasticsearch never even tries to verify that the files exist from any of the nodes. I run this and start a repo verify on my master, and it never prints anything:

root@facility-es03:~# ps faux | grep elastic
root      7330  0.0  0.0   6076   884 pts/0    S+   11:08   0:00          \_ grep elastic
elastic+  5354 32.2 76.3 31660156 9314560 ?    Ssl  Jul23 1298:59 /usr/share/elasticsearch/jdk/bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT -Dfile.encoding=UTF-8 -Dio.netty.noKeySetOptimization=true -Dio.netty.noUnsafe=true -Dio.netty.recycler.maxCapacityPerThread=0 -Djava.awt.headless=true -Djna.nosys=true -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -XX:+AlwaysPreTouch -XX:+HeapDumpOnOutOfMemoryError -XX:-OmitStackTraceInFastThrow -Xms8g -Xmx8g -Xss1m -server -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -XX:MaxGCPauseMillis=100 -XX:MaxDirectMemorySize=4294967296 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=default -Des.distribution.type=deb -Des.bundled_jdk=true -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet
elastic+  5577  0.0  0.0 110436  5180 ?        Sl   Jul23   0:00  \_ /usr/share/elasticsearch/modules/x-pack-ml/platform/linux-x86_64/bin/controller
root@facility-es03:~# strace -e trace=open -p 5354
strace: Process 5354 attached

So it really does seem like Elasticsearch is not even attempting to validate the repo on any of the nodes.

Ah, one thing I failed to mention is that it first creates a directory.

I can't believe it does all those steps with a single 128-byte packet.

I suspect that strace without -f only traces syscalls on the main thread, which in Elasticsearch does nothing.

Ah, I forgot about -f, thanks. Adding that flag does in fact produce much more output, but I'm not sure what to make of the segfaults in the ES threads:

root@facility-es03:~# strace -e trace=open -f -p 5354
strace: Process 5354 attached with 100 threads
[pid  5837] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
strace: Process 8809 attached
strace: Process 8810 attached
[pid  8810] +++ exited with 0 +++
[pid  8809] +++ exited with 0 +++
[pid  5965] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
strace: Process 8828 attached
[pid  8828] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  8828] +++ exited with 0 +++
[pid  5924] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  6001] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  6000] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  5965] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
strace: Process 8851 attached
[pid  8851] +++ exited with 0 +++
[pid  5924] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  5837] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  6000] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  5999] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  5924] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  6001] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
strace: Process 8904 attached
[pid  8904] +++ exited with 0 +++
strace: Process 8931 attached
[pid  8931] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  8931] +++ exited with 0 +++
[pid  6045] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
strace: Process 8971 attached
[pid  8971] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  8971] +++ exited with 0 +++
[pid  5966] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  6045] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
strace: Process 9006 attached
[pid  9006] +++ exited with 0 +++
[pid  5837] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  5924] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
strace: Process 9049 attached
strace: Process 9050 attached
[pid  9049] +++ exited with 0 +++
[pid  9050] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  9050] +++ exited with 0 +++
[pid  5737] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  5848] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
strace: Process 9087 attached
[pid  9087] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  5924] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  9087] +++ exited with 0 +++
[pid  5800] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  5925] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  5924] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  5924] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  5924] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
strace: Process 9111 attached
[pid  9111] +++ exited with 0 +++
strace: Process 9117 attached
[pid  9117] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  9117] +++ exited with 0 +++
[pid  5925] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  5737] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
strace: Process 9143 attached
[pid  9143] +++ exited with 0 +++
[pid  5790] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
strace: Process 9150 attached
[pid  9150] +++ exited with 0 +++
[pid  5966] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
strace: Process 9186 attached
[pid  6001] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  6000] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  5965] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  9186] +++ exited with 0 +++
[pid  5737] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  6045] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
strace: Process 9217 attached
[pid  9217] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  9217] +++ exited with 0 +++
[pid  5790] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  5924] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
strace: Process 9232 attached
[pid  9232] +++ exited with 0 +++
[pid  5924] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  5965] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
strace: Process 9335 attached
[pid  9335] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  9335] +++ exited with 0 +++
strace: Process 9368 attached
[pid  9368] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  5924] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
strace: Process 9393 attached
[pid  9393] +++ exited with 0 +++
[pid  9368] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  9368] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  5999] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  5924] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  9368] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  9368] +++ exited with 0 +++
strace: Process 9420 attached
[pid  9420] +++ exited with 0 +++
[pid  5965] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  5790] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---
[pid  5924] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7fcfac195008} ---

Something seems seriously wrong.

Isn't that normal when you're using mmap a lot (as Elasticsearch does)?

Edit: actually, maybe not; mmap and page faults go together, but probably not SIGSEGV. ISTR this could be due to JVM trickery instead: sometimes the JIT compiler decides that a code path is never hit and replaces it with an illegal memory access, so that if the path is hit in the future it raises a SIGSEGV, which triggers a recompilation.

I'm honestly not sure, as I'm kinda reaching the limits of my programming knowledge. If that's expected behavior, I'll ignore it. I'm able to filter out a lot of the noise by adding -y to strace and grepping for the path, like this:

root@facility-es03:~# strace -e trace=open,write -f -y -p 5354 2>&1 | grep '/tmp/elastic'
[pid 15325] write(1410</tmp/elastic/tests-XH_1Vly2Tfubeb2qnMrhmA/data-_AAzUpbdSLmH8Eu2fATC8w.dat>, "XH_1Vly2Tfubeb2qnMrhmA", 22) = 22

That single line does show up every time I POST to 'localhost:9200/_snapshot/backup1/_verify', which does seem correct. I'm not sure how to make my NFS server behave more correctly than by simply adding sync, so I suppose I'm back to that again. I'm not really sure what else to do. This NFS server has a number of exports to other unrelated (non-Elasticsearch) clustered services, like a hypervisor, that don't seem to have any timing issues. Perhaps, in the case of a verification, it might make sense for Elasticsearch to look for the files a few times with a backoff-style wait between attempts before fully "failing" the check? It doesn't make a ton of sense that a simple snapshot repository needs to be this sensitive to sync issues.
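To illustrate the kind of backoff-style check I mean (this is just a sketch of the idea, not something Elasticsearch actually does; the path is taken from the verify error above):

import os
import time

def wait_for_file(path, attempts=5, initial_delay=0.1):
    """Return True once path exists, retrying with exponential backoff (~3s total)."""
    delay = initial_delay
    for _ in range(attempts):
        if os.path.exists(path):
            return True
        time.sleep(delay)
        delay *= 2
    return False

print(wait_for_file("/tmp/elastic/tests-NDjRPrGHQD6AHCURtQV5Hg/master.dat"))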

The trouble is that if the repository doesn't have proper read-after-write semantics then other things will fail in the future, likely leading to corrupt or lost snapshots. Better to find out about it now.

The proper solution, then, is to implement file-level locking within Elasticsearch, like most other software does, since it appears from my searching that most people consider NFS locking unreliable.

Which is really my way of saying that if I want that feature, I should submit a pull request. In the meantime I guess I'll try to figure out a way to get my NFS server to behave better.

File locking in NFS is indeed tricky to configure correctly, but that's not relevant here: there are no locks in play, nor could this be solved by a different lock implementation.

For what it's worth, remounting the NFS volume on my clients with the sync,noac,lookupcache=none options set did not help. If anyone has any suggestions on how else to make this work, I'm all ears. Those options should remove ALL caching on the client side, and on the server the export itself also has sync. I don't know where else this could be getting confused.

Here is a bit of fun, too, showing that a write to a file from a different server is visible a mere five thousandths of a second later:

root@facility-es02:~# ssh facility-es03 'date +"%T.%N" > /tmp/elastic/test2' && date +"%T.%N"; cat /tmp/elastic/test2
13:47:27.612315458
13:47:27.607783656

Just how fast is Elasticsearch checking? Is it really faster than this?

Just for a sense of scale, here is the ntpq output for both of those nodes:

root@facility-es02:~# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ntpserver.ftc   .GPS.            1 u  907 1024  377    0.384   -0.229   0.287
+paradox-02.ftc  10.40.200.33     2 u  100 1024  377    0.185   -0.055   0.144
+paradox-01.ftc  10.40.200.33     2 u 1053 1024  377    0.287   -0.069   0.086
root@facility-es03:~# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ntpserver.ftc   .GPS.            1 u   90 1024  377    0.470   -0.066   0.231
+paradox-02.ftc  10.40.200.33     2 u  951 1024  377    0.166    0.097   0.192
+paradox-01.ftc  10.40.200.33     2 u  901 1024  377    0.228    0.121   0.070

The delta there is 4.5 ms, which is about a 220th of a second. It's certainly plausible that Elasticsearch is faster than that; sub-millisecond responses are possible.
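If you want to take the clock offsets out of the picture entirely, you could time everything against a single node's monotonic clock: trigger the remote write over ssh and measure how long it takes until the content is visible locally. A rough sketch, where the hostname and path are placeholders and the measured time includes the ssh overhead:

import os
import subprocess
import time

REMOTE = "facility-es03"            # placeholder: any other node in the cluster
PATH = "/tmp/elastic/latency-test"  # placeholder: a file on the shared NFS mount

# Start from a clean slate so we don't read a stale copy of the file.
try:
    os.remove(PATH)
except FileNotFoundError:
    pass

start = time.monotonic()
subprocess.run(["ssh", REMOTE, "echo hello > " + PATH], check=True)

# Poll locally until the remote write becomes visible.
while True:
    try:
        with open(PATH) as f:
            if f.read().strip() == "hello":
                break
    except FileNotFoundError:
        pass
    time.sleep(0.001)

print("visible after %.3fs (includes ssh overhead)" % (time.monotonic() - start))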