UnsupportedOperationException causes snapshots to fail

We have an Elasticsearch 7.9.1 cluster that suddenly started failing to take snapshots with a strange exception. Here are the details for one of the machines in the cluster:

$ uname -a
Linux [REDACTED] 4.4.0-1095-aws #106-Ubuntu SMP Wed Sep 18 13:33:48 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ curl localhost:9200
{
  "name" : "[REDACTED]",
  "cluster_name" : "[REDACTED]",
  "cluster_uuid" : "kHUFcGplQD-WbnpKmaO_9g",
  "version" : {
    "number" : "7.9.1",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "083627f112ba94dffc1232e8b42b73492789ef91",
    "build_date" : "2020-09-01T21:22:21.964974Z",
    "build_snapshot" : false,
    "lucene_version" : "8.6.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

Here's the exception details:

java.lang.UnsupportedOperationException: Old formats can't be used for writing
        at org.apache.lucene.codecs.lucene70.Lucene70SegmentInfoFormat.write(Lucene70SegmentInfoFormat.java:273) ~[lucene-backward-codecs-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e - ivera - 2020-08-26 10:54:32]
        at org.elasticsearch.snapshots.SourceOnlySnapshot.syncSegment(SourceOnlySnapshot.java:264) ~[?:?]
        at org.elasticsearch.snapshots.SourceOnlySnapshot.syncSnapshot(SourceOnlySnapshot.java:108) ~[?:?]
        at org.elasticsearch.snapshots.SourceOnlySnapshotRepository.snapshotShard(SourceOnlySnapshotRepository.java:170) ~[?:?]
        at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:340) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:256) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) [elasticsearch-7.9.1.jar:7.9.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]

This exception is coming directly from Lucene (the throw in Lucene70SegmentInfoFormat.write, the top frame of the trace above).

This seems to have started happening when we updated to version 7.9.0. I've tried deleting the most recent snapshot, starting over with a fresh repository, and updating to 7.9.1, but none of this changed the error.
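
For reference, here's roughly what those attempts looked like. The repository and snapshot names below are placeholders rather than our real ones; we use a source-only repository (hence the SourceOnlySnapshotRepository frames in the trace):

# Delete the most recent snapshot
$ curl -X DELETE "localhost:9200/_snapshot/old_repo/snapshot_123"

# Register a completely fresh source-only repository
# (the location must be listed in path.repo on every node)
$ curl -X PUT "localhost:9200/_snapshot/fresh_repo" \
    -H 'Content-Type: application/json' -d '
{
  "type": "source",
  "settings": {
    "delegate_type": "fs",
    "location": "/mnt/backups/fresh_repo"
  }
}'

# Snapshot into the fresh repository; this still fails with the same exception
$ curl -X PUT "localhost:9200/_snapshot/fresh_repo/snapshot_1?wait_for_completion=true"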

The index in question has segments with the following Lucene versions:

8.4.0
8.5.1
8.6.0
8.6.2 (this makes up the vast majority of segments)

All the segments in versions < 8.6.2 are committed.
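
For reference, these versions come from the index segments API; the index name below is a placeholder. Each segment entry reports its Lucene version and whether it is committed:

$ curl -s "localhost:9200/my-index/_segments?pretty" | grep -E '"(version|committed)"'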

The oldest snapshot in our repository was created on ES 7.4.0, but given that a completely blank repository doesn't work either, I'm inclined to believe this is a different issue.
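
(In case it's useful, this is how we checked: the snapshot listing reports the ES version that created each snapshot. The repository name below is a placeholder.)

$ curl -s "localhost:9200/_snapshot/our_repo/_all?pretty" | grep '"version" :'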

Trying to snapshot again results in the following exception:

java.nio.file.FileAlreadyExistsException: /mnt/istore/elasticsearch/data/nodes/0/indices/vDAnJnfiTkWOZ7nOv4GMDw/126/_snapshot/_phg7.fnm
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:94) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116) ~[?:?]
        at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:219) ~[?:?]
        at java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:478) ~[?:?]
        at java.nio.file.Files.newOutputStream(Files.java:224) ~[?:?]
        at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:410) ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e - ivera - 2020-08-26 10:53:36]
        at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:406) ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e - ivera - 2020-08-26 10:53:36]
        at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:254) ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e - ivera - 2020-08-26 10:53:36]
        at org.elasticsearch.snapshots.SourceOnlySnapshot$LinkedFilesDirectory.createOutput(SourceOnlySnapshot.java:352) ~[?:?]
        at org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43) ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e - ivera - 2020-08-26 10:53:36]
        at org.apache.lucene.codecs.lucene60.Lucene60FieldInfosFormat.write(Lucene60FieldInfosFormat.java:272) ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e - ivera - 2020-08-26 10:53:36]
        at org.elasticsearch.snapshots.SourceOnlySnapshot.syncSegment(SourceOnlySnapshot.java:229) ~[x-pack-core-7.9.1.jar:7.9.1]
        at org.elasticsearch.snapshots.SourceOnlySnapshot.syncSnapshot(SourceOnlySnapshot.java:108) ~[x-pack-core-7.9.1.jar:7.9.1]
        at org.elasticsearch.snapshots.SourceOnlySnapshotRepository.snapshotShard(SourceOnlySnapshotRepository.java:170) [x-pack-core-7.9.1.jar:7.9.1]
        at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:340) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:256) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) [elasticsearch-7.9.1.jar:7.9.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]

This leads me to believe that the first exception isn't handled properly, so it leaves some bad state behind in the shard's _snapshot directory.

At this point, it seems like my only hope is to rebuild the index, but it's quite large and that's something I'd like to avoid. Is it possible this is a bug introduced in a recent version?

Thanks,
Olivia
Academia.edu

Hi @OliviaTrewin

I think there is a backwards-compatibility issue here with the data structures that the source-only snapshot keeps on disk between snapshots: an outdated version of this data is not getting cleaned up.
I'm currently trying to reproduce this and will open an issue for it once I have. I think you should be able to work around this problem by deleting the _snapshot directories in your shard folders, e.g. /mnt/istore/elasticsearch/data/nodes/0/indices/vDAnJnfiTkWOZ7nOv4GMDw/126/_snapshot, so that the on-disk data structures are re-created with a supported Lucene version.
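
Something along these lines should locate them; this is only a sketch using the data path from your stack trace, so please double-check the list before deleting anything, and make sure no snapshot is running at the time:

# List the per-shard _snapshot directories under the node's data path
$ find /mnt/istore/elasticsearch/data/nodes/0/indices -type d -name _snapshot

# Once the list looks right, remove them; they'll be re-created on the next snapshot
$ find /mnt/istore/elasticsearch/data/nodes/0/indices -type d -name _snapshot -exec rm -r {} +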

EDIT: I opened https://github.com/elastic/elasticsearch/issues/62700 to track a fix for this.

Hi @Armin_Braun,

I did try deleting those folders: it solves the FileAlreadyExistsException, but the UnsupportedOperationException persists.

Thanks for testing this @OliviaTrewin!
Unfortunately, it appears that snapshotting indices containing segments written by older (BwC) Lucene versions is currently broken and will require a fix for the linked issue.
