UnsupportedOperationException causes snapshots to fail

We have an Elasticsearch 7.9.1 cluster that suddenly started failing to take snapshots with a strange exception. Here are the details for one of the machines in the cluster:

$ uname -a
Linux [REDACTED] 4.4.0-1095-aws #106-Ubuntu SMP Wed Sep 18 13:33:48 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ curl localhost:9200
{
  "name" : "[REDACTED]",
  "cluster_name" : "[REDACTED]",
  "cluster_uuid" : "kHUFcGplQD-WbnpKmaO_9g",
  "version" : {
    "number" : "7.9.1",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "083627f112ba94dffc1232e8b42b73492789ef91",
    "build_date" : "2020-09-01T21:22:21.964974Z",
    "build_snapshot" : false,
    "lucene_version" : "8.6.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

Here's the exception details:

java.lang.UnsupportedOperationException: Old formats can't be used for writing
        at org.apache.lucene.codecs.lucene70.Lucene70SegmentInfoFormat.write(Lucene70SegmentInfoFormat.java:273) ~[lucene-backward-codecs-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e - ivera - 2020-08-26 10:54:32]
        at org.elasticsearch.snapshots.SourceOnlySnapshot.syncSegment(SourceOnlySnapshot.java:264) ~[?:?]
        at org.elasticsearch.snapshots.SourceOnlySnapshot.syncSnapshot(SourceOnlySnapshot.java:108) ~[?:?]
        at org.elasticsearch.snapshots.SourceOnlySnapshotRepository.snapshotShard(SourceOnlySnapshotRepository.java:170) ~[?:?]
        at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:340) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:256) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) [elasticsearch-7.9.1.jar:7.9.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]

This exception is coming directly from Lucene (the throw in Lucene70SegmentInfoFormat.write, the top frame of the trace above).

This seems to have started happening when we updated to version 7.9.0. I've tried deleting the most recent snapshot, starting over with a fresh repository, and updating to 7.9.1, but none of this changed the error.
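
For reference, here's roughly what those attempts looked like. The repository and snapshot names below are placeholders rather than our real ones; we use a source-only repository (hence the SourceOnlySnapshotRepository frames in the trace):

# Delete the most recent snapshot
$ curl -X DELETE "localhost:9200/_snapshot/old_repo/snapshot_123"

# Register a completely fresh source-only repository
# (the location must be listed in path.repo on every node)
$ curl -X PUT "localhost:9200/_snapshot/fresh_repo" \
    -H 'Content-Type: application/json' -d '
{
  "type": "source",
  "settings": {
    "delegate_type": "fs",
    "location": "/mnt/backups/fresh_repo"
  }
}'

# Snapshot into the fresh repository; this still fails with the same exception
$ curl -X PUT "localhost:9200/_snapshot/fresh_repo/snapshot_1?wait_for_completion=true"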

The index in question has segments with the following Lucene versions:

8.4.0
8.5.1
8.6.0
8.6.2 (this makes up the vast majority of segments)

All the segments in versions < 8.6.2 are committed.
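
For reference, these versions come from the index segments API; the index name below is a placeholder. Each segment entry reports its Lucene version and whether it is committed:

$ curl -s "localhost:9200/my-index/_segments?pretty" | grep -E '"(version|committed)"'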

The oldest snapshot in our repository was created on ES 7.4.0, but given that a completely blank repository doesn't work either, I'm inclined to believe this is a different issue.
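
(In case it's useful, this is how we checked: the snapshot listing reports the ES version that created each snapshot. The repository name below is a placeholder.)

$ curl -s "localhost:9200/_snapshot/our_repo/_all?pretty" | grep '"version" :'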

Trying to snapshot again results in the following exception:

java.nio.file.FileAlreadyExistsException: /mnt/istore/elasticsearch/data/nodes/0/indices/vDAnJnfiTkWOZ7nOv4GMDw/126/_snapshot/_phg7.fnm
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:94) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116) ~[?:?]
        at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:219) ~[?:?]
        at java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:478) ~[?:?]
        at java.nio.file.Files.newOutputStream(Files.java:224) ~[?:?]
        at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:410) ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e - ivera - 2020-08-26 10:53:36]
        at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:406) ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e - ivera - 2020-08-26 10:53:36]
        at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:254) ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e - ivera - 2020-08-26 10:53:36]
        at org.elasticsearch.snapshots.SourceOnlySnapshot$LinkedFilesDirectory.createOutput(SourceOnlySnapshot.java:352) ~[?:?]
        at org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43) ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e - ivera - 2020-08-26 10:53:36]
        at org.apache.lucene.codecs.lucene60.Lucene60FieldInfosFormat.write(Lucene60FieldInfosFormat.java:272) ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e - ivera - 2020-08-26 10:53:36]
        at org.elasticsearch.snapshots.SourceOnlySnapshot.syncSegment(SourceOnlySnapshot.java:229) ~[x-pack-core-7.9.1.jar:7.9.1]
        at org.elasticsearch.snapshots.SourceOnlySnapshot.syncSnapshot(SourceOnlySnapshot.java:108) ~[x-pack-core-7.9.1.jar:7.9.1]
        at org.elasticsearch.snapshots.SourceOnlySnapshotRepository.snapshotShard(SourceOnlySnapshotRepository.java:170) [x-pack-core-7.9.1.jar:7.9.1]
        at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:340) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:256) [elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) [elasticsearch-7.9.1.jar:7.9.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]

This leads me to believe that the first exception isn't handled properly, so it leaves some bad state behind in the shard's _snapshot directory.

At this point, it seems like my only hope is to rebuild the index, but it's quite large and that's something I'd like to avoid. Is it possible this is a bug introduced in a recent version?

Thanks,
Olivia
Academia.edu

Hi @OliviaTrewin

I think there is a backwards-compatibility issue here with the data structures that the source-only snapshot keeps on disk between snapshots: an outdated version of this data is not getting cleaned up.
I'm currently trying to reproduce this and will open an issue for it once I have. I think you should be able to work around this problem by deleting the _snapshot directories in your shard folders, e.g. /mnt/istore/elasticsearch/data/nodes/0/indices/vDAnJnfiTkWOZ7nOv4GMDw/126/_snapshot, so that the on-disk data structures are re-created with a supported Lucene version.
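
Something along these lines should locate them; this is only a sketch using the data path from your stack trace, so please double-check the list before deleting anything, and make sure no snapshot is running at the time:

# List the per-shard _snapshot directories under the node's data path
$ find /mnt/istore/elasticsearch/data/nodes/0/indices -type d -name _snapshot

# Once the list looks right, remove them; they'll be re-created on the next snapshot
$ find /mnt/istore/elasticsearch/data/nodes/0/indices -type d -name _snapshot -exec rm -r {} +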

EDIT: I opened https://github.com/elastic/elasticsearch/issues/62700 to track a fix for this.

Hi @Armin_Braun,

I did try deleting those folders: it solves the FileAlreadyExistsException, but the UnsupportedOperationException persists.

Thanks for testing this @OliviaTrewin!
Unfortunately, it appears that snapshotting indices containing segments written by older (BwC) Lucene versions is currently broken and will require a fix for the linked issue.
