CorruptStateException during snapshot process in ES 5.1.2

Hi,
I have a script that runs the snapshot process for a 5.1.2 cluster, set to snapshot every 15 minutes. The script first takes a snapshot if one is not already running and, once it completes successfully, deletes old snapshots from the repository. Currently we keep the last 12 hours of snapshots in the same repository. Over time the total snapshot time has increased from under 15 minutes to over 45 minutes, and I am seeing a lot of the log messages below. I am trying to understand what this stack trace means. Does this error occur while deleting the old snapshots from the repository, or while taking the new snapshot? And can you please suggest what could be done to fix it?
Looking at the ES code path (ChecksumBlobStoreFormat.readBlob), it seems to come up while reading the snap file. Does that mean the snap file itself is corrupted?
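
For reference, here is roughly what the script does each cycle. This is a minimal sketch in Python; the endpoint, the repository name my_fs_repo, and the retention details are placeholders, not the real script:

    # Sketch of the 15-minute snapshot cycle described above.
    # "my_fs_repo" and the endpoint are placeholders.
    import datetime
    import requests

    ES = "http://localhost:9200"
    REPO = "my_fs_repo"
    RETENTION = datetime.timedelta(hours=12)

    def snapshot_in_progress():
        # _snapshot/<repo>/_current lists snapshots that are still running
        r = requests.get(f"{ES}/_snapshot/{REPO}/_current")
        r.raise_for_status()
        return len(r.json()["snapshots"]) > 0

    def take_snapshot():
        name = "snap-" + datetime.datetime.utcnow().strftime("%Y%m%d%H%M")
        # wait_for_completion=true blocks until the snapshot finishes
        r = requests.put(f"{ES}/_snapshot/{REPO}/{name}",
                         params={"wait_for_completion": "true"})
        r.raise_for_status()
        return r.json()["snapshot"]["state"] == "SUCCESS"

    def delete_old_snapshots():
        cutoff_ms = (datetime.datetime.utcnow() - RETENTION).timestamp() * 1000
        r = requests.get(f"{ES}/_snapshot/{REPO}/_all")
        r.raise_for_status()
        for snap in r.json()["snapshots"]:
            if snap["start_time_in_millis"] < cutoff_ms:
                requests.delete(f"{ES}/_snapshot/{REPO}/{snap['snapshot']}")

    if not snapshot_in_progress() and take_snapshot():
        delete_old_snapshots()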

[2019-04-29T16:20:31.624Z][WARN][org.elasticsearch.repositories.fs.FsRepository][c908cjr-master-9200] failed to read commit point [snap-FWBZ0G4CQvWLyzUL6opA8Q.dat]
org.elasticsearch.gateway.CorruptStateException: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=1542443756 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(ChecksumBlobStoreFormat.readBlob(blob="snap-FWBZ0G4CQvWLyzUL6opA8Q.dat")))
at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.readBlob(ChecksumBlobStoreFormat.java:114) ~[elasticsearch-5.1.2.jar:5.1.2]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.buildBlobStoreIndexShardSnapshots(BlobStoreRepository.java:1172) [elasticsearch-5.1.2.jar:5.1.2]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.delete(BlobStoreRepository.java:1013) [elasticsearch-5.1.2.jar:5.1.2]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.delete(BlobStoreRepository.java:959) [elasticsearch-5.1.2.jar:5.1.2]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.deleteSnapshot(BlobStoreRepository.java:435) [elasticsearch-5.1.2.jar:5.1.2]
at org.elasticsearch.snapshots.SnapshotsService.lambda$deleteSnapshotFromRepository$5(SnapshotsService.java:1132) [elasticsearch-5.1.2.jar:5.1.2]
at org.elasticsearch.snapshots.SnapshotsService$$Lambda$1573/2137837030.run(Unknown Source) [elasticsearch-5.1.2.jar:5.1.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:458) [elasticsearch-5.1.2.jar:5.1.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_45]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_45]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_45]
Caused by: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=1542443756 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(ChecksumBlobStoreFormat.readBlob(blob="snap-FWBZ0G4CQvWLyzUL6opA8Q.dat")))
at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:499) ~[lucene-core-6.3.0.jar:6.3.0 a66a44513ee8191e25b477372094bfa846450316 - shalin - 2016-11-02 19:47:11]
at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:411) ~[lucene-core-6.3.0.jar:6.3.0 a66a44513ee8191e25b477372094bfa846450316 - shalin - 2016-11-02 19:47:11]
at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:520) ~[lucene-core-6.3.0.jar:6.3.0 a66a44513ee8191e25b477372094bfa846450316 - shalin - 2016-11-02 19:47:11]
at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.readBlob(ChecksumBlobStoreFormat.java:106) ~[elasticsearch-5.1.2.jar:5.1.2]
... 10 more
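
In case it helps narrow this down, would running the repository verification API rule out a problem with the repository mount itself? A sketch of the call (the repository name is a placeholder):

    # Sketch: ask every node to write a test blob to the repository.
    # "my_fs_repo" is a placeholder for our actual repository name.
    import requests

    r = requests.post("http://localhost:9200/_snapshot/my_fs_repo/_verify")
    r.raise_for_status()
    # Lists the nodes that could successfully access the repository
    print(r.json()["nodes"])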

@DavidTurner Hi David, sorry for tagging you here, but feeling a bit stuck, and having found you on a different snapshot-related discussion, I thought I would reach out in case you have a pointer.

From the hot threads output I am seeing a lot of these:

0.7% (3.3ms out of 500ms) cpu usage by thread 'elasticsearch[c346syd-data-9202][snapshot][T#3426]'
 10/10 snapshots sharing following 20 elements
   java.lang.Thread.sleep(Native Method)
   java.lang.Thread.sleep(Thread.java:340)
   org.apache.lucene.store.RateLimiter$SimpleRateLimiter.pause(RateLimiter.java:153)
   org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.maybePause(RateLimitingInputStream.java:52)
   org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.read(RateLimitingInputStream.java:71)
   org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext$AbortableInputStream.read(BlobStoreRepository.java:1421)
   java.io.FilterInputStream.read(FilterInputStream.java:107)
   org.elasticsearch.common.io.Streams.copy(Streams.java:76)
   org.elasticsearch.common.blobstore.fs.FsBlobContainer.writeBlob(FsBlobContainer.java:131)
   org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshotFile(BlobStoreRepository.java:1354)
   org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1296)
   org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:893)
   org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:377)
   org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:87)
   org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:333)
   org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:527)
   org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   java.lang.Thread.run(Thread.java:745)
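
These frames look like the snapshot threads spending their time asleep in the repository's snapshot rate limiter, which as far as I know is controlled by the repository setting max_snapshot_bytes_per_sec (40mb per node by default). In case it's relevant, here is a sketch of how the repository could be re-registered with a higher throttle; the name, path, and value are placeholders:

    # Sketch: re-register the FS repository with a higher snapshot throttle.
    # The hot-threads stack above shows threads sleeping in RateLimiter.pause,
    # which is governed by max_snapshot_bytes_per_sec (default 40mb per node).
    # Repository name, location, and the new value are placeholders.
    import requests

    body = {
        "type": "fs",
        "settings": {
            "location": "/mnt/es_backups",           # placeholder path
            "max_snapshot_bytes_per_sec": "100mb",   # raised from 40mb default
        },
    }
    r = requests.put("http://localhost:9200/_snapshot/my_fs_repo", json=body)
    r.raise_for_status()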

Can you please help me understand this?
