Snapshot threads are sleeping

I am seeing a lot of snapshot threads in sleep in the hot threads output. What does this mean? The FS repository has all default settings. I am also seeing a lot of snapshot threads active and queued in the thread pool, but I only have the snapshot script running from the master nodes (3). There are 47 data nodes across 13 machines.
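For reference, the hot threads output below and the thread pool numbers come from the standard APIs, something like this (the exact parameters are approximate, not copied verbatim from our monitoring):

    GET _nodes/hot_threads?threads=10
    GET _cat/thread_pool/snapshot?v&h=node_name,name,active,queue,rejected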

2.4% (12.1ms out of 500ms) cpu usage by thread 'elasticsearch[c346syd-data-9203][snapshot][T#3549]'
10/10 snapshots sharing following 20 elements
java.lang.Thread.sleep(Native Method)
java.lang.Thread.sleep(Thread.java:340)
org.apache.lucene.store.RateLimiter$SimpleRateLimiter.pause(RateLimiter.java:153)
org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.maybePause(RateLimitingInputStream.java:52)
org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.read(RateLimitingInputStream.java:71)
org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext$AbortableInputStream.read(BlobStoreRepository.java:1421)
java.io.FilterInputStream.read(FilterInputStream.java:107)
org.elasticsearch.common.io.Streams.copy(Streams.java:76)
org.elasticsearch.common.blobstore.fs.FsBlobContainer.writeBlob(FsBlobContainer.java:131)
org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshotFile(BlobStoreRepository.java:1354)
org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1296)
org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:893)
org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:377)
org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:87)
org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:333)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:527)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

2.4% (12.1ms out of 500ms) cpu usage by thread 'elasticsearch[c346syd-data-9203][snapshot][T#3546]'
 10/10 snapshots sharing following 20 elements
   java.lang.Thread.sleep(Native Method)
   java.lang.Thread.sleep(Thread.java:340)
   org.apache.lucene.store.RateLimiter$SimpleRateLimiter.pause(RateLimiter.java:153)
   org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.maybePause(RateLimitingInputStream.java:52)

Trying to understand how snapshots work under the hood. The issue is that the snapshot time keeps increasing: it has gone from ~15 minutes to ~3 hours, and it also tends to write a large backup file. We only keep the last 3 snapshots in the same repo.
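We can see the duration growing in the snapshot listing; this is just the standard list call against our repo:

    GET _snapshot/CaymanAuditQEDB/_all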

Snapshots are throttled to 40 MB per second by default, see https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html (setting max_snapshot_bytes_per_sec). I guess you are running into this limit, which is why the writes are paused. You can disable the limit by setting it to 0, or increase it, and then retry.
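For example, on a shared file system repository you can re-register it with a higher limit and its settings will be updated in place. The repository name and path below are placeholders, adjust them to your own repo:

    PUT _snapshot/my_fs_backup
    {
      "type": "fs",
      "settings": {
        "location": "/mount/backups/my_fs_backup",
        "max_snapshot_bytes_per_sec": "200mb"
      }
    }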

Thanks for the response. We are also seeing the error below on all successfully completed snapshots when I try GET _snapshot/CaymanAuditQEDB/<any_snapshot_name>/_status.
We were seeing the same error while trying to delete old snapshots too. And although the snapshots report as completed, upon restoring them the cluster changes to red with the error below:

[2019-06-07T13:38:25.674Z][INFO][org.elasticsearch.cluster.routing.allocation.AllocationService][c732uyu-master-9200] Cluster health status changed from [YELLOW] to [RED] (reason: [async_shard_fetch])

{
  "error": {
    "root_cause": [
      {
        "type": "index_shard_restore_failed_exception",
        "reason": "failed to read shard snapshot file",
        "index_uuid": "210psLqQS6mNFZN-AzAhYA",
        "shard": "5",
        "index": "audit_qedshared_5519fbef53402c50e10080000acd0255_aa50565206214c59beed18819f3db24b-000001"
      }
    ],
    "type": "index_shard_restore_failed_exception",
    "reason": "failed to read shard snapshot file",
    "index_uuid": "210psLqQS6mNFZN-AzAhYA",
    "shard": "5",
    "index": "audit_qedshared_5519fbef53402c50e10080000acd0255_aa50565206214c59beed18819f3db24b-000001",
    "caused_by": {
      "type": "i_o_exception",
      "reason": "org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=958125580 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(ChecksumBlobStoreFormat.readBlob(blob="snap-eK0kNVeHSPm91PsWzrXbrg.dat")))",
      "caused_by": {
        "type": "corrupt_index_exception",
        "reason": "codec footer mismatch (file truncated?): actual footer=958125580 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(ChecksumBlobStoreFormat.readBlob(blob="snap-eK0kNVeHSPm91PsWzrXbrg.dat")))"
      }
    }
  },
  "status": 500
}

This index looks healthy though.
Any pointers?

Please take your time to properly format your messages with markdown; this makes them 10x easier to read. Thanks!

It seems that a file has been corrupted on the snapshot disk or has not been properly written. Is this happening with all of your snapshots? Also what Elasticsearch version are you using and what type of snapshot repo is this?
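It might also be worth verifying that every node can read and write the repository location, e.g. with the verify repository API (note this only checks access to the repo path, it does not validate blobs that were already written):

    POST _snapshot/CaymanAuditQEDB/_verify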

Yes, this is happening with all my snapshots. The ES version is 5.1.2 and it's a shared file system repository. During the snapshot process we don't see any errors/warnings, but when we issue a delete snapshot we see "codec footer mismatch (file truncated?)" in the ES log.
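The delete itself is just the standard call, e.g. (the snapshot name is a placeholder):

    DELETE _snapshot/CaymanAuditQEDB/<old_snapshot_name>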

Do you have any chance to try this with the latest 5.x release? A quick GitHub PR search uncovered a few snapshot/restore issues that have since been fixed.

Also, can you possibly snapshot to a different filesystem to make sure there is no hard disk corruption?
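For example, register a second FS repository on a different disk or mount and take a test snapshot there; the names and path below are placeholders:

    PUT _snapshot/test_repo
    {
      "type": "fs",
      "settings": {
        "location": "/mnt/other_disk/es_test_repo"
      }
    }

    PUT _snapshot/test_repo/test_snapshot_1?wait_for_completion=true
    GET _snapshot/test_repo/test_snapshot_1/_status

Remember that the new location also has to be listed under path.repo in elasticsearch.yml on every node.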

We cannot change the ES version currently, but we will try to take a snapshot to a different filesystem.

A question about the "codec footer mismatch (file truncated?)" error: I have noticed that all these warnings are coming from a specific index and shard. Do you think it could be something wrong with the index? We can search this index just fine though.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.