Snapshot threads are sleeping

I am seeing a lot of snapshot threads in sleep in the hot threads output. What does this mean? The FS repository has all default settings. I am also seeing a lot of snapshot threads active and queued in the thread pool, but I only have the snapshot script running from the master nodes (3). There are 47 data nodes across 13 machines.
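For reference, the hot threads output below and the thread pool numbers come from the standard APIs, something like this (the exact parameters are approximate, not copied verbatim from our monitoring):

    GET _nodes/hot_threads?threads=10
    GET _cat/thread_pool/snapshot?v&h=node_name,name,active,queue,rejected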

2.4% (12.1ms out of 500ms) cpu usage by thread 'elasticsearch[c346syd-data-9203][snapshot][T#3549]'
10/10 snapshots sharing following 20 elements
java.lang.Thread.sleep(Native Method)
java.lang.Thread.sleep(Thread.java:340)
org.apache.lucene.store.RateLimiter$SimpleRateLimiter.pause(RateLimiter.java:153)
org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.maybePause(RateLimitingInputStream.java:52)
org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.read(RateLimitingInputStream.java:71)
org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext$AbortableInputStream.read(BlobStoreRepository.java:1421)
java.io.FilterInputStream.read(FilterInputStream.java:107)
org.elasticsearch.common.io.Streams.copy(Streams.java:76)
org.elasticsearch.common.blobstore.fs.FsBlobContainer.writeBlob(FsBlobContainer.java:131)
org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshotFile(BlobStoreRepository.java:1354)
org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1296)
org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:893)
org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:377)
org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:87)
org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:333)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:527)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

2.4% (12.1ms out of 500ms) cpu usage by thread 'elasticsearch[c346syd-data-9203][snapshot][T#3546]'
 10/10 snapshots sharing following 20 elements
   java.lang.Thread.sleep(Native Method)
   java.lang.Thread.sleep(Thread.java:340)
   org.apache.lucene.store.RateLimiter$SimpleRateLimiter.pause(RateLimiter.java:153)
   org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.maybePause(RateLimitingInputStream.java:52)

Trying to understand how snapshots work under the hood. The issue is that the snapshot time keeps increasing: it has gone from ~15 minutes to ~3 hours, and it also tends to write a large backup file. We only keep the last 3 snapshots in the same repo.
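We can see the duration growing in the snapshot listing; this is just the standard list call against our repo:

    GET _snapshot/CaymanAuditQEDB/_all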

Snapshots are throttled to 40 MB per second by default, see https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html (setting max_snapshot_bytes_per_sec). I guess you are running into this limit, which is why the writes are paused. You can disable the limit by setting it to 0, or increase it, and then retry.
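For example, on a shared file system repository you can re-register it with a higher limit and its settings will be updated in place. The repository name and path below are placeholders, adjust them to your own repo:

    PUT _snapshot/my_fs_backup
    {
      "type": "fs",
      "settings": {
        "location": "/mount/backups/my_fs_backup",
        "max_snapshot_bytes_per_sec": "200mb"
      }
    }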

Thanks for the response. We are also seeing the error below on all successfully completed snapshots when I try GET _snapshot/CaymanAuditQEDB/<any_snapshot_name>/_status.
We were seeing the same error while trying to delete old snapshots too. And although the snapshots report as completed, upon restoring them the cluster changes to red with the error below:

[2019-06-07T13:38:25.674Z][INFO][org.elasticsearch.cluster.routing.allocation.AllocationService][c732uyu-master-9200] Cluster health status changed from [YELLOW] to [RED] (reason: [async_shard_fetch])

{
  "error": {
    "root_cause": [
      {
        "type": "index_shard_restore_failed_exception",
        "reason": "failed to read shard snapshot file",
        "index_uuid": "210psLqQS6mNFZN-AzAhYA",
        "shard": "5",
        "index": "audit_qedshared_5519fbef53402c50e10080000acd0255_aa50565206214c59beed18819f3db24b-000001"
      }
    ],
    "type": "index_shard_restore_failed_exception",
    "reason": "failed to read shard snapshot file",
    "index_uuid": "210psLqQS6mNFZN-AzAhYA",
    "shard": "5",
    "index": "audit_qedshared_5519fbef53402c50e10080000acd0255_aa50565206214c59beed18819f3db24b-000001",
    "caused_by": {
      "type": "i_o_exception",
      "reason": "org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=958125580 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(ChecksumBlobStoreFormat.readBlob(blob="snap-eK0kNVeHSPm91PsWzrXbrg.dat")))",
      "caused_by": {
        "type": "corrupt_index_exception",
        "reason": "codec footer mismatch (file truncated?): actual footer=958125580 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(ChecksumBlobStoreFormat.readBlob(blob="snap-eK0kNVeHSPm91PsWzrXbrg.dat")))"
      }
    }
  },
  "status": 500
}

This index looks healthy though.
Any pointers?

Please take your time to properly format your messages with markdown; this makes them 10x easier to read. Thanks!

It seems that a file has been corrupted on the snapshot disk or has not been properly written. Is this happening with all of your snapshots? Also what Elasticsearch version are you using and what type of snapshot repo is this?
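It might also be worth verifying that every node can read and write the repository location, e.g. with the verify repository API (note this only checks access to the repo path, it does not validate blobs that were already written):

    POST _snapshot/CaymanAuditQEDB/_verify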

Yes, this is happening with all my snapshots. The ES version is 5.1.2 and it's a shared file system repository. During the snapshot process we don't see any errors/warnings, but when we issue a delete snapshot we see "codec footer mismatch (file truncated?)" in the ES log.
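The delete itself is just the standard call, e.g. (the snapshot name is a placeholder):

    DELETE _snapshot/CaymanAuditQEDB/<old_snapshot_name>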

Do you have any chance to try this with the latest 5.x release? A quick GitHub PR search uncovered a few snapshot/restore issues that have since been fixed.

Also, can you possibly snapshot to a different filesystem to make sure there is no hard disk corruption?
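For example, register a second FS repository on a different disk or mount and take a test snapshot there; the names and path below are placeholders:

    PUT _snapshot/test_repo
    {
      "type": "fs",
      "settings": {
        "location": "/mnt/other_disk/es_test_repo"
      }
    }

    PUT _snapshot/test_repo/test_snapshot_1?wait_for_completion=true
    GET _snapshot/test_repo/test_snapshot_1/_status

Remember that the new location also has to be listed under path.repo in elasticsearch.yml on every node.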

We cannot change the ES version currently, but we will try to take a snapshot to a different filesystem.

A question about the "codec footer mismatch (file truncated?)" error: I have noticed that all these warnings are coming from a specific index and shard. Do you think it could be something wrong with the index? We can search this index just fine though.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.