High CPU utilization on master node

I'm operating an Elasticsearch cluster using the ECK operator.

Recently, we have observed high CPU utilization on the master nodes. Here is the topology of the cluster:

  • 3 master nodes
  • 3 data nodes

When I run the nodes hot threads API, I see azure_event_loop and snapshot threads in the output. I have temporarily disabled snapshots through SLM, and CPU utilization has come down. But I still see snapshot and Azure repository frames in the hot threads output.

Why is it still calling those methods? We need to enable snapshots again, but I'm worried the CPU will spike again once the SLM policy is re-enabled.
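
For reference, a rough sketch of the calls involved (not necessarily verbatim what I ran):

GET _nodes/hot_threads?threads=3

# pause scheduled SLM snapshots temporarily
POST _slm/stop

# resume them later
POST _slm/start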

::: {eck-es-master-2}{y5ikZZu6RVOZAnKW9Xmo-Q}{E30cyqgKTN2yHHahrTQ44Q}{10.23.2.11}{10.23.2.11:9300}{m}{k8s_node_name=, xpack.installed=true, transform.node=false}
   Hot threads at 2021-11-24T17:17:26.436Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
    0.6% (2.9ms out of 500ms) cpu usage by thread 'elasticsearch[eck-es-master-2][azure_event_loop][T#1]'
     2/10 snapshots sharing following 14 elements
       io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
       io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
       org.elasticsearch.repositories.azure.executors.PrivilegedExecutor$$Lambda$6337/0x0000000801b37770.executeCouldThrow(Unknown Source)
       org.elasticsearch.repositories.azure.SocketAccess.lambda$doPrivilegedVoidException$0(SocketAccess.java:46)
       org.elasticsearch.repositories.azure.SocketAccess$$Lambda$6269/0x0000000801af60e8.run(Unknown Source)
       java.base@16/java.security.AccessController.executePrivileged(AccessController.java:784)
       java.base@16/java.security.AccessController.doPrivileged(AccessController.java:554)
       org.elasticsearch.repositories.azure.SocketAccess.doPrivilegedVoidException(SocketAccess.java:45)
       org.elasticsearch.repositories.azure.executors.PrivilegedExecutor.lambda$execute$0(PrivilegedExecutor.java:27)
       org.elasticsearch.repositories.azure.executors.PrivilegedExecutor$$Lambda$6336/0x0000000801b36ee8.run(Unknown Source)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673)
       java.base@16/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@16/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@16/java.lang.Thread.run(Thread.java:831)
   
    0.2% (1.1ms out of 500ms) cpu usage by thread 'elasticsearch[eck-es-master-2][snapshot][T#1]'
     10/10 snapshots sharing following 25 elements
       java.base@16/jdk.internal.misc.Unsafe.park(Native Method)
       java.base@16/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
       java.base@16/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:714)
       java.base@16/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1046)
       java.base@16/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
       reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:87)
       reactor.core.publisher.Flux.blockLast(Flux.java:2497)
       com.azure.core.util.paging.ContinuablePagedByIteratorBase.requestPage(ContinuablePagedByIteratorBase.java:94)
       com.azure.core.util.paging.ContinuablePagedByIteratorBase.hasNext(ContinuablePagedByIteratorBase.java:53)
       org.elasticsearch.repositories.azure.AzureBlobStore.lambda$listBlobsByPrefix$15(AzureBlobStore.java:330)
       org.elasticsearch.repositories.azure.AzureBlobStore$$Lambda$6401/0x0000000801bb9528.executeCouldThrow(Unknown Source)
       org.elasticsearch.repositories.azure.SocketAccess.lambda$doPrivilegedVoidException$0(SocketAccess.java:46)
       org.elasticsearch.repositories.azure.SocketAccess$$Lambda$6269/0x0000000801af60e8.run(Unknown Source)
       java.base@16/java.security.AccessController.executePrivileged(AccessController.java:784)
       java.base@16/java.security.AccessController.doPrivileged(AccessController.java:554)
       org.elasticsearch.repositories.azure.SocketAccess.doPrivilegedVoidException(SocketAccess.java:45)
       org.elasticsearch.repositories.azure.AzureBlobStore.listBlobsByPrefix(AzureBlobStore.java:323)
       org.elasticsearch.repositories.azure.AzureBlobContainer.listBlobsByPrefix(AzureBlobContainer.java:120)
       org.elasticsearch.repositories.azure.AzureBlobContainer.listBlobs(AzureBlobContainer.java:126)
       app//org.elasticsearch.repositories.blobstore.BlobStoreRepository$3.doRun(BlobStoreRepository.java:872)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@16/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@16/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@16/java.lang.Thread.run(Thread.java:831)

Hi @guptaparv-rlr, what version are you using? There have been some improvements to Azure repository handling in recent versions.


Elasticsearch version: 7.13.2
ECK version: elastic-cloud-eck.v1.7.1

And the Azure plugin is installed via an init container:

initContainers:
- name: configure-plugins
  command: ["sh", "-c", "bin/elasticsearch-plugin install --batch repository-azure"]

And the snapshots have been taking too long.

There isn't much data in the cluster either. Here is the output of GET /_cat/indices?v&pretty&s=index:

health status index                           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .apm-agent-configuration        6o2soBFRQP6X4niBsZR2qA   1   1          0            0       522b           261b
green  open   .apm-custom-link                slF1w3pTQw2acSD6MCLIZA   1   1          0            0       522b           261b
green  open   .async-search                   _0QMH_tMTESjvVHbrcHKzQ   1   1          0            0       518b           259b
green  open   .kibana-event-log-7.13.2-000001 7iDQAdYPTyW_QNr6bzgBNg   1   1          2            0       22kb           11kb
green  open   .kibana-event-log-7.13.2-000002 lE1tkgCyTOGZtLoa7LynMw   1   1          1            0     11.2kb          5.6kb
green  open   .kibana-event-log-7.13.2-000003 08b3DaSmS-ORhPdlPjIBoA   1   1          1            0     11.2kb          5.6kb
green  open   .kibana_7.13.2_001              CM6KvtarQcW_kXb2fwWtGQ   1   1         74           15      4.2mb          2.1mb
green  open   .kibana_task_manager_7.13.2_001 fdS80QPsRQS9b8TVUhk6Qg   1   1         11        17074      3.3mb          1.6mb
green  open   .security-7                     19-PAfNySbqmW0m7KUgBFw   1   1         53            3    374.1kb          187kb
green  open   .tasks                          HNozXHx_S9KTopNBb8t8lQ   1   1          6            0     58.2kb           35kb
green  open   device_logs-000011              _flw7a2GTRiQOLS-vu9xIA   1   1     304094            0    112.2mb         56.1mb
green  open   device_logs-000012              W0zugfPlTpqhSauoGriKGg   1   1     406199            0    166.2mb         83.1mb
green  open   device_logs-000013              FCceQTqYRIqGQY6VbeXWPg   1   1     333555            0    138.5mb         69.2mb
green  open   device_logs-000014              PJtbji_YRrukUeaxda_Yog   1   1     519957            0    202.4mb        101.2mb
green  open   device_logs-000015              WBOgtjPKRuuyaVaS33H-dA   1   1     454628            0    196.4mb         98.2mb
green  open   device_logs-000016              Y0GwLAuzQKCk-GrJCOC5WA   1   1     839464            0    231.5mb        115.7mb
green  open   device_logs-000017              P2cBIHo9TgW1ziJssswDtw   1   1     342169            0    147.7mb         73.8mb
green  open   device_logs-000018              H9X4TXcVR_65zKcd1Yix-Q   1   1     652004            0    361.7mb        180.8mb
green  open   device_logs-000019              5Lm9cpyMT2WOKcgJzr8ILQ   1   1     619929            0    251.9mb        125.9mb
green  open   device_logs-000020              1eWj5dxNTXunaJ49D_3EUA   1   1    1117321            0    363.5mb        181.7mb
green  open   device_logs-000021              tEM5A31tS7iYLo_iO18K_Q   1   1     466726            0    192.7mb         96.3mb
green  open   device_logs-000022              s0xmeD65R5qCOr7wSGkFlg   1   1     518821            0    233.4mb        116.7mb
green  open   device_logs-000023              b2sdsgKxQ2iroW3NXuIExQ   1   1    2043565            0    615.4mb        307.7mb
green  open   device_logs-000024              v0XlABPkRg-uaDQ3BUftKg   1   1     471344            0    213.2mb        106.8mb
green  open   device_logs-000025              -xGaPw9HSvOFlgTguxu7Ew   1   1    1649097            0    784.3mb        362.4mb

Ok the big Azure upgrade was in 7.11, although there was another one in 7.14 that you're currently missing.

However the hot threads output you shared looks fine: the hottest thread is only using 0.6% CPU. What evidence are you seeing that Elasticsearch is the thing that's consuming all your CPU?

Thanks @DavidTurner for the release notes. I will check and see if the upgrade helps and report back.

On the high CPU usage: the master nodes (especially the elected one) started crossing the CPU limit (1500m) set on the pod. This would cause the pod to restart, and then the same thing would happen to another pod later on.

Unfortunately, I didn't capture hot threads before disabling the SLM policy. But since disabling SLM, CPU usage has been minimal, only 20m to 50m.

Since we don't have data from when it was high, my suspicion is that the snapshots were taking too long and causing the high CPU. The snapshot durations are very high: the last snapshot took 11+ hours, and the SLM cron is set to every 8 hours.
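
For reference, I'm reading those durations from the snapshot listing, roughly like this (azurebackup is our repository name):

GET _cat/snapshots/azurebackup?v&h=id,status,start_time,end_time,duration&s=end_epoch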

Is the snapshot process heavy on CPU?

Not usually, but maybe there's something special about your case. I don't think there's much we can do to help without seeing outputs (hot threads etc) from the cluster while it's experiencing the problem you describe.

Hi @DavidTurner, as per your suggestion I upgraded the cluster to 7.14, but I'm not seeing any improvement in the snapshot duration. The CPU no longer spikes much, only a little on the data nodes, with a maximum of 0.4m.

I'm attaching a few screenshots of CPU consumption on both the master and data nodes.
Data node:

Master node:

But the snapshot duration is still more than 10 minutes.

Attaching some hot threads output:

::: {eck-es-data-1}{Um5J-w5fSgemzzKwhzC66g}{5aQ4zpfgRoiMb90q71TGow}{10.21.1.39}{10.21.1.39:9300}{di}{k8s_node_name=aks-es-17300195-vmss00000v, xpack.installed=true, transform.node=false}
   Hot threads at 2021-12-02T09:05:01.706Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
    8.1% (40.4ms out of 500ms) cpu usage by thread 'elasticsearch[eck-es-data-1][snapshot][T#1]'
     10/10 snapshots sharing following 18 elements
       org.elasticsearch.repositories.azure.AzureBlobStore$$Lambda$7884/0x0000000801e0b408.executeCouldThrow(Unknown Source)
       org.elasticsearch.repositories.azure.SocketAccess.lambda$doPrivilegedVoidException$0(SocketAccess.java:46)
       org.elasticsearch.repositories.azure.SocketAccess$$Lambda$7762/0x0000000801d67270.run(Unknown Source)
       java.base@16.0.1/java.security.AccessController.executePrivileged(AccessController.java:784)
       java.base@16.0.1/java.security.AccessController.doPrivileged(AccessController.java:554)
       org.elasticsearch.repositories.azure.SocketAccess.doPrivilegedVoidException(SocketAccess.java:45)
       org.elasticsearch.repositories.azure.AzureBlobStore.executeSingleUpload(AzureBlobStore.java:463)
       org.elasticsearch.repositories.azure.AzureBlobStore.writeBlob(AzureBlobStore.java:445)
       org.elasticsearch.repositories.azure.AzureBlobContainer.writeBlob(AzureBlobContainer.java:91)
       app//org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotFile(BlobStoreRepository.java:3357)
       app//org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$executeOneFileSnapshot$78(BlobStoreRepository.java:2845)
       app//org.elasticsearch.repositories.blobstore.BlobStoreRepository$$Lambda$7906/0x0000000801e1c000.accept(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@16.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@16.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@16.0.1/java.lang.Thread.run(Thread.java:831)
   
    5.8% (29ms out of 500ms) cpu usage by thread 'elasticsearch[eck-es-data-1][azure_event_loop][T#1]'
     10/10 snapshots sharing following 14 elements
       io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
       io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
       org.elasticsearch.repositories.azure.executors.PrivilegedExecutor$$Lambda$7830/0x0000000801dadb60.executeCouldThrow(Unknown Source)
       org.elasticsearch.repositories.azure.SocketAccess.lambda$doPrivilegedVoidException$0(SocketAccess.java:46)
       org.elasticsearch.repositories.azure.SocketAccess$$Lambda$7762/0x0000000801d67270.run(Unknown Source)
       java.base@16.0.1/java.security.AccessController.executePrivileged(AccessController.java:784)
       java.base@16.0.1/java.security.AccessController.doPrivileged(AccessController.java:554)
       org.elasticsearch.repositories.azure.SocketAccess.doPrivilegedVoidException(SocketAccess.java:45)
       org.elasticsearch.repositories.azure.executors.PrivilegedExecutor.lambda$execute$0(PrivilegedExecutor.java:27)
       org.elasticsearch.repositories.azure.executors.PrivilegedExecutor$$Lambda$7829/0x0000000801dad2d8.run(Unknown Source)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673)
       java.base@16.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@16.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@16.0.1/java.lang.Thread.run(Thread.java:831)
   
    4.8% (24ms out of 500ms) cpu usage by thread 'elasticsearch[eck-es-data-1][snapshot][T#4]'
     9/10 snapshots sharing following 26 elements
       java.base@16.0.1/jdk.internal.misc.Unsafe.park(Native Method)
       java.base@16.0.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
       java.base@16.0.1/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:714)
       java.base@16.0.1/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1046)
       java.base@16.0.1/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
       reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:87)
       reactor.core.publisher.Mono.block(Mono.java:1680)
       org.elasticsearch.repositories.azure.AzureBlobStore.lambda$executeSingleUpload$19(AzureBlobStore.java:475)
       org.elasticsearch.repositories.azure.AzureBlobStore$$Lambda$7884/0x0000000801e0b408.executeCouldThrow(Unknown Source)
       org.elasticsearch.repositories.azure.SocketAccess.lambda$doPrivilegedVoidException$0(SocketAccess.java:46)
       org.elasticsearch.repositories.azure.SocketAccess$$Lambda$7762/0x0000000801d67270.run(Unknown Source)
       java.base@16.0.1/java.security.AccessController.executePrivileged(AccessController.java:784)
       java.base@16.0.1/java.security.AccessController.doPrivileged(AccessController.java:554)
       org.elasticsearch.repositories.azure.SocketAccess.doPrivilegedVoidException(SocketAccess.java:45)
       org.elasticsearch.repositories.azure.AzureBlobStore.executeSingleUpload(AzureBlobStore.java:463)
       org.elasticsearch.repositories.azure.AzureBlobStore.writeBlob(AzureBlobStore.java:445)
       org.elasticsearch.repositories.azure.AzureBlobContainer.writeBlob(AzureBlobContainer.java:91)
       app//org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotFile(BlobStoreRepository.java:3357)
       app//org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$executeOneFileSnapshot$78(BlobStoreRepository.java:2845)
       app//org.elasticsearch.repositories.blobstore.BlobStoreRepository$$Lambda$7906/0x0000000801e1c000.accept(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@16.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@16.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@16.0.1/java.lang.Thread.run(Thread.java:831)
     unique snapshot
       java.base@16.0.1/java.lang.StringLatin1.hashCode(StringLatin1.java:196)
       java.base@16.0.1/java.lang.String.hashCode(String.java:1542)
       java.base@16.0.1/java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
       java.base@16.0.1/java.security.BasicPermissionCollection.implies(BasicPermission.java:424)
       java.base@16.0.1/java.security.Permissions.implies(Permissions.java:177)
       java.base@16.0.1/sun.security.provider.PolicyFile.implies(PolicyFile.java:994)
       java.base@16.0.1/sun.security.provider.PolicySpiFile.engineImplies(PolicySpiFile.java:75)
       java.base@16.0.1/java.security.Policy$PolicyDelegate.implies(Policy.java:801)
       app//org.elasticsearch.bootstrap.ESPolicy.implies(ESPolicy.java:75)
       java.base@16.0.1/java.security.ProtectionDomain.implies(ProtectionDomain.java:321)
       java.base@16.0.1/java.security.ProtectionDomain.impliesWithAltFilePerm(ProtectionDomain.java:353)
       java.base@16.0.1/java.security.AccessControlContext.checkPermission(AccessControlContext.java:450)
       java.base@16.0.1/java.security.AccessController.checkPermission(AccessController.java:1036)
       java.base@16.0.1/java.lang.SecurityManager.checkPermission(SecurityManager.java:408)
       java.base@16.0.1/java.lang.System.getenv(System.java:1016)
       com.azure.core.util.Configuration.loadFromEnvironment(Configuration.java:260)
       com.azure.core.util.Configuration.load(Configuration.java:256)
       com.azure.core.util.Configuration.getOrLoad(Configuration.java:232)
       com.azure.core.util.Configuration.get(Configuration.java:182)
       com.azure.core.implementation.logging.DefaultLogger.<init>(DefaultLogger.java:67)
       com.azure.core.util.logging.ClientLogger.<init>(ClientLogger.java:59)
       com.azure.core.util.logging.ClientLogger.<init>(ClientLogger.java:48)
       com.azure.core.http.rest.RestProxy.<init>(RestProxy.java:66)
       com.azure.core.http.rest.RestProxy.create(RestProxy.java:645)
       com.azure.core.http.rest.RestProxy.create(RestProxy.java:630)
       com.azure.storage.blob.implementation.PageBlobsImpl.<init>(PageBlobsImpl.java:68)
       com.azure.storage.blob.implementation.AzureBlobStorageImpl.<init>(AzureBlobStorageImpl.java:220)
       com.azure.storage.blob.implementation.AzureBlobStorageBuilder.build(AzureBlobStorageBuilder.java:93)
       com.azure.storage.blob.specialized.BlobAsyncClientBase.<init>(BlobAsyncClientBase.java:201)
       com.azure.storage.blob.specialized.BlobAsyncClientBase.<init>(BlobAsyncClientBase.java:163)
       com.azure.storage.blob.BlobAsyncClient.<init>(BlobAsyncClient.java:136)
       com.azure.storage.blob.BlobContainerAsyncClient.getBlobAsyncClient(BlobContainerAsyncClient.java:171)
       com.azure.storage.blob.BlobContainerAsyncClient.getBlobAsyncClient(BlobContainerAsyncClient.java:152)
       org.elasticsearch.repositories.azure.AzureBlobStore.lambda$executeSingleUpload$19(AzureBlobStore.java:466)
       org.elasticsearch.repositories.azure.AzureBlobStore$$Lambda$7884/0x0000000801e0b408.executeCouldThrow(Unknown Source)
       org.elasticsearch.repositories.azure.SocketAccess.lambda$doPrivilegedVoidException$0(SocketAccess.java:46)
       org.elasticsearch.repositories.azure.SocketAccess$$Lambda$7762/0x0000000801d67270.run(Unknown Source)
       java.base@16.0.1/java.security.AccessController.executePrivileged(AccessController.java:784)
       java.base@16.0.1/java.security.AccessController.doPrivileged(AccessController.java:554)
       org.elasticsearch.repositories.azure.SocketAccess.doPrivilegedVoidException(SocketAccess.java:45)
       org.elasticsearch.repositories.azure.AzureBlobStore.executeSingleUpload(AzureBlobStore.java:463)
       org.elasticsearch.repositories.azure.AzureBlobStore.writeBlob(AzureBlobStore.java:445)
       org.elasticsearch.repositories.azure.AzureBlobContainer.writeBlob(AzureBlobContainer.java:91)
       app//org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotFile(BlobStoreRepository.java:3357)
       app//org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$executeOneFileSnapshot$78(BlobStoreRepository.java:2845)
       app//org.elasticsearch.repositories.blobstore.BlobStoreRepository$$Lambda$7906/0x0000000801e1c000.accept(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@16.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@16.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@16.0.1/java.lang.Thread.run(Thread.java:831)

The hottest thread is consuming 8.1% CPU, which seems pretty low to me. 20 minutes to complete a daily snapshot seems OK too. Could you help me understand what problem you're trying to solve?

Initially, it was high CPU utilization during snapshots, but that looks stable right now.

My other question is about the snapshot duration. We don't have a lot of data in the cluster, so 20 minutes seems too long.

Here are the indices that are getting backed up. These numbers are for a non-prod environment; in production the duration is far higher, 11 to 15 hours (with less than 500 MB of data in total).

health status index                           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   device_logs-000001              MPBeXxMsT8mqzKMt0VuWlQ   1   1      52654            0     25.2mb         12.6mb
green  open   device_logs-000002              ziE4DgNrTH2pivVOVKmKog   1   1      52381            0     24.8mb         12.4mb
green  open   device_logs-000003              8cpnS6rlThCCa8kF7V0GtQ   1   1      53019            0     25.9mb         12.9mb
green  open   device_logs-000004              fxsEesBfRm2N0pZM62072w   1   1      39634            0     19.3mb          9.6mb
green  open   device_logs-000005              dYFW31wXQYqFV7t8K-0c9w   1   1      37060            0     17.6mb          8.8mb
green  open   device_logs-000006              z_jkTiLUQw6OfBTHyiWvqQ   1   1      37171            0     17.7mb          8.8mb
green  open   device_logs-000007              rwplFiCBQte58FeIBnVuRw   1   1      37219            0     17.7mb          8.8mb
green  open   device_logs-000008              KC66nGeoQiGxSnjadmeI4Q   1   1      37287            0     17.8mb          8.8mb
green  open   device_logs-000009              cYtO7kSiT3igvkCg8uyqtQ   1   1      26286            0     15.8mb          8.1mb
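
If per-shard timings would help, I can also pull the snapshot status output, something like this (the snapshot name here is just a placeholder):

GET _snapshot/azurebackup/<snapshot_name>/_status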

Ok 20 minutes to snapshot ~300MB is pretty slow, but I expect the limiting factor is just how fast you can interact with the repository and not anything in Elasticsearch itself. Try the Repository analysis API to get a better understanding of its performance characteristics.

Here is the response from the repository analysis API.

POST http://{{hostname}}.{{env}}/_snapshot/azurebackup/_analyze?blob_count=10&max_blob_size=100mb&timeout=120s

{
    "coordinating_node": {
        "id": "Um5J-w5fSgemzzKwhzC66g",
        "name": "eck-device-logs-es-data-1"
    },
    "repository": "azurebackup",
    "blob_count": 10,
    "concurrency": 10,
    "read_node_count": 10,
    "early_read_node_count": 2,
    "max_blob_size": "100mb",
    "max_blob_size_bytes": 104857600,
    "max_total_data_size": "1gb",
    "max_total_data_size_bytes": 1073741824,
    "seed": 2400969067,
    "rare_action_probability": 0.02,
    "blob_path": "temp-analysis-KO9dXHXTS8-K5khePouExA",
    "issues_detected": [],
    "summary": {
        "write": {
            "count": 10,
            "total_size": "327.5mb",
            "total_size_bytes": 343408640,
            "total_throttled": "3.25s",
            "total_throttled_nanos": 3259324604,
            "total_elapsed": "9.26s",
            "total_elapsed_nanos": 9262147182
        },
        "read": {
            "count": 52,
            "total_size": "760.2mb",
            "total_size_bytes": 797211417,
            "total_wait": "369.91micros",
            "total_wait_nanos": 369913,
            "max_wait": "46.5micros",
            "max_wait_nanos": 46503,
            "total_throttled": "19.21s",
            "total_throttled_nanos": 19210060779,
            "total_elapsed": "22.2s",
            "total_elapsed_nanos": 22207198251
        }
    },
    "listing_elapsed": "107.46ms",
    "listing_elapsed_nanos": 107465374,
    "delete_elapsed": "159.25ms",
    "delete_elapsed_nanos": 159251561
}

Hmm, that looks reasonable (although blob_count=10&max_blob_size=100mb is a very gentle test). Maybe try setting logger.org.elasticsearch.repositories: TRACE and logger.org.elasticsearch.snapshots: TRACE; I think this will give a complete picture of the snapshot process.
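
For example, a somewhat heavier analysis run might look like this (illustrative numbers only; adjust to what your repository can tolerate):

POST /_snapshot/azurebackup/_analyze?blob_count=100&max_blob_size=1gb&max_total_data_size=10gb&timeout=30m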

Here is a sample of the TRACE logs. Hope this helps. Or, if you can share your email address, I can send you the full log.

{"type": "server", "timestamp": "2021-12-02T12:14:10,047Z", "level": "TRACE", "component": "o.e.r.b.t.BlobAnalyzeAction", "cluster.name": "eck", "node.name": "eck-es-master-1", "message": "handling [BlobAnalyzeAction.Request{blob analysis [azurebackup:temp-analysis-ahQf8y7fQ42sroQATCdTGQ/test-blob-4-wlJqMGoBSvWR41js-2QpFg, length=67108864, seed=7609597446298085786, readEarly=false, writeAndOverwrite=false, abortWrite=true]}]", "cluster.uuid": "NYGS85kwQAO4m1cpEpjU1g", "node.id": "yHehmN5jSlK7_YWh2P3eOA"  }
{"type": "server", "timestamp": "2021-12-02T12:14:10,047Z", "level": "TRACE", "component": "o.e.r.b.t.BlobAnalyzeAction", "cluster.name": "eck", "node.name": "eck-es-master-1", "message": "writing blob [atomic=true, failIfExists=true] for [blob analysis [azurebackup:temp-analysis-ahQf8y7fQ42sroQATCdTGQ/test-blob-4-wlJqMGoBSvWR41js-2QpFg, length=67108864, seed=7609597446298085786, readEarly=false, writeAndOverwrite=false, abortWrite=true]]", "cluster.uuid": "NYGS85kwQAO4m1cpEpjU1g", "node.id": "yHehmN5jSlK7_YWh2P3eOA"  }
{"type": "server", "timestamp": "2021-12-02T12:14:10,049Z", "level": "TRACE", "component": "o.e.r.b.t.BlobAnalyzeAction", "cluster.name": "eck", "node.name": "eck-es-master-1", "message": "finished writing blob for [blob analysis [azurebackup:temp-analysis-ahQf8y7fQ42sroQATCdTGQ/test-blob-4-wlJqMGoBSvWR41js-2QpFg, length=67108864, seed=7609597446298085786, readEarly=false, writeAndOverwrite=false, abortWrite=true]], got checksum [3530970182]", "cluster.uuid": "NYGS85kwQAO4m1cpEpjU1g", "node.id": "yHehmN5jSlK7_YWh2P3eOA"  }
{"type": "server", "timestamp": "2021-12-02T12:14:10,049Z", "level": "TRACE", "component": "o.e.r.b.t.BlobAnalyzeAction", "cluster.name": "eck", "node.name": "eck-es-master-1", "message": "sending read request to [[{eck-es-master-2}{Y7eRFeJkSi2X6gNkHO13Fg}{dR6P9AvzQzWal3zG8S23BQ}{10.21.1.29}{10.21.1.29:9300}{m}{k8s_node_name=aks-es-17300195-vmss00000y, xpack.installed=true, transform.node=false}, {eck-es-master-1}{yHehmN5jSlK7_YWh2P3eOA}{f40S1Q00RBOUzDxue7HkIQ}{10.21.1.185}{10.21.1.185:9300}{m}{k8s_node_name=aks-es-17300195-vmss00000z, xpack.installed=true, transform.node=false}, {eck-es-data-0}{DSHuZnJ2SsuFX50M-wTsyQ}{x-2PbTFnRKqE60aHqMifEw}{10.21.0.183}{10.21.0.183:9300}{di}{k8s_node_name=aks-es-17300195-vmss00000s, xpack.installed=true, transform.node=false}, {eck-es-master-0}{NqnAlbmwQ6KmbkC0DEfdFw}{i1VnXIO9TcCXljUMhYbjzQ}{10.21.0.253}{10.21.0.253:9300}{m}{k8s_node_name=aks-es-17300195-vmss000010, xpack.installed=true, transform.node=false}, {eck-es-data-1}{Um5J-w5fSgemzzKwhzC66g}{5aQ4zpfgRoiMb90q71TGow}{10.21.1.39}{10.21.1.39:9300}{di}{k8s_node_name=aks-es-17300195-vmss00000v, xpack.installed=true, transform.node=false}]] for [blob analysis [azurebackup:temp-analysis-ahQf8y7fQ42sroQATCdTGQ/test-blob-4-wlJqMGoBSvWR41js-2QpFg, length=67108864, seed=7609597446298085786, readEarly=false, writeAndOverwrite=false, abortWrite=true]] after write complete", "cluster.uuid": "NYGS85kwQAO4m1cpEpjU1g", "node.id": "yHehmN5jSlK7_YWh2P3eOA"  }
{"type": "server", "timestamp": "2021-12-02T12:14:10,050Z", "level": "TRACE", "component": "o.e.r.b.t.BlobAnalyzeAction", "cluster.name": "eck", "node.name": "eck-es-master-1", "message": "handling [BlobAnalyzeAction.Request{blob analysis [azurebackup:temp-analysis-ahQf8y7fQ42sroQATCdTGQ/test-blob-9-GbbDBdVqS3uzxTAwoVU2qA, length=16777216, seed=4248370547091748351, readEarly=false, writeAndOverwrite=false, abortWrite=false]}]", "cluster.uuid": "NYGS85kwQAO4m1cpEpjU1g", "node.id": "yHehmN5jSlK7_YWh2P3eOA"  }
{"type": "server", "timestamp": "2021-12-02T12:14:10,050Z", "level": "TRACE", "component": "o.e.r.b.t.BlobAnalyzeAction", "cluster.name": "eck", "node.name": "eck-es-master-1", "message": "writing blob [atomic=false, failIfExists=true] for [blob analysis [azurebackup:temp-analysis-ahQf8y7fQ42sroQATCdTGQ/test-blob-9-GbbDBdVqS3uzxTAwoVU2qA, length=16777216, seed=4248370547091748351, readEarly=false, writeAndOverwrite=false, abortWrite=false]]", "cluster.uuid": "NYGS85kwQAO4m1cpEpjU1g", "node.id": "yHehmN5jSlK7_YWh2P3eOA"  }
{"type": "server", "timestamp": "2021-12-02T12:14:10,061Z", "level": "TRACE", "component": "o.e.r.b.t.BlobAnalyzeAction", "cluster.name": "eck", "node.name": "eck-es-master-1", "message": "received read response [GetRepositoryBlobChecksumResponse{bytesRead=0, checksum=0, firstByteNanos=0, elapsedNanos=0, throttleNanos=0}] from [{eck-es-data-1}{Um5J-w5fSgemzzKwhzC66g}{5aQ4zpfgRoiMb90q71TGow}{10.21.1.39}{10.21.1.39:9300}{di}{k8s_node_name=aks-es-17300195-vmss00000v, xpack.installed=true, transform.node=false}] for [blob analysis [azurebackup:temp-analysis-ahQf8y7fQ42sroQATCdTGQ/test-blob-4-wlJqMGoBSvWR41js-2QpFg, length=67108864, seed=7609597446298085786, readEarly=false, writeAndOverwrite=false, abortWrite=true]] [beforeWriteComplete=false]", "cluster.uuid": "NYGS85kwQAO4m1cpEpjU1g", "node.id": "yHehmN5jSlK7_YWh2P3eOA"  }
{"type": "server", "timestamp": "2021-12-02T12:14:10,062Z", "level": "TRACE", "component": "o.e.r.b.t.BlobAnalyzeAction", "cluster.name": "eck", "node.name": "eck-es-master-1", "message": "received read response [GetRepositoryBlobChecksumResponse{bytesRead=65100, checksum=514628367, firstByteNanos=21801, elapsedNanos=417311, throttleNanos=0}] from [{eck-es-data-1}{Um5J-w5fSgemzzKwhzC66g}{5aQ4zpfgRoiMb90q71TGow}{10.21.1.39}{10.21.1.39:9300}{di}{k8s_node_name=aks-es-17300195-vmss00000v, xpack.installed=true, transform.node=false}] for [blob analysis [azurebackup:temp-analysis-ahQf8y7fQ42sroQATCdTGQ/test-blob-1--I3uT29nS2yJlT7shgBZzg, length=4194304, seed=-8257268685809685692, readEarly=false, writeAndOverwrite=false, abortWrite=false]] [beforeWriteComplete=false]", "cluster.uuid": "NYGS85kwQAO4m1cpEpjU1g", "node.id": "yHehmN5jSlK7_YWh2P3eOA"  }
{"type": "server", "timestamp": "2021-12-02T12:14:10,063Z", "level": "TRACE", "component": "o.e.r.b.t.BlobAnalyzeAction", "cluster.name": "eck", "node.name": "eck-es-master-1", "message": "received read response [GetRepositoryBlobChecksumResponse{bytesRead=65100, checksum=514628367, firstByteNanos=2800, elapsedNanos=123308, throttleNanos=0}] from [{eck-es-data-0}{DSHuZnJ2SsuFX50M-wTsyQ}{x-2PbTFnRKqE60aHqMifEw}{10.21.0.183}{10.21.0.183:9300}{di}{k8s_node_name=aks-es-17300195-vmss00000s, xpack.installed=true, transform.node=false}] for [blob analysis [azurebackup:temp-analysis-ahQf8y7fQ42sroQATCdTGQ/test-blob-1--I3uT29nS2yJlT7shgBZzg, length=4194304, seed=-8257268685809685692, readEarly=false, writeAndOverwrite=false, abortWrite=false]] [beforeWriteComplete=false]", "cluster.uuid": "NYGS85kwQAO4m1cpEpjU1g", "node.id": "yHehmN5jSlK7_YWh2P3eOA"  }
{"type": "server", "timestamp": "2021-12-02T12:14:10,073Z", "level": "TRACE", "component": "o.e.r.b.t.BlobAnalyzeAction", "cluster.name": "eck", "node.name": "eck-es-master-1", "message": "received read response [GetRepositoryBlobChecksumResponse{bytesRead=0, checksum=0, firstByteNanos=0, elapsedNanos=0, throttleNanos=0}] from [{eck-es-data-0}{DSHuZnJ2SsuFX50M-wTsyQ}{x-2PbTFnRKqE60aHqMifEw}{10.21.0.183}{10.21.0.183:9300}{di}{k8s_node_name=aks-es-17300195-vmss00000s, xpack.installed=true, transform.node=false}] for [blob analysis [azurebackup:temp-analysis-ahQf8y7fQ42sroQATCdTGQ/test-blob-4-wlJqMGoBSvWR41js-2QpFg, length=67108864, seed=7609597446298085786, readEarly=false, writeAndOverwrite=false, abortWrite=true]] [beforeWriteComplete=false]", "cluster.uuid": "NYGS85kwQAO4m1cpEpjU1g", "node.id": "yHehmN5jSlK7_YWh2P3eOA"  }
{"type": "server", "timestamp": "2021-12-02T12:14:10,333Z", "level": "TRACE", "component": "o.e.r.b.t.BlobAnalyzeAction", "cluster.name": "eck", "node.name": "eck-es-master-1", "message": "finished writing blob for [blob analysis [azurebackup:temp-analysis-ahQf8y7fQ42sroQATCdTGQ/test-blob-9-GbbDBdVqS3uzxTAwoVU2qA, length=16777216, seed=4248370547091748351, readEarly=false, writeAndOverwrite=false, abortWrite=false]], got checksum [4119483540]", "cluster.uuid": "NYGS85kwQAO4m1cpEpjU1g", "node.id": "yHehmN5jSlK7_YWh2P3eOA"  }
{"type": "server", "timestamp": "2021-12-02T12:14:10,334Z", "level": "TRACE", "component": "o.e.r.b.t.BlobAnalyzeAction", "cluster.name": "eck", "node.name": "eck-es-master-1", "message": "sending read request to [[{eck-es-data-0}{DSHuZnJ2SsuFX50M-wTsyQ}{x-2PbTFnRKqE60aHqMifEw}{10.21.0.183}{10.21.0.183:9300}{di}{k8s_node_name=aks-es-17300195-vmss00000s, xpack.installed=true, transform.node=false}, {eck-es-master-1}{yHehmN5jSlK7_YWh2P3eOA}{f40S1Q00RBOUzDxue7HkIQ}{10.21.1.185}{10.21.1.185:9300}{m}{k8s_node_name=aks-es-17300195-vmss00000z, xpack.installed=true, transform.node=false}, {eck-es-data-1}{Um5J-w5fSgemzzKwhzC66g}{5aQ4zpfgRoiMb90q71TGow}{10.21.1.39}{10.21.1.39:9300}{di}{k8s_node_name=aks-es-17300195-vmss00000v, xpack.installed=true, transform.node=false}, {eck-es-master-2}{Y7eRFeJkSi2X6gNkHO13Fg}{dR6P9AvzQzWal3zG8S23BQ}{10.21.1.29}{10.21.1.29:9300}{m}{k8s_node_name=aks-es-17300195-vmss00000y, xpack.installed=true, transform.node=false}, {eck-es-master-0}{NqnAlbmwQ6KmbkC0DEfdFw}{i1VnXIO9TcCXljUMhYbjzQ}{10.21.0.253}{10.21.0.253:9300}{m}{k8s_node_name=aks-es-17300195-vmss000010, xpack.installed=true, transform.node=false}]] for [blob analysis [azurebackup:temp-analysis-ahQf8y7fQ42sroQATCdTGQ/test-blob-9-GbbDBdVqS3uzxTAwoVU2qA, length=16777216, seed=4248370547091748351, readEarly=false, writeAndOverwrite=false, abortWrite=false]] after write complete", "cluster.uuid": "NYGS85kwQAO4m1cpEpjU1g", "node.id": "yHehmN5jSlK7_YWh2P3eOA"  }
{"type": "server", "timestamp": "2021-12-02T12:14:10,334Z", "level": "TRACE", "component": "o.e.r.b.t.GetBlobChecksumAction", "cluster.name": "eck", "node.name": "eck-es-master-1", "message": "handling [GetRepositoryBlobChecksumRequest{retrieve [337949-375340] from [azurebackup:temp-analysis-ahQf8y7fQ42sroQATCdTGQ/test-blob-0-UMJf4j2fROKC2Dy2l2VYPA]}]", "cluster.uuid": "NYGS85kwQAO4m1cpEpjU1g", "node.id": "yHehmN5jSlK7_YWh2P3eOA"  }
{"type": "server", "timestamp": "2021-12-02T12:14:10,334Z", "level": "TRACE", "component": "o.e.r.a.AzureBlobContainer", "cluster.name": "eck", "node.name": "eck-es-master-1", "message": "readBlob(test-blob-0-UMJf4j2fROKC2Dy2l2VYPA) from position [337949] with length [37391]", "cluster.uuid": "NYGS85kwQAO4m1cpEpjU1g", "node.id": "yHehmN5jSlK7_YWh2P3eOA"  }
{"type": "server", "timestamp": "2021-12-02T12:14:10,334Z", "level": "TRACE", "component": "o.e.r.a.AzureBlobStore", "cluster.name": "eck", "node.name": "eck-es-master-1", "message": "reading container ], blob [temp-analysis-ahQf8y7fQ42sroQATCdTGQ/test-blob-0-UMJf4j2fROKC2Dy2l2VYPA]", "cluster.uuid": "NYGS85kwQAO4m1cpEpjU1g", "node.id": "yHehmN5jSlK7_YWh2P3eOA"  }

You'll need to share logs for the whole snapshot operation - the repository analysis seems fast. Share them on gist.github.com or similar.

Here is the gist; this snapshot took 353s.

Yes, here's the start & end of the snapshot, 5'53" apart:

{"type": "server", "timestamp": "2021-12-02T12:27:29,002Z", "level": "INFO", "component": "o.e.s.SnapshotsService", "cluster.name": "eck", "node.name": "eck-es-master-1", "message": "snapshot [azurebackup:daily-snapshot-2021.12.02-bw85jb2-t96eaomckhf1xq/_Ca507mNShyh5qSFN9jsyg] started", "cluster.uuid": "NYGS85kwQAO4m1cpEpjU1g", "node.id": "yHehmN5jSlK7_YWh2P3eOA"  }
{"type": "server", "timestamp": "2021-12-02T12:33:22,656Z", "level": "INFO", "component": "o.e.s.SnapshotsService", "cluster.name": "eck", "node.name": "eck-es-master-1", "message": "snapshot [azurebackup:daily-snapshot-2021.12.02-bw85jb2-t96eaomckhf1xq/_Ca507mNShyh5qSFN9jsyg] completed with state [SUCCESS]", "cluster.uuid": "NYGS85kwQAO4m1cpEpjU1g", "node.id": "yHehmN5jSlK7_YWh2P3eOA"  }

The time was all spent waiting for this message:

{"type": "server", "timestamp": "2021-12-02T12:33:21,751Z", "level": "TRACE", "component": "o.e.s.SnapshotsService", "cluster.name": "eck", "node.name": "eck-es-master-1", "message": "received updated snapshot restore state [ShardSnapshotUpdate{snapshot=azurebackup:daily-snapshot-2021.12.02-bw85jb2-t96eaomckhf1xq/_Ca507mNShyh5qSFN9jsyg, shardId=[device_logs-000009][0], repoShardId=null, updatedState=ShardSnapshotStatus[state=SUCCESS, nodeId=Um5J-w5fSgemzzKwhzC66g, reason=null, generation=Lb0ZY79aSF2DZBerehLCGA, shardSnapshotResult=ShardSnapshotResult{generation='Lb0ZY79aSF2DZBerehLCGA', size=6.8mb, segmentCount=2}]}]", "cluster.uuid": "NYGS85kwQAO4m1cpEpjU1g", "node.id": "yHehmN5jSlK7_YWh2P3eOA"  }

So the question is which node is Um5J-w5fSgemzzKwhzC66g and what was it doing between 2021-12-02T12:27:33,202Z and 2021-12-02T12:33:21,751Z?
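
If you need to map that node ID back to a name, something like this should do it:

GET _cat/nodes?v&full_id=true&h=id,name,ip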

It is the data node where the snapshot runs. The high CPU readings and API responses I shared in previous posts are from that same node.

I captured the logs during the snapshot process, but they don't contain much.

{"type": "server", "timestamp": "2021-12-02T12:33:22,511Z", "level": "DEBUG", "component": "o.e.r.b.BlobStoreRepository", "cluster.name": "eck", "node.name": "eck-es-data-1", "message": "Updated repository generation from [216] to [217]", "cluster.uuid": "NYGS85kwQAO4m1cpEpjU1g", "node.id": "Um5J-w5fSgemzzKwhzC66g"  }

That doesn't make sense; this node spent several minutes taking a snapshot of [device_logs-000009][0]. Are you sure you configured the trace loggers as mentioned above?

Yeah, I used the _cluster/settings endpoint to update it as a transient setting. I hope this is the right way.

{
  "persistent": {},
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "enable": "all",
          "exclude": {
            "_name": "none_excluded"
          }
        }
      }
    },
    "logger": {
      "org": {
        "elasticsearch": {
          "snapshots": "TRACE",
          "repositories": "TRACE"
        }
      }
    }
  }
}
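
This is roughly the update request I sent (reconstructed, using the flat key form):

PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.snapshots": "TRACE",
    "logger.org.elasticsearch.repositories": "TRACE"
  }
}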

That looks correct, but there's still something wrong if that's all the logs you get on the data nodes. These are the logs that I get on a 7.14.2 data node when taking a snapshot:

{"type":"server","timestamp":"2021-12-02T14:29:26,877Z","level":"TRACE","component":"o.e.s.SnapshotShardsService","cluster.name":"elasticsearch","node.name":"node-1","message":"[[device_logs-000009][0]] adding shard to the queue","cluster.uuid":"f6tgGPWhTomXEtIs8_j3Ig","node.id":"Uh-tpu9fSRiYvba_Fm1b1Q"}
{"type":"server","timestamp":"2021-12-02T14:29:27,139Z","level":"DEBUG","component":"o.e.r.b.BlobStoreRepository","cluster.name":"elasticsearch","node.name":"node-1","message":"[[device_logs-000009][0]] [snap-1/NynJN08LQXGHv6xhCmmkxQ] snapshot to [default-repo] [_new] ...","cluster.uuid":"f6tgGPWhTomXEtIs8_j3Ig","node.id":"Uh-tpu9fSRiYvba_Fm1b1Q"}
{"type":"server","timestamp":"2021-12-02T14:29:27,140Z","level":"TRACE","component":"o.e.r.b.BlobStoreRepository","cluster.name":"elasticsearch","node.name":"node-1","message":"[[device_logs-000009][0]] [snap-1/NynJN08LQXGHv6xhCmmkxQ] Loading store metadata using index commit [SnapshotIndexCommit{IndexFileDeleter.CommitPoint(segments_3)}]","cluster.uuid":"f6tgGPWhTomXEtIs8_j3Ig","node.id":"Uh-tpu9fSRiYvba_Fm1b1Q"}
{"type":"server","timestamp":"2021-12-02T14:29:27,141Z","level":"TRACE","component":"o.e.r.b.BlobStoreRepository","cluster.name":"elasticsearch","node.name":"node-1","message":"[[device_logs-000009][0]] [snap-1/NynJN08LQXGHv6xhCmmkxQ] Processing [_0.cfe]","cluster.uuid":"f6tgGPWhTomXEtIs8_j3Ig","node.id":"Uh-tpu9fSRiYvba_Fm1b1Q"}
{"type":"server","timestamp":"2021-12-02T14:29:27,141Z","level":"TRACE","component":"o.e.r.b.BlobStoreRepository","cluster.name":"elasticsearch","node.name":"node-1","message":"[[device_logs-000009][0]] [snap-1/NynJN08LQXGHv6xhCmmkxQ] Processing [_0.si]","cluster.uuid":"f6tgGPWhTomXEtIs8_j3Ig","node.id":"Uh-tpu9fSRiYvba_Fm1b1Q"}
{"type":"server","timestamp":"2021-12-02T14:29:27,142Z","level":"TRACE","component":"o.e.r.b.BlobStoreRepository","cluster.name":"elasticsearch","node.name":"node-1","message":"[[device_logs-000009][0]] [snap-1/NynJN08LQXGHv6xhCmmkxQ] Processing [_0.cfs]","cluster.uuid":"f6tgGPWhTomXEtIs8_j3Ig","node.id":"Uh-tpu9fSRiYvba_Fm1b1Q"}
{"type":"server","timestamp":"2021-12-02T14:29:27,142Z","level":"TRACE","component":"o.e.r.b.BlobStoreRepository","cluster.name":"elasticsearch","node.name":"node-1","message":"[[device_logs-000009][0]] [snap-1/NynJN08LQXGHv6xhCmmkxQ] Processing [segments_3]","cluster.uuid":"f6tgGPWhTomXEtIs8_j3Ig","node.id":"Uh-tpu9fSRiYvba_Fm1b1Q"}
{"type":"server","timestamp":"2021-12-02T14:29:27,220Z","level":"TRACE","component":"o.e.r.b.BlobStoreRepository","cluster.name":"elasticsearch","node.name":"node-1","message":"[default-repo] Writing [__YRxihuABRruRMWtkn19UmA] to [[indices][7ZNV5ILUS_epAGTs9t5ivA][0]]","cluster.uuid":"f6tgGPWhTomXEtIs8_j3Ig","node.id":"Uh-tpu9fSRiYvba_Fm1b1Q"}
{"type":"server","timestamp":"2021-12-02T14:29:27,220Z","level":"TRACE","component":"o.e.r.b.BlobStoreRepository","cluster.name":"elasticsearch","node.name":"node-1","message":"[default-repo] Writing [__F_fl08tDSPuLd3c_Qsixmg] to [[indices][7ZNV5ILUS_epAGTs9t5ivA][0]]","cluster.uuid":"f6tgGPWhTomXEtIs8_j3Ig","node.id":"Uh-tpu9fSRiYvba_Fm1b1Q"}
{"type":"server","timestamp":"2021-12-02T14:29:27,245Z","level":"TRACE","component":"o.e.r.b.BlobStoreRepository","cluster.name":"elasticsearch","node.name":"node-1","message":"[[device_logs-000009][0]] [snap-1/NynJN08LQXGHv6xhCmmkxQ] writing shard snapshot file","cluster.uuid":"f6tgGPWhTomXEtIs8_j3Ig","node.id":"Uh-tpu9fSRiYvba_Fm1b1Q"}
{"type":"server","timestamp":"2021-12-02T14:29:27,321Z","level":"DEBUG","component":"o.e.s.SnapshotShardsService","cluster.name":"elasticsearch","node.name":"node-1","message":"[[device_logs-000009][0]][default-repo:snap-1/NynJN08LQXGHv6xhCmmkxQ] completed snapshot to [default-repo] with status [index shard snapshot status (stage=DONE, startTime=1638455366943, totalTime=203, incrementalFileCount=4, totalFileCount=4, processedFileCount=2, incrementalSize=3080, totalSize=3080, processedSize=2399, indexVersion=3, failure='null')] at generation [cb8M3S3ER_GdzcuuD7_gsw]","cluster.uuid":"f6tgGPWhTomXEtIs8_j3Ig","node.id":"Uh-tpu9fSRiYvba_Fm1b1Q"}
{"type":"server","timestamp":"2021-12-02T14:29:27,604Z","level":"TRACE","component":"o.e.s.SnapshotShardsService","cluster.name":"elasticsearch","node.name":"node-1","message":"[[device_logs-000009][0]][default-repo:snap-1/NynJN08LQXGHv6xhCmmkxQ] updated snapshot state to [ShardSnapshotStatus[state=SUCCESS, nodeId=Uh-tpu9fSRiYvba_Fm1b1Q, reason=null, generation=cb8M3S3ER_GdzcuuD7_gsw, shardSnapshotResult=ShardSnapshotResult{generation='cb8M3S3ER_GdzcuuD7_gsw', size=3kb, segmentCount=1}]]","cluster.uuid":"f6tgGPWhTomXEtIs8_j3Ig","node.id":"Uh-tpu9fSRiYvba_Fm1b1Q"}
{"type":"server","timestamp":"2021-12-02T14:29:28,447Z","level":"DEBUG","component":"o.e.r.b.BlobStoreRepository","cluster.name":"elasticsearch","node.name":"node-1","message":"Updated repository generation from [1] to [2]","cluster.uuid":"f6tgGPWhTomXEtIs8_j3Ig","node.id":"Uh-tpu9fSRiYvba_Fm1b1Q"}

Even if the snapshot is a no-op you still get this:

{"type":"server","timestamp":"2021-12-02T14:29:39,856Z","level":"TRACE","component":"o.e.s.SnapshotShardsService","cluster.name":"elasticsearch","node.name":"node-1","message":"[[device_logs-000009][0]] adding shard to the queue","cluster.uuid":"f6tgGPWhTomXEtIs8_j3Ig","node.id":"Uh-tpu9fSRiYvba_Fm1b1Q"}
{"type":"server","timestamp":"2021-12-02T14:29:39,857Z","level":"DEBUG","component":"o.e.r.b.BlobStoreRepository","cluster.name":"elasticsearch","node.name":"node-1","message":"[[device_logs-000009][0]] [snap-2/IPaIagunQh6R_isGZepm_g] snapshot to [default-repo] [cb8M3S3ER_GdzcuuD7_gsw] ...","cluster.uuid":"f6tgGPWhTomXEtIs8_j3Ig","node.id":"Uh-tpu9fSRiYvba_Fm1b1Q"}
{"type":"server","timestamp":"2021-12-02T14:29:39,922Z","level":"TRACE","component":"o.e.r.b.BlobStoreRepository","cluster.name":"elasticsearch","node.name":"node-1","message":"[[device_logs-000009][0]] [snap-2/IPaIagunQh6R_isGZepm_g] writing shard snapshot file","cluster.uuid":"f6tgGPWhTomXEtIs8_j3Ig","node.id":"Uh-tpu9fSRiYvba_Fm1b1Q"}
{"type":"server","timestamp":"2021-12-02T14:29:40,014Z","level":"DEBUG","component":"o.e.s.SnapshotShardsService","cluster.name":"elasticsearch","node.name":"node-1","message":"[[device_logs-000009][0]][default-repo:snap-2/IPaIagunQh6R_isGZepm_g] completed snapshot to [default-repo] with status [index shard snapshot status (stage=DONE, startTime=1638455379723, totalTime=202, incrementalFileCount=0, totalFileCount=4, processedFileCount=0, incrementalSize=0, totalSize=3080, processedSize=0, indexVersion=3, failure='null')] at generation [I6FxrZpmS666-V3IujjyVw]","cluster.uuid":"f6tgGPWhTomXEtIs8_j3Ig","node.id":"Uh-tpu9fSRiYvba_Fm1b1Q"}
{"type":"server","timestamp":"2021-12-02T14:29:40,340Z","level":"TRACE","component":"o.e.s.SnapshotShardsService","cluster.name":"elasticsearch","node.name":"node-1","message":"[[device_logs-000009][0]][default-repo:snap-2/IPaIagunQh6R_isGZepm_g] updated snapshot state to [ShardSnapshotStatus[state=SUCCESS, nodeId=Uh-tpu9fSRiYvba_Fm1b1Q, reason=null, generation=I6FxrZpmS666-V3IujjyVw, shardSnapshotResult=ShardSnapshotResult{generation='I6FxrZpmS666-V3IujjyVw', size=3kb, segmentCount=1}]]","cluster.uuid":"f6tgGPWhTomXEtIs8_j3Ig","node.id":"Uh-tpu9fSRiYvba_Fm1b1Q"}
{"type":"server","timestamp":"2021-12-02T14:29:41,081Z","level":"DEBUG","component":"o.e.r.b.BlobStoreRepository","cluster.name":"elasticsearch","node.name":"node-1","message":"Updated repository generation from [2] to [3]","cluster.uuid":"f6tgGPWhTomXEtIs8_j3Ig","node.id":"Uh-tpu9fSRiYvba_Fm1b1Q"}