Please help me understand a performance problem

We just started using the Elastic Cloud solution.
When we switched part of our services to it, we started to see performance degradation.

My deployment has 2 HOT data nodes and 2 WARM storage nodes.

Instance #0
    Healthy
    v7.12.0
    4 GB RAM
        azure.data.highio.l32sv2
        data_hot
        data_content
        master
        coordinating
        ingest

Instance #1
    Healthy
    v7.12.0
    4 GB RAM
        azure.data.highio.l32sv2
        data_hot
        data_content
        master eligible
        coordinating
        ingest


Monitoring reported 100% CPU utilization on node 1.
I can see from the Performance graphs that we usually use up all of our CPU credits.

Maybe I have a badly planned architecture?
Or is the only way to fix it to increase the cluster size by switching to bigger DATA HOT nodes?

Please help me understand what's going on.
I'm trying to find the most CPU-intensive tasks:

GET /_nodes/instance-0000000001/hot_threads

::: {instance-0000000001}{NPwmqbLnQt-bS9RI-vky6Q}{a-1hnleJRGyyJvP_FIa5Pg}{10.46.24.43}{10.46.24.43:19576}{himrst}{logical_availability_zone=zone-1, server_name=instance-0000000001.e61c85c0f72e451e85c281a6c4db29c5, availability_zone=westeurope-2, xpack.installed=true, data=hot, instance_configuration=azure.data.highio.l32sv2, transform.node=true, region=unknown-region}
   Hot threads at 2021-04-21T15:06:02.888Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   24.8% (123.8ms out of 500ms) cpu usage by thread 'elasticsearch[instance-0000000001][write][T#1]'
     3/10 snapshots sharing following 22 elements
       app//org.elasticsearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:434)
       app//org.elasticsearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:405)
       app//org.elasticsearch.index.mapper.DocumentParser.internalParseDocument(DocumentParser.java:111)
       app//org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:69)
       app//org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:51)
       app//org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:121)
       app//org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:852)
       app//org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:829)
       app//org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnReplica(IndexShard.java:808)
       app//org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:469)
       app//org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:451)
       app//org.elasticsearch.action.bulk.TransportShardBulkAction.lambda$dispatchedShardOperationOnReplica$5(TransportShardBulkAction.java:416)
       app//org.elasticsearch.action.bulk.TransportShardBulkAction$$Lambda$6451/0x0000000801ce28d8.get(Unknown Source)
       app//org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:329)
       app//org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnReplica(TransportShardBulkAction.java:415)
       app//org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnReplica(TransportShardBulkAction.java:74)
       app//org.elasticsearch.action.support.replication.TransportWriteAction$2.doRun(TransportWriteAction.java:193)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@15.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@15.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@15.0.1/java.lang.Thread.run(Thread.java:832)
     2/10 snapshots sharing following 14 elements
        app//org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnReplica(IndexShard.java:808)
        app//org.elasticsearch.action.bulk.TransportShardBulkAction.lambda$dispatchedShardOperationOnReplica$5(TransportShardBulkAction.java:416)
       app//org.elasticsearch.action.bulk.TransportShardBulkAction$$Lambda$6451/0x0000000801ce28d8.get(Unknown Source)
       app//org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:329)
       app//org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnReplica(TransportShardBulkAction.java:415)
       app//org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnReplica(TransportShardBulkAction.java:74)
       app//org.elasticsearch.action.support.replication.TransportWriteAction$2.doRun(TransportWriteAction.java:193)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@15.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@15.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@15.0.1/java.lang.Thread.run(Thread.java:832)
     5/10 snapshots sharing following 10 elements
       java.base@15.0.1/jdk.internal.misc.Unsafe.park(Native Method)
       java.base@15.0.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
       java.base@15.0.1/java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:743)
       java.base@15.0.1/java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:684)
       java.base@15.0.1/java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1366)
       app//org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:154)
       java.base@15.0.1/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1056)
       java.base@15.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116)
       java.base@15.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@15.0.1/java.lang.Thread.run(Thread.java:832)
   
   22.3% (111.2ms out of 500ms) cpu usage by thread 'elasticsearch[instance-0000000001][write][T#2]'
     3/10 snapshots sharing following 17 elements
       app//org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:951)
       app//org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:872)
       app//org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:844)
       app//org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnReplica(IndexShard.java:808)
       app//org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:469)
       app//org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:451)
       app//org.elasticsearch.action.bulk.TransportShardBulkAction.lambda$dispatchedShardOperationOnReplica$5(TransportShardBulkAction.java:416)
       app//org.elasticsearch.action.bulk.TransportShardBulkAction$$Lambda$6451/0x0000000801ce28d8.get(Unknown Source)
       app//org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:329)
       app//org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnReplica(TransportShardBulkAction.java:415)
       app//org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnReplica(TransportShardBulkAction.java:74)
       app//org.elasticsearch.action.support.replication.TransportWriteAction$2.doRun(TransportWriteAction.java:193)
       
       java.base@15.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@15.0.1/java.lang.Thread.run(Thread.java:832)
     2/10 snapshots sharing following 8 elements
       app//org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnReplica(TransportShardBulkAction.java:415)
       app//org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnReplica(TransportShardBulkAction.java:74)
       app//org.elasticsearch.action.support.replication.TransportWriteAction$2.doRun(TransportWriteAction.java:193)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@15.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@15.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@15.0.1/java.lang.Thread.run(Thread.java:832)
     5/10 snapshots sharing following 10 elements
       java.base@15.0.1/jdk.internal.misc.Unsafe.park(Native Method)
       java.base@15.0.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
       java.base@15.0.1/java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:743)
       java.base@15.0.1/java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:684)
       java.base@15.0.1/java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1366)
       app//org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:154)
       java.base@15.0.1/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1056)
       java.base@15.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116)
       java.base@15.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@15.0.1/java.lang.Thread.run(Thread.java:832)
   
   12.7% (63.6ms out of 500ms) cpu usage by thread 'elasticsearch[instance-0000000001][transport_worker][T#1]'
     2/10 snapshots sharing following 20 elements
       io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1267)
       io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1314)
       io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501)
       io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:440)
       io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
       io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
       io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
       io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615)
       io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578)
       io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
       io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
       io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
       java.base@15.0.1/java.lang.Thread.run(Thread.java:832)
     7/10 snapshots sharing following 9 elements
       java.base@15.0.1/sun.nio.ch.EPoll.wait(Native Method)
       java.base@15.0.1/sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:120)
       java.base@15.0.1/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:129)
       java.base@15.0.1/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:146)
       io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:803)
       io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:457)
       io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
       io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
       java.base@15.0.1/java.lang.Thread.run(Thread.java:832)


On Elastic Cloud I believe CPU allocation is proportional to the size of the node in terms of RAM, and 4 GB nodes are quite small. It sounds like you are overloading it. How much data are you indexing per day? How are you indexing this data? How long are you keeping data on the hot nodes?
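To confirm whether indexing is the bottleneck, it may also be worth looking at the write thread pool for queueing and rejections, for example (just a quick sketch to run against your cluster):

GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected,completed
GET _nodes/stats/thread_pool

A persistently full queue or a growing rejected count on the hot nodes would point at indexing pressure rather than search load.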

Thanks for the reply.
We get about ~50 GB of data per day.
That's roughly ~85,000,000 messages.
This is indexed on the 2 HOT nodes - or maybe I didn't understand your question about indexing - can you please clarify?

Last night I scaled the deployment up to 2 x 8 GB RAM data_hot nodes, and today I can see that Kibana search performance is much better. We are also no longer using up the CPU credits.
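If it helps to quantify the volume, daily index and shard sizes can be checked with the _cat APIs, for example (a sketch - the filebeat-* pattern is an assumption based on my ILM policy below):

GET _cat/indices/filebeat-*?v&h=index,pri,rep,docs.count,pri.store.size&s=index:desc
GET _cat/shards/filebeat-*?v&h=index,shard,prirep,docs,store,node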

Regarding my ILM:

"filebeat" : {
    "version" : 8,
    "modified_date" : "2021-04-21T15:03:51.597Z",
    "policy" : {
      "phases" : {
        "hot" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_size" : "50gb",
              "max_age" : "7d"
            }
          }
        },
        "delete" : {
          "min_age" : "30d",
          "actions" : {
            "delete" : {
              "delete_searchable_snapshot" : true
            }
          }
        },
        "warm" : {
          "min_age" : "7d",
          "actions" : {
            "set_priority" : {
              "priority" : 50
            }
          }
        }
      }
    }
  }
}
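As a possible variant I'm considering (only a sketch, not tested - the shorter max_age and the warm-phase forcemerge/shrink are my own assumptions about what might take load off the hot nodes):

PUT _ilm/policy/filebeat
{
  "policy" : {
    "phases" : {
      "hot" : {
        "min_age" : "0ms",
        "actions" : {
          "rollover" : {
            "max_size" : "50gb",
            "max_age" : "1d"
          }
        }
      },
      "warm" : {
        "min_age" : "7d",
        "actions" : {
          "set_priority" : { "priority" : 50 },
          "forcemerge" : { "max_num_segments" : 1 },
          "shrink" : { "number_of_shards" : 1 }
        }
      },
      "delete" : {
        "min_age" : "30d",
        "actions" : {
          "delete" : { "delete_searchable_snapshot" : true }
        }
      }
    }
  }
}

The idea is to roll over daily so individual indices stay smaller, and to do the force-merge/shrink work on the warm nodes once data ages out of the hot tier.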

Any recommendations to improve the ILM policy?
Thanks

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.