Elasticsearch high CPU utilization

Hi,
I have an Elasticsearch setup consisting of 5 data nodes, 1 master node, and 1 client node.
Each data node has 1.5 vCPU and 4 GB of memory. My indexing rate is 10K logs per second.
During indexing, CPU usage on the data nodes is high (90%).

When I query documents in parallel while indexing is running, my queries fail with a client request timeout error:

{
"statusCode": 504,
"error": "Gateway Time-out",
"message": "Client request timeout"
}

The data nodes are also getting stuck.
Below is the output of GET /_nodes/hot_threads.
Please help me solve the CPU usage issue.

::: {elasticsearch-data-2}{rqtu_kkYTsODGIRnls535w}{NIJMJ2CiQ3K8sKar9-GW_g}{10.24.2.5}{10.24.2.5:9300}{xpack.installed=true}
   Hot threads at 2019-10-16T14:49:19.663, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   32.2% (161ms out of 500ms) cpu usage by thread 'elasticsearch[elasticsearch-data-2][search][T#1]'
     5/10 snapshots sharing following 28 elements
       app//org.elasticsearch.search.aggregations.AggregatorFactory$MultiBucketAggregatorWrapper$1.collect(AggregatorFactory.java:140)
       app//org.elasticsearch.search.aggregations.bucket.BucketsAggregator.collectExistingBucket(BucketsAggregator.java:84)

   
   30.2% (150.9ms out of 500ms) cpu usage by thread 'elasticsearch[elasticsearch-data-2][write][T#2]'
     2/10 snapshots sharing following 36 elements     
                                   
   
   25.7% (128.5ms out of 500ms) cpu usage by thread 'elasticsearch[elasticsearch-data-2][[latest-map][4]: Lucene Merge Thread #2106]'
     5/10 snapshots sharing following 5 elements
       app//org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4412)
       app//org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4061)
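
To read the hot threads output above at a glance, the three thread headers can be grouped by activity. The following is a minimal sketch, not an official tool: the `classify` and `cpu_by_activity` helpers are hypothetical names, and the parsing assumes the header format shown in this output. The percentages are each thread's CPU time within the 500 ms sampling interval, and only the three busiest threads are reported.

```python
import re
from collections import defaultdict

# Thread header lines copied from the hot_threads output above.
HOT_THREADS = """\
32.2% (161ms out of 500ms) cpu usage by thread 'elasticsearch[elasticsearch-data-2][search][T#1]'
30.2% (150.9ms out of 500ms) cpu usage by thread 'elasticsearch[elasticsearch-data-2][write][T#2]'
25.7% (128.5ms out of 500ms) cpu usage by thread 'elasticsearch[elasticsearch-data-2][[latest-map][4]: Lucene Merge Thread #2106]'
"""

# Matches "<pct>% (...) cpu usage by thread '<name>'".
HEADER = re.compile(r"^\s*([\d.]+)% \(.*?\) cpu usage by thread '(.+)'\s*$")


def classify(thread_name: str) -> str:
    """Map a thread name to a coarse activity label (hypothetical helper)."""
    if "Lucene Merge Thread" in thread_name:
        return "merge"
    # Thread pool threads end in "[<pool>][T#<n>]", e.g. "[search][T#1]".
    match = re.search(r"\]\[(\w+)\]\[T#\d+\]$", thread_name)
    return match.group(1) if match else "other"


def cpu_by_activity(text: str) -> dict:
    """Sum the reported per-thread CPU percentages by activity."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            totals[classify(match.group(2))] += float(match.group(1))
    return dict(totals)


summary = cpu_by_activity(HOT_THREADS)
for activity, pct in summary.items():
    print(f"{activity}: {pct:.1f}%")
print(f"total: {sum(summary.values()):.1f}%")
```

In this sample, search aggregations, indexing (write), and Lucene segment merging each take a comparable share, and together they account for most of one core's time on a node that has only 1.5 vCPU.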

That sounds like a very low amount of resources for that kind of indexing rate, so I am not surprised you are having issues. You should also make sure you have 3 master-eligible nodes in the cluster, as having only one is very bad and can lead to data loss.
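
As a rough illustration of both points, here is a back-of-the-envelope sketch using the numbers from the original post. It assumes documents are spread evenly across the data nodes and ignores replicas (each replica would add further indexing work), and it assumes a simple majority rule for master election. The function names are made up for this example.

```python
# Figures taken from the original post.
DATA_NODES = 5
VCPU_PER_NODE = 1.5
INDEX_RATE = 10_000  # logs per second, cluster-wide


def per_node_load(rate: float, nodes: int, vcpu: float) -> tuple:
    """Return (docs/s per node, docs/s per vCPU), assuming an even spread
    of primary-shard indexing and no replicas (an assumption)."""
    docs_per_node = rate / nodes
    return docs_per_node, docs_per_node / vcpu


def quorum(master_eligible: int) -> int:
    """Smallest majority of the master-eligible nodes."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """Master-eligible nodes that can be lost while a majority remains."""
    return master_eligible - quorum(master_eligible)


docs_per_node, docs_per_vcpu = per_node_load(INDEX_RATE, DATA_NODES, VCPU_PER_NODE)
print(f"{docs_per_node:.0f} docs/s per data node, {docs_per_vcpu:.0f} docs/s per vCPU")

for n in (1, 3):
    print(f"{n} master-eligible node(s): quorum={quorum(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Under these assumptions each data node has to index about 2,000 logs per second on 1.5 vCPU while also serving searches and running merges. With a single master-eligible node, no failure can be tolerated; with three, the cluster can lose one and still have a majority.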

Could you please share a resource allocation recommendation for this kind of use case?
FYI, we are running the cluster in a Kubernetes environment as a StatefulSet.

Have a look at the following:

https://www.elastic.co/blog/sizing-hot-warm-architectures-for-logging-and-metrics-in-the-elasticsearch-service-on-elastic-cloud
