Hi, we are running Elasticsearch 7.6.2 on our own in AWS.
We have:
6 data nodes on i3en.xlarge instances
3 dedicated master nodes
200 indices storing ~800 million documents
Total size ~5TB
30s refresh interval
We are observing high CPU usage most of the time (consistently in the 60-80% range, with spikes to 100%).
Although the performance of the cluster seems acceptable:
ingesting ~500 docs/second
no rejections on bulk thread pool
query times look good
I'm a bit concerned about the CPU usage because we expect more data from our customers, and we will have to add more nodes or even increase the specs of the existing ones to handle the additional load.
This is the output from Hot Threads:
It seems like Elasticsearch is spending a lot of time merging.
I would like to understand whether such high CPU usage is expected on a well-balanced cluster, especially when the workload is mostly bulk indexing.
I tried increasing refresh_interval, but the CPU usage didn't change at all. GC activity did change a bit and the merge sizes increased, but the CPU remained high.
What is the size of the documents? Are nested documents used to a large extent? Are you updating documents or just indexing new ones? Are you by any chance forcing a refresh after each bulk request?
What is the size of the documents?
3-4KB per document. Also, most of the documents have more than 1K fields.
Are nested documents used to a large extent?
Yes, 1/6th of the documents have nested documents.
Are you updating documents or just indexing new ones?
Only indexing new ones.
Are you by any chance forcing a refresh after each bulk request?
I am using the Jest client to connect to the cluster and I am not passing any parameter to force a refresh.
I also debugged the code and didn't see the client library passing any parameter that would trigger a refresh immediately after the bulk insert.
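For illustration, here is a minimal sketch of what our bulk indexing looks like with Jest (the class, index name, and JSON docs are made up, not our actual code); as far as I can tell, nothing sets the refresh parameter:

```java
import io.searchbox.client.JestClient;
import io.searchbox.core.Bulk;
import io.searchbox.core.BulkResult;
import io.searchbox.core.Index;

import java.io.IOException;
import java.util.List;

// Minimal sketch, assuming the Jest client (io.searchbox); names are illustrative.
public class BulkWriter {

    // Sends one bulk request. No refresh parameter is set anywhere, so the
    // index-level refresh_interval (30s in our case) stays in charge of refreshes.
    static void indexBatch(JestClient client, String index, List<String> jsonDocs) throws IOException {
        Bulk.Builder bulk = new Bulk.Builder().defaultIndex(index);
        for (String json : jsonDocs) {
            bulk.addAction(new Index.Builder(json).build());
            // Forcing a refresh per action would look like this -- we do NOT do it:
            // new Index.Builder(json).setParameter(io.searchbox.params.Parameters.REFRESH, true).build();
        }
        BulkResult result = client.execute(bulk.build());
        if (!result.isSucceeded()) {
            throw new IllegalStateException("Bulk failed: " + result.getErrorMessage());
        }
    }
}
```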
I observed that all the daily indices are getting refreshed, which makes sense, but most of my indices, especially those older than 2-5 days, will never change, so refreshing them is not needed.
I will try disabling refresh (setting refresh_interval to -1) on the older indices to see if it makes any difference.
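Something along these lines should do it, I think (a rough sketch assuming Jest's UpdateSettings action; the index pattern is just an example):

```java
import io.searchbox.client.JestClient;
import io.searchbox.client.JestResult;
import io.searchbox.indices.settings.UpdateSettings;

import java.io.IOException;

// Minimal sketch, assuming Jest's UpdateSettings action; the index pattern is illustrative.
public class RefreshDisabler {

    // Sets refresh_interval to -1 (disables refresh) on the matching indices.
    // Passing e.g. "60s" instead would simply lengthen the interval.
    static void disableRefresh(JestClient client, String indexPattern) throws IOException {
        String settings = "{ \"index\": { \"refresh_interval\": \"-1\" } }";
        UpdateSettings action = new UpdateSettings.Builder(settings)
                .addIndex(indexPattern)   // e.g. all daily indices older than a few days
                .build();
        JestResult result = client.execute(action);
        if (!result.isSucceeded()) {
            throw new IllegalStateException("Failed to update settings: " + result.getErrorMessage());
        }
    }
}
```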
In the end I added four more nodes to the cluster, and the CPU usage now looks normal.
I don't know if I'm obsessing over the right thing (the high CPU usage),
but I feel that:
I might need to tune the bulk size of the writes. Right now we send continuous batches of 200 records straight from Kafka; maybe I should try increasing that size.
Also, a single bulk request may contain documents for 6-12 different indices, so maybe that puts extra strain on the cluster while bulk indexing.
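What I have in mind is roughly the sketch below: buffer records per target index and only flush a bulk once it reaches a larger size (the class name and the 1000-doc flush size are made up, just to illustrate the idea):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import io.searchbox.client.JestClient;
import io.searchbox.core.Bulk;
import io.searchbox.core.Index;

// Minimal sketch (names and sizes are made up): group Kafka records by target index
// and flush larger, single-index bulk requests instead of small mixed ones.
public class PerIndexBatcher {

    private static final int FLUSH_SIZE = 1000;   // hypothetical, up from 200

    private final JestClient client;
    private final Map<String, List<String>> buffers = new HashMap<>();

    public PerIndexBatcher(JestClient client) {
        this.client = client;
    }

    // Called for every record consumed from Kafka.
    public void add(String targetIndex, String jsonDoc) throws IOException {
        List<String> buffer = buffers.computeIfAbsent(targetIndex, k -> new ArrayList<>());
        buffer.add(jsonDoc);
        if (buffer.size() >= FLUSH_SIZE) {
            flush(targetIndex, buffer);
            buffer.clear();
        }
    }

    // Sends one bulk request that targets a single index.
    private void flush(String targetIndex, List<String> docs) throws IOException {
        Bulk.Builder bulk = new Bulk.Builder().defaultIndex(targetIndex);
        for (String json : docs) {
            bulk.addAction(new Index.Builder(json).build());
        }
        client.execute(bulk.build());
    }
}
```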
If you are indexing into many indices and shards, it may result in small batches being processed by individual shards. This can lead to a lot of disk I/O, so it would be good to look at disk utilisation and iowait, e.g. using iostat. What type of storage do you have? Local SSDs?