Elasticsearch high CPU usage on a mostly bulk indexing use case

Hi, we are running a self-managed Elasticsearch 7.6.2 cluster on AWS.

We have:

  • 6 data nodes on i3en.xlarge instances
  • 3 dedicated master nodes
  • 200 indices storing ~800 million documents
  • Total size ~5TB
  • 30s refresh interval

We are observing high CPU usage most of the time (60-80%, with spikes to 100%), although the performance of the cluster seems acceptable:

  • ingesting ~500 docs/second
  • no rejections on bulk thread pool
  • query times look good

I'm a bit concerned about the CPU usage because we expect more data from our customers, and we will have to add more nodes or increase the specs of the existing ones to handle the additional load.

This is the output from Hot Threads:

It seems like Elasticsearch is spending a lot of time merging.

I would like to understand whether such high CPU usage is expected on a well-balanced cluster, especially when the workload is mostly bulk indexing.
I tried increasing refresh_interval, but the CPU usage didn't change at all. GC activity did change a bit and the merge sizes increased, but the CPU remained high.

Thanks in advance.

What is the size of the documents? Are nested documents used to a large extent? Are you updating documents or just indexing new ones? Are you by any chance forcing a refresh after each bulk request?

Thanks for answering so fast.

What is the size of the documents?
3-4 KB per document. Also, most of the documents have more than 1,000 fields.

Are nested documents used to a large extent?
Yes, about 1/6 of the documents contain nested documents.

Are you updating documents or just indexing new ones?
Only indexing new ones.

Are you by any chance forcing a refresh after each bulk request?
I am using the Jest client to connect to the cluster, and I am not passing any parameter to force a refresh.
I also debugged the code and didn't see the client library passing any refresh parameter immediately after the bulk insert.
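
For reference, this is roughly how our indexing path is built with Jest (a minimal sketch; the host, index name, and document shape are placeholders, not our actual code):

```java
import io.searchbox.client.JestClient;
import io.searchbox.client.JestClientFactory;
import io.searchbox.client.config.HttpClientConfig;
import io.searchbox.core.Bulk;
import io.searchbox.core.BulkResult;
import io.searchbox.core.Index;

import java.util.List;
import java.util.Map;

public class BulkIndexer {
    private final JestClient client;

    public BulkIndexer(String esUrl) {
        JestClientFactory factory = new JestClientFactory();
        factory.setHttpClientConfig(
                new HttpClientConfig.Builder(esUrl).multiThreaded(true).build());
        this.client = factory.getObject();
    }

    // Sends one bulk request. Note that no refresh parameter is set,
    // so Elasticsearch refreshes only on the index's own refresh_interval.
    public void indexBatch(String indexName, List<Map<String, Object>> docs) throws Exception {
        Bulk.Builder bulk = new Bulk.Builder()
                .defaultIndex(indexName)
                .defaultType("_doc");
        for (Map<String, Object> doc : docs) {
            bulk.addAction(new Index.Builder(doc).build());
        }
        BulkResult result = client.execute(bulk.build());
        if (!result.isSucceeded()) {
            throw new IllegalStateException("Bulk failed: " + result.getErrorMessage());
        }
    }
}
```

As far as I can tell, nothing here asks Elasticsearch to refresh after the bulk.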

How can a 4 KB document have 1,000 fields? That doesn't sound right.

Large and complex documents require Elasticsearch to do a lot of work per document, so they will be a lot slower to index and use more CPU than smaller documents.

Actually, I quoted the average from a daily index.

Some of the documents can exceed the 1,000-field limit and can be more than 100 KB in size.

I guess that if I manage to reduce the document size somehow and/or avoid indexing fields that are not needed, I can get the CPU usage down.
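
If it helps anyone, a minimal sketch of what I mean (the field names and allow-list are made up for illustration; the idea is just to drop fields we never query before indexing):

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class DocTrimmer {
    // Hypothetical allow-list; in practice this would come from the
    // fields we actually search or aggregate on.
    private static final Set<String> KEEP =
            Set.of("timestamp", "customer_id", "status", "message");

    // Drops every top-level field we never query,
    // shrinking the document before it is bulk-indexed.
    public static Map<String, Object> trim(Map<String, Object> doc) {
        return doc.entrySet().stream()
                .filter(e -> KEEP.contains(e.getKey()))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```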

Related to the high CPU: I observed that all the daily indices are getting refreshed, which makes sense, but most of my indices, especially those older than 2-5 days, will never change, so refreshing them is not needed.

I will try to disable refresh_interval on older indices to see if it makes any difference.
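
For anyone following along, disabling refresh comes down to setting refresh_interval to -1 on the old indices. A sketch of how I plan to do it through Jest (the index name is a placeholder):

```java
import io.searchbox.client.JestClient;
import io.searchbox.indices.settings.UpdateSettings;

public class RefreshDisabler {
    // Disables refresh on an index that no longer receives writes.
    // "-1" turns refresh off; a value like "30s" re-enables it later.
    public static void disableRefresh(JestClient client, String indexName) throws Exception {
        String source = "{ \"index\": { \"refresh_interval\": \"-1\" } }";
        UpdateSettings updateSettings = new UpdateSettings.Builder(source)
                .addIndex(indexName)
                .build();
        client.execute(updateSettings);
    }
}
```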

In the end I added four more nodes to the cluster, and now the CPU usage appears to be normal.
I don't know if I'm obsessing over the right thing (high CPU usage), but I feel that:

  1. I might need to tune the bulk size of the writes. Right now we do continuous batches of 200 records straight from Kafka; maybe I should try increasing that size (see the sketch after this list).

  2. Also, a single bulk request may span 6-12 different indices, which may put extra strain on the cluster while bulk indexing.
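
To make point 1 concrete, here is a rough sketch of the kind of batching I have in mind (the thresholds and the flush call are assumptions for illustration, not tested values):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class BulkAccumulator {
    private static final int MAX_DOCS = 1000;         // instead of fixed batches of 200
    private static final long MAX_BYTES = 5_000_000;  // ~5 MB per bulk as a starting point

    public static void consume(KafkaConsumer<String, String> consumer) {
        List<String> buffer = new ArrayList<>();
        long bufferedBytes = 0;

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                buffer.add(record.value());
                bufferedBytes += record.value().getBytes().length;
                if (buffer.size() >= MAX_DOCS || bufferedBytes >= MAX_BYTES) {
                    flush(buffer);          // one bulk request for the whole buffer
                    buffer.clear();
                    bufferedBytes = 0;
                    consumer.commitSync();  // commit offsets only after a successful flush
                }
            }
        }
    }

    // Placeholder: in our setup this would call the Jest bulk indexer shown earlier.
    private static void flush(List<String> docs) {
        // send docs in a single Bulk request
    }
}
```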

I am curious, as we have a similar situation. Did disabling refresh_interval on older indices make any difference?

No difference at all in our case with refresh_interval.

If you are indexing into many indices and shards it may result in small batches being processed by individual shards. This can lead to a lot of disk I/O so it would be good to look at disk utilisation and iowait, e.g. using iostat. What type of storage do you have? Local SSDs?

This is how the I/O operations look on my dashboard:

[chart: I/O operations]

[chart: Disk Average Wait Time]

Yes, I'm using local SSDs.
Adding four more nodes to the cluster brought the CPU usage down.