Hi, we are running Elasticsearch 7.6.2 on our own in AWS.
We have:
6 data nodes on i3en.xlarge instances
3 dedicated master nodes
200 indices storing ~800 million documents
Total size ~5TB
30s refresh interval
We are observing high CPU usage most of the time (consistently in the 60-80% range, with spikes to 100%).
Although the performance of the cluster seems acceptable:
ingesting ~500 docs/second
no rejections on bulk thread pool
query times look good
I'm a bit concerned about the CPU usage because we expect more data from our customers, and we will have to add more nodes or even increase the specs of the existing ones to handle the additional load.
This is the output from Hot Threads:
It seems like Elasticsearch is spending a lot of time merging.
I would like to understand whether such high CPU usage is expected on a well-balanced cluster, especially when the workload is mostly bulk indexing.
I tried increasing refresh_interval, but the CPU usage didn't change at all. GC activity did change a bit and the merge sizes increased, but the CPU remained high.
What is the size of the documents? Are nested documents used to a large extent? Are you updating documents or just indexing new ones? Are you by any chance forcing a refresh after each bulk request?
What is the size of the documents?
3-4KB per document. Also, most of the documents have more than 1K fields.
Are nested documents used to a large extent?
Yes, 1/6th of the documents have nested documents.
Are you updating documents or just indexing new ones?
Only indexing new ones.
Are you by any chance forcing a refresh after each bulk request?
I am using the Jest client to connect to the cluster and I am not passing any parameter to force a refresh.
I also debugged the code and didn't see the client library passing any parameter that would trigger a refresh immediately after the bulk insert.
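For illustration, here is a minimal sketch of what our bulk indexing looks like with Jest (the class, index name, and JSON docs are made up, not our actual code); as far as I can tell, nothing sets the refresh parameter:

```java
import io.searchbox.client.JestClient;
import io.searchbox.core.Bulk;
import io.searchbox.core.BulkResult;
import io.searchbox.core.Index;

import java.io.IOException;
import java.util.List;

// Minimal sketch, assuming the Jest client (io.searchbox); names are illustrative.
public class BulkWriter {

    // Sends one bulk request. No refresh parameter is set anywhere, so the
    // index-level refresh_interval (30s in our case) stays in charge of refreshes.
    static void indexBatch(JestClient client, String index, List<String> jsonDocs) throws IOException {
        Bulk.Builder bulk = new Bulk.Builder().defaultIndex(index);
        for (String json : jsonDocs) {
            bulk.addAction(new Index.Builder(json).build());
            // Forcing a refresh per action would look like this -- we do NOT do it:
            // new Index.Builder(json).setParameter(io.searchbox.params.Parameters.REFRESH, true).build();
        }
        BulkResult result = client.execute(bulk.build());
        if (!result.isSucceeded()) {
            throw new IllegalStateException("Bulk failed: " + result.getErrorMessage());
        }
    }
}
```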
I observed that all the daily indices are getting refreshed, which makes sense, but most of my indices, especially those older than 2-5 days, will never change, so refreshing them is not needed.
I will try disabling refresh (setting refresh_interval to -1) on the older indices to see if it makes any difference.
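Something along these lines should do it, I think (a rough sketch assuming Jest's UpdateSettings action; the index pattern is just an example):

```java
import io.searchbox.client.JestClient;
import io.searchbox.client.JestResult;
import io.searchbox.indices.settings.UpdateSettings;

import java.io.IOException;

// Minimal sketch, assuming Jest's UpdateSettings action; the index pattern is illustrative.
public class RefreshDisabler {

    // Sets refresh_interval to -1 (disables refresh) on the matching indices.
    // Passing e.g. "60s" instead would simply lengthen the interval.
    static void disableRefresh(JestClient client, String indexPattern) throws IOException {
        String settings = "{ \"index\": { \"refresh_interval\": \"-1\" } }";
        UpdateSettings action = new UpdateSettings.Builder(settings)
                .addIndex(indexPattern)   // e.g. all daily indices older than a few days
                .build();
        JestResult result = client.execute(action);
        if (!result.isSucceeded()) {
            throw new IllegalStateException("Failed to update settings: " + result.getErrorMessage());
        }
    }
}
```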
In the end I added four more nodes to the cluster, and the CPU usage now looks normal.
I don't know if I'm obsessing over the right thing (the high CPU usage),
but I feel that:
I might need to tune the bulk size of the writes. Right now we send continuous batches of 200 records straight from Kafka; maybe I should try increasing that size.
Also, a single bulk request may contain documents for 6-12 different indices, so maybe that puts extra strain on the cluster while bulk indexing.
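What I have in mind is roughly the sketch below: buffer records per target index and only flush a bulk once it reaches a larger size (the class name and the 1000-doc flush size are made up, just to illustrate the idea):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import io.searchbox.client.JestClient;
import io.searchbox.core.Bulk;
import io.searchbox.core.Index;

// Minimal sketch (names and sizes are made up): group Kafka records by target index
// and flush larger, single-index bulk requests instead of small mixed ones.
public class PerIndexBatcher {

    private static final int FLUSH_SIZE = 1000;   // hypothetical, up from 200

    private final JestClient client;
    private final Map<String, List<String>> buffers = new HashMap<>();

    public PerIndexBatcher(JestClient client) {
        this.client = client;
    }

    // Called for every record consumed from Kafka.
    public void add(String targetIndex, String jsonDoc) throws IOException {
        List<String> buffer = buffers.computeIfAbsent(targetIndex, k -> new ArrayList<>());
        buffer.add(jsonDoc);
        if (buffer.size() >= FLUSH_SIZE) {
            flush(targetIndex, buffer);
            buffer.clear();
        }
    }

    // Sends one bulk request that targets a single index.
    private void flush(String targetIndex, List<String> docs) throws IOException {
        Bulk.Builder bulk = new Bulk.Builder().defaultIndex(targetIndex);
        for (String json : docs) {
            bulk.addAction(new Index.Builder(json).build());
        }
        client.execute(bulk.build());
    }
}
```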
If you are indexing into many indices and shards, it may result in small batches being processed by individual shards. This can lead to a lot of disk I/O, so it would be good to look at disk utilisation and iowait, e.g. using iostat. What type of storage do you have? Local SSDs?