I am working on a project that uses Elasticsearch and queries it to fetch member information. The index holds ~30 lakh (3 million) records.
Basically, I am running a campaign for 20 lakh (2 million) users, and the user data is stored in Elasticsearch 6.2. I query ES and fetch the records in batches (50 records at a time) using scroll. I also want to keep the search context alive for 1 day so that, if the campaign process fails for any reason, I can resume the campaign from where it stopped instead of starting over from the beginning. I am also saving the scroll ID and will use it to resume the campaign.
While testing, I found that CPU utilization increased by 50% (ES config: 2 nodes with 4 shards running on AWS, instance type: i3.xlarge.elasticsearch), and it stays steady at 50%.
Is there any relation between CPU utilization and keeping the search context open for 1 day? BTW, campaigns take 6 hours to finish.
I've never used Scroll with a timeout of more than 1 minute, and there is probably a price to pay for keeping the search context open for much longer. The official Scroll documentation warns that:
an open search context prevents the old segments from being deleted while they are still in use. [...] Keeping older segments alive means that more file handles are needed.
But I'm not sure if this explains the increased CPU you're seeing.
An alternative to Scroll is the lightweight Search After mechanism, which is very useful if you can order your records on a unique field, for instance a sequence number or a timestamp. With this mechanism you're not keeping expensive open state in Elasticsearch, because each search request knows where to start (after). Hence you can perform one search today and the next in a week, at no extra cost.
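To make the Search After idea concrete, here is a minimal Python sketch of how the batching and resume logic could look. It is only an illustration under a few assumptions: the sort field `member_id` is a hypothetical unique field on your documents, the `match_all` query stands in for whatever query selects your campaign users, and `search` is any callable you supply that sends a request body to Elasticsearch and returns the parsed response.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple


def build_page_request(
    batch_size: int, sort_field: str, after: Optional[List[Any]] = None
) -> Dict[str, Any]:
    """Build one search body; `after` holds the sort values of the last hit seen."""
    body: Dict[str, Any] = {
        "size": batch_size,
        "query": {"match_all": {}},  # placeholder: replace with the campaign query
        "sort": [{sort_field: "asc"}],  # sort field is assumed to be unique
    }
    if after is not None:
        body["search_after"] = after
    return body


def iter_batches(
    search: Callable[[Dict[str, Any]], Dict[str, Any]],
    sort_field: str,
    batch_size: int = 50,
    resume_after: Optional[List[Any]] = None,
) -> Iterator[Tuple[List[Dict[str, Any]], List[Any]]]:
    """Yield (hits, checkpoint) pairs until the index is exhausted.

    `checkpoint` is the sort value list of the last hit in the batch.
    Persist it after each processed batch; pass it back as `resume_after`
    to continue a failed campaign from that point.
    """
    after = resume_after
    while True:
        response = search(build_page_request(batch_size, sort_field, after))
        hits = response["hits"]["hits"]
        if not hits:
            return
        after = hits[-1]["sort"]
        yield hits, after
```

With the official Python client, `search` could be something like `lambda body: es.search(index="members", body=body)`, where the index name `members` is again hypothetical. The checkpoint you save replaces the stored scroll ID. One difference to keep in mind: unlike Scroll, these requests run against the live index rather than a frozen snapshot, so documents added or changed during the campaign can show up in later batches.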