We have multiple environments set up for our Elasticsearch system, and I am experiencing some odd behavior. In the dev environment, when we perform a bulk update, CPU utilization climbs toward its maximum and the ingest completes fairly quickly. This is expected, as we have a requirement to minimize ingestion time.
In another environment with similar, if not better, specs, the CPU utilization does not come anywhere near what it did in the dev environment.
My question is: what factors determine how much CPU Elasticsearch will use when performing a bulk update? Our workload is much, much heavier on ingestion and comparatively light on search, so are there any suggestions on what we can do to optimize for this?
What I have tried so far is increasing the number of threads allocated to the "write" thread pool, but I found its size has a hard upper limit of 1 + the number of allocated cores, so there is not much to do there.
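For concreteness, this is roughly what I experimented with in elasticsearch.yml (assuming an 8-core node; the values are illustrative):

```yaml
# Static settings, so a node restart is required.
# The write pool is a fixed pool whose size cannot exceed 1 + the number of
# allocated processors, so 9 is the ceiling on an 8-core node.
thread_pool.write.size: 9
# The queue can also be sized, but a bigger queue only buffers more pending
# requests; it does not add indexing throughput. 10000 is the default.
thread_pool.write.queue_size: 10000
```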
When evaluating the other resources and their utilization during the bulk update:
- Disk utilization briefly peaks at ~60%, with an average of around 4-5%.
- CPU utilization averages 10-20%, with a maximum of 70%.
- Network bandwidth utilization is around 20-25%.
- Memory used by Elasticsearch is constant at around 55%.
Curious to hear what people think, and thank you for taking the time to read my question.
To follow up on this thread.
Initially I had a single-node setup, then realized that creating a dedicated ingest node might help with separation of concerns, as well as with isolating the resources dedicated to ingestion from those dedicated to search.
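For reference, the role split for a dedicated ingest node looks roughly like this in elasticsearch.yml (a sketch of what I set up; the data-node roles shown are illustrative):

```yaml
# On the dedicated ingest node: run ingest pipelines only
node.roles: [ ingest ]

# On the data nodes: hold the shards and do the actual indexing work
node.roles: [ data, master ]
```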
However, it now seems that this dedicated ingest node is not being utilized anywhere near as heavily as the single-node topology was, so I know I am missing something. Another piece of this puzzle is that when I attempt the bulk update, the NEST client always reports that the bulk update is failing.
I had set the retry count to 5, so I imagine that may be part of why the bulk update is taking so long, since everything is being retried up to 5 times. So my goal is to figure out how to send the right amount of data to the cluster so that very few, if any, requests fail.
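The knobs I am playing with on the client side look roughly like this (a minimal sketch; the index name, document type, batch size, and parallelism below are placeholders, not our real values):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Nest;

public class MyDoc
{
    public int Id { get; set; }
    public string Payload { get; set; }
}

public static class BulkSketch
{
    public static void Run()
    {
        var client = new ElasticClient(new ConnectionSettings(new Uri("http://localhost:9200")));

        IEnumerable<MyDoc> documents = Enumerable.Range(1, 100_000)
            .Select(i => new MyDoc { Id = i, Payload = $"doc {i}" });

        // BulkAll streams the documents in fixed-size batches, so the batch size and the
        // degree of parallelism (rather than the retry count) control how hard the cluster is pushed.
        var bulkAll = client.BulkAll(documents, b => b
            .Index("my-index")              // placeholder index name
            .Size(1000)                     // documents per bulk request
            .MaxDegreeOfParallelism(4)      // concurrent bulk requests in flight
            .BackOffRetries(5)              // retries per failed batch
            .BackOffTime("15s")             // wait between retries
        );

        // Blocks until the whole stream has been indexed (or the observable errors out).
        bulkAll.Wait(TimeSpan.FromMinutes(30),
            next => Console.WriteLine($"Indexed page {next.Page}, retries taken: {next.Retries}"));
    }
}
```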
If someone could point me to where in the logs I can find the reason why a bulk index would fail, that would help tremendously, since the NEST client is only communicating: "Bulk indexing failed and after 1 retry."
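In case it helps anyone else hitting the same wall, surfacing a per-document reason on the client side should look something like this, continuing the BulkAll sketch above (the callback names are my understanding of what the bulk helper exposes; treat this as an unverified sketch):

```csharp
var bulkAll = client.BulkAll(documents, b => b
    .Index("my-index")
    .Size(1000)
    .ContinueAfterDroppedDocuments()        // keep streaming instead of halting on a dropped document
    .DroppedDocumentCallback((item, doc) =>
        // item.Error carries the server-side reason for that individual document
        Console.WriteLine($"Doc {doc.Id} failed: {item.Error?.Type} - {item.Error?.Reason}"))
);
```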
I have found the issue I was running into: the culprit was that my enrich policy was failing, which triggered the retries in the update. This was a really good learning experience for me, as it speaks to the importance of documenting the commands used to set up the data stores, since everything should be exactly the same from environment to environment. (My dev environment had "ignore_failure": true set, while the other did not.)
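For reference, the difference between the environments came down to something like this in the ingest pipeline definition (the pipeline, policy, and field names here are made up; the only part that mattered was the ignore_failure flag):

```
PUT _ingest/pipeline/my-enrich-pipeline
{
  "processors": [
    {
      "enrich": {
        "policy_name": "my-enrich-policy",
        "field": "lookup_key",
        "target_field": "enriched",
        "ignore_failure": true
      }
    }
  ]
}
```

With "ignore_failure": true the failing enrich step is silently skipped, which is why dev appeared healthy while the other environment kept failing and retrying.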
Anyway, through my learnings I'll propose a few questions, in case there are any Elastic folks out there reading this thread.
Do errors with information identifying the source of the bulk update failure ever surface back to the NEST client? If not, would it be possible to offer some high-level information back to the client in a future release?
Adjacent to the above request: is there a way to tell the NEST client, on "BulkAll" (perhaps via the BulkAllRequest), which errors should and should not be retried? For example, I would of course want to retry on some set of transient errors, but not on a failed ingest pipeline, because no matter how many times I retry, it will always fail. (See the sketch below for the shape of control I have in mind.)
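To make that concrete, here is the shape of hook I have in mind, sketched against the BulkAll descriptor from the earlier example (RetryDocumentPredicate is my assumption of what such a hook would look like; I have not verified it against our client version):

```csharp
// Decide per failed item whether a retry is worth it:
// retry only transient 429 "too many requests" rejections, never a permanently failing pipeline.
.RetryDocumentPredicate((item, doc) => item.Status == 429)
```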
Anyway, for folks who have read or visited this post, I hope this offers some ideas on what you could try when debugging a similar issue.