My average doc size is around 1 KB. I do use scripted updates, but only with update_by_query, and I do not see anything alarming about those, as all my other thread pools seem to work blazingly fast.
Initially I was trying with a single thread; now I am running 10 threads. Performance improved, but the requests still take a long time to process.
I will try increasing the bulk size and let you know.
Is there perhaps a single monster-size doc that keeps growing?
I remember the story of a user with a large (>1GB !?) document that was continually being added to.
Can you perhaps show us exactly what you are doing? What does one of your requests look like? Do you have monitoring installed so you can share stats on indexing etc?
This is the current state of my cluster. Do you see any abnormalities?
I have fiddled with the number of items per bulk request and the number of threads running. Now I am managing about 8000 update requests per 20 minutes. It is still very slow. I do not see the delays in the monitoring graphs.
I converted my code to do GET/INSERT instead of using Update or Bulk Update. I managed to update 1M documents in 10 minutes running 500 threads. This includes pulling the document, updating the fields, and inserting it back into ES.
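To make the GET/INSERT pattern concrete, here is a minimal sketch of what such a worker could do, assuming a plain HTTP client against a local cluster; the endpoint, function names, and the shallow-merge step are illustrative, not the poster's actual code:

```python
import json
from urllib.request import Request, urlopen

ES = "http://localhost:9200"  # assumed endpoint for illustration


def merge_fields(source: dict, changes: dict) -> dict:
    """Shallow-merge changed fields into the fetched _source,
    mimicking what a partial update would do server-side."""
    updated = dict(source)
    updated.update(changes)
    return updated


def get_then_reindex(index: str, doc_type: str, doc_id: str, changes: dict) -> None:
    """GET the document, apply the changes client-side, and PUT the
    full document back (a plain index operation, not an update)."""
    url = f"{ES}/{index}/{doc_type}/{doc_id}"
    with urlopen(url) as resp:  # GET the current document
        source = json.load(resp)["_source"]
    body = json.dumps(merge_fields(source, changes)).encode()
    req = Request(url, data=body,  # PUT (index) the merged document
                  headers={"Content-Type": "application/json"}, method="PUT")
    urlopen(req).close()
```

Each of the 500 threads would simply call `get_then_reindex` in a loop over its share of document IDs.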
My cluster was running mighty fine during this process and would definitely accept more inserts if I increased the number of threads. No issues with CPU, disk, or memory.
This leads me to believe that there is something really wrong with the way ES Update works.
This is my latest index info. Since I changed the code, everything is working smoothly and updating the documents only takes milliseconds. The red line indicates the time when I updated the code to use a get/insert combination instead of bulk updates.
I did not notice any change in the index rate, and I am not sure whether ES counts updates toward it, since it should have increased drastically if it did.
I believe there is an issue with bulk updates and hope it gets fixed soon.
Have you done any profiling of this situation? For example, have you used the hot threads API to see what the shards are doing when executing the bulk requests? Have you attached a profiler to understand where the shard is spending its time when executing the bulk request? You should do this on a node holding a shard executing the bulk request, not the coordinating node receiving the bulk request.
Unfortunately I have migrated the code, and since the issue is happening on my production cluster I cannot revert to using bulk for testing purposes. I would have loved to know about these APIs sooner so I could have helped diagnose the issue.
Hey All,
The same problem:
Elasticsearch 5.2 with x-pack
AWS EC2 2 x i3.2xlarge RAM 61GB (31GB heap ), SSD
Ubuntu 16.04
A few indices with 64 shards and 1 replica each (360 GB total index size)
_bulk updates are very slow, with high CPU
_bulk indexing is fast
Example:
POST _bulk
{"update":{"_index":"cc_3","_type":"job","_id":"2124_cca74860dae5c0f7832d846823873808_228_i"}}
{"doc":{"additional_fields":null}}
{"update":{"_index":"cc_3","_type":"job","_id":"2124_cca74860dae5c0f7832d846823873808_228_i"}}
{"doc":{"additional_fields":{"ats":"none"}}}
Likewise, same problem with ES 5.2 - bulk updates are extremely slow (compared to bulk indexing), we had to apply client-side workaround to reduce the number of updates we are making since ES was simply not keeping up.
Can we get someone from Elastic to look into this? It's a major problem that has been reported by multiple users, not only in this thread but others as well (Elasticsearch bulk update is extremely slow) and so far without any traction.
Glad I am not the only one who has reported on this. Well at least now I know that I was not doing anything wrong with bulk updates.
I honestly don't see why Elastic is ignoring this. I know this is probably something that is not going to be easy to fix as an incremental update to 5.x but at least put it on the roadmap for 6.x
The reason why it's slow is probably the following:
As of 5.0.0 the get API will issue a refresh if the requested document has been changed since the last refresh but the change hasn’t been refreshed yet. This will also make all other changes visible immediately. This can have an impact on performance if the same document is updated very frequently using a read modify update pattern since it might create many small segments. This behavior can be disabled by passing realtime=false to the get request.
An update operation is effectively a GET + INSERT.
The solution would be not to frequently update the same documents again and again. If that's a necessity, you should batch these updates on the application level.
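One way to batch at the application level is to coalesce pending partial updates per document ID before flushing a single bulk request, so each flush sends at most one update action per document. A minimal sketch, with illustrative names:

```python
class UpdateCoalescer:
    """Merge repeated partial updates to the same _id so that each
    flush sends at most one update action per document, avoiding the
    repeated refresh-on-get cost of frequent updates to one doc."""

    def __init__(self):
        self.pending = {}  # _id -> merged partial doc

    def add(self, doc_id, partial):
        """Record a partial update; later fields overwrite earlier ones."""
        self.pending.setdefault(doc_id, {}).update(partial)

    def flush(self):
        """Return the coalesced (doc_id, partial) pairs and reset the buffer."""
        batch = list(self.pending.items())
        self.pending.clear()
        return batch
```

The application would call `add` on every change and periodically send one bulk request built from `flush()`.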
It is unlikely this is the cause - fetching the document over the network (GET), modifying it and reindexing it (INSERT) is still faster than update by several orders of magnitude.