I am having an issue after upgrading my cluster from ES 1.7 to 5.2. The reindexing and upgrade were done two days ago, and I did not have this issue on 1.7 before upgrading.
The bulk update tasks are taking a long time to finish, sometimes even reaching an hour. My cluster consists of 5 nodes running on AWS EC2 with 50 shards and 1 replica. All my EBS volumes are provisioned-IOPS volumes with 10,000 IOPS.
The EC2 instances are m4.2xlarge with 8 vCPUs and 32GB of RAM. The heap size is set to 15GB.
I am closely monitoring the nodes and I do not see anything that should cause the slowness. Searching is blazing fast, CPU and memory usage are well within the accepted ranges, and the used heap percentage is around the 65% mark. CPU averages around 20%.
Iostat shows very normal behaviour:
Linux 3.13.0-107-generic (ip-10-0-5-168) 03/21/2017 x86_64 (16 CPU)
The original 1.7 cluster had the same number of shards and we did not want to change it. We also want to maintain a single index, so we created 50 shards to keep them small and to be able to scale horizontally in the future. Currently the index holds 75M documents and is about 200GB.
Yes, we tried increasing the bulk size and it made matters worse.
The updates are partial doc updates. I have tried 100 updates per bulk request, which is my default, as well as 1000 and 50 per request. No difference in performance whatsoever.
One extra thing I noticed today, not sure if it is relevant: almost all the bulk workload is handled by a single node, and that node is not the master node.
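If anyone wants to see the distribution I mean, a request along these lines against the cat thread pool API shows the per-node bulk activity; the columns are just the ones I find useful, nothing official:

# in 5.x the write thread pool is still named "bulk"
GET _cat/thread_pool/bulk?v&h=node_name,name,active,queue,rejected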
What size are the documents you are updating? Are you using scripted updates? Have you tried increasing the number of threads issuing updates?
As Elasticsearch 5.x syncs to disk much more frequently than earlier versions in order to enhance durability, I would expect larger bulk sizes to improve performance, especially as you are indexing into a large number of shards. Does performance change if you try with an even larger bulk size, e.g. 10000?
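If the additional fsyncing does turn out to be the bottleneck, another thing you could experiment with is relaxing the translog durability on the index, something along these lines (my_index is a placeholder, and async durability means a node crash can lose the last few seconds of operations):

# my_index is a placeholder; async trades durability for fewer fsyncs
PUT /my_index/_settings
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "5s"
}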
My average doc size is around 1KB. I do use scripted updates, but only with update_by_query, and I do not see anything alarming about those as all my other thread pools seem to work blazingly fast.
Initially I was trying with a single thread; now I am running 10 threads. Performance improved, but the requests still take a long time to process.
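For context, my scripted updates look roughly like this (the field names, values and query below are placeholders, not my real mapping):

# illustrative only: field names and the query are placeholders
POST /my_index/_update_by_query
{
  "script": {
    "lang": "painless",
    "inline": "ctx._source.status = params.status",
    "params": { "status": "processed" }
  },
  "query": {
    "term": { "pending": true }
  }
}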
I will try increasing the bulk size and let you know.
Is there perhaps a single monster-size doc that keeps growing?
I remember the story of a user with a large (>1GB !?) document that was continually being added to.
Can you perhaps show us exactly what you are doing? What does one of your requests look like? Do you have monitoring installed so you can share stats on indexing etc?
This is the current state of my cluster. Do you see any abnormalities?
I have fiddled with the number of items per bulk request and the number of threads running. Now I am managing about 8000 update requests per 20 minutes, which is still very slow. I do not see the delays reflected in the monitoring graphs.
I converted my code to do GET/INSERT instead of using Update or Bulk Update. I managed to update 1M documents in 10 minutes running 500 threads. This includes pulling the document, updating the fields, and inserting it back into ES.
My cluster was running mighty fine during this process and would definitely accept more inserts if I wished to increase the number of threads. No issues with CPU, disk or memory.
This leads me to believe that there is something really wrong with the way ES Update works.
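For anyone curious, the replacement flow is essentially a GET of the document followed by a full PUT of the modified source, roughly like this (the index, type, id and fields are placeholders):

# placeholders: jobs/job/1 and the fields below are illustrative
GET /jobs/job/1

PUT /jobs/job/1
{
  "title": "Software Engineer",
  "additional_fields": { "ats": "none" }
}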
This is my latest index info. Since I changed the code everything has been working smoothly and updating the documents takes only milliseconds. The red line indicates the time when I updated the code to use the get/insert combination instead of bulk updates.
I did not notice any change in the index rate, and I am not sure whether ES counts updates towards this rate, as it should have increased drastically.
I believe there is an issue with bulk updates and hope it gets fixed soon.
Have you done any profiling of this situation? For example, have you used the hot threads API to see what the shards are doing when executing the bulk requests? Have you attached a profiler to understand where the shard is spending its time when executing the bulk request? You should do this on a node holding a shard executing the bulk request, not the coordinating node receiving the bulk request.
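For reference, the hot threads API can be pointed at a specific data node, along these lines (node-2 and the parameters are just examples):

# sample all nodes, or target the node holding the busy shard
GET /_nodes/hot_threads?threads=3&interval=500ms
GET /_nodes/node-2/hot_threads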
Unfortunately I have already migrated the code, and since the issue is happening on my production cluster I cannot revert back to using bulk for testing purposes. I would have loved to have known about these tools sooner so I could help fix the issue.
Hey all,
I am seeing the same problem:
Elasticsearch 5.2 with x-pack
AWS EC2: 2 x i3.2xlarge, 61GB RAM (31GB heap), SSD
Ubuntu 16.04
A few indexes with 64 shards and 1 replica (360GB total index size)
_bulk updates are very slow, with high CPU
_bulk indexing is fast
Example:
POST _bulk
{"update":{"_index":"cc_3","_type":"job","_id":"2124_cca74860dae5c0f7832d846823873808_228_i"}}
{"doc":{"additional_fields":null}}
{"update":{"_index":"cc_3","_type":"job","_id":"2124_cca74860dae5c0f7832d846823873808_228_i"}}
{"doc":{"additional_fields":{"ats":"none"}}}