Bulk updates are extremely slow after upgrading to 5.2

I am having an issue after upgrading my cluster from ES 1.7 to 5.2. The reindexing and upgrade were done two days ago, and I did not have this issue on 1.7 before upgrading.

The bulk update tasks are taking a long time to finish, sometimes over an hour. My cluster consists of 5 nodes running on AWS EC2, with 50 shards and 1 replica. All my EBS volumes are provisioned-IOPS volumes with 10000 IOPS.

The EC2 instances are m4.2xlarge with 8 vCPUs and 32 GB of RAM. The heap size is set to 15 GB.

I am closely monitoring the nodes and I do not see anything that should cause the slowness. Searching is blazing fast, CPU and memory usage are well within acceptable ranges, used heap is around the 65% mark, and CPU averages around 20%.

Iostat shows very normal behaviour:

Linux 3.13.0-107-generic (ip-10-0-5-168) 03/21/2017 x86_64 (16 CPU)

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          29.28   0.00     0.26     0.59    0.07  69.80

Device:   tps     kB_read/s  kB_wrtn/s  kB_read    kB_wrtn
xvdf      222.49  4962.23    3983.79    88110729   70737216

And this is the output for _cat/nodes:

ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.0.5.16 42 94 6 1.04 1.16 0.61 mdi - NODE_01
10.0.4.74 53 99 37 6.37 6.36 5.92 mdi - NODE_04
10.0.5.60 68 99 21 4.11 5.00 4.18 mdi - NODE_02
10.0.5.168 69 99 31 4.99 5.14 4.57 mdi * NODE_03
10.0.4.161 72 99 13 1.08 1.70 1.45 mdi - NODE_05

And this is a segment of the tasks API response:

Each bulk call consists of up to 100 update requests. Each request is about 300 bytes, and I have 10 threads running and sending bulk requests.

I have set the index refresh interval to -1 but there was no noticeable difference.
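
This is roughly how I applied it, via the index settings API (the index name here matches my bulk samples below):

PUT /backup_index/_settings
{
  "index" : {
    "refresh_interval" : "-1"
  }
}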

I do not understand why I am having this issue. What might be the cause, and how can I speed things up?
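
For completeness, I am watching the long-running bulk tasks with something along these lines (exact parameters may differ):

GET _tasks?actions=*bulk*&detailed=true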

There are major changes in 5.2, including how we sync the translog, which may impact this.

A few things:

  1. Why so many shards?
  2. Have you tried increasing the bulk size?

The original 1.7 index had the same number of shards and we did not want to change that. Plus, we want to maintain a single index, so we created 50 shards to keep them small and to be able to scale horizontally in the future. Currently the index holds 75M documents and is about 200 GB.

Yes, we tried increasing the bulk size and it made matters worse.

I have changed the translog durability to async with a 30s sync interval, and there is still no change in performance.
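
This is roughly the settings change I applied (both settings are dynamic, as far as I can tell):

PUT /backup_index/_settings
{
  "index.translog.durability" : "async",
  "index.translog.sync_interval" : "30s"
}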

Do you by any chance have a lot of different types or non-uniform mappings that may cause frequent updates to mappings and the cluster state?

I have 6 types, all with different mappings. I am not sure what you mean by mapping updates.

That should be fine. Exactly how are you performing the updates? What other bulk sizes did you try?

The updates are partial doc updates. I have tried 100 updates per bulk request, which is my default. I also tried 1000 per request, and 50. No difference in performance whatsoever.
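
For clarity, outside of _bulk a single one of these partial updates would look something like this (one document ID and one field taken from my data, purely as an illustration):

POST backup_index/documents/C5DBE5CD49E1425AABC146CDAEE7F3AF/_update
{
  "doc" : {
    "processed" : true
  }
}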

One extra thing I noticed today, not sure if relevant, is that almost all the bulk workload is dedicated to a single node. The node itself is not the master node:

node_name name active queue rejected
Node_06  bulk      5     0        0
Node_07  bulk     18     0        0
Node_04  bulk      5     0        0
Node_02  bulk     50    82       0
Node_05  bulk      9     0        0
Node_08  bulk      7     0        0
Node_03  bulk     13     0        0
Node_09  bulk      8     0        0
Node_01  bulk      0     0        0

Notice how Node_02 has 50 active bulk tasks and 82 in the queue, yet the rest of the cluster nodes are under a very small load.

And as you can see I also increased the number of nodes in the cluster from 5 to 9. Still no performance change.
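
For reference, the thread pool stats above come from the cat thread pool API, something like:

GET _cat/thread_pool/bulk?v&h=node_name,name,active,queue,rejected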

What size are the documents you are updating? Are you using scripted updates? Have you tried increasing the number of threads issuing updates?

As Elasticsearch 5.x syncs to disk much more frequently than earlier versions in order to enhance durability, I would expect larger bulk sizes to improve performance, especially as you are indexing into a large number of shards. Does performance change if you try with an even larger bulk size, e.g. 10000?

My average doc size is around 1KB. I do use scripted updates, but only with update_by_query, and I do not see anything alarming about those, as all my other thread pools seem to be working blazingly fast.
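
Those update_by_query calls look roughly like this; the query and script bodies below are simplified stand-ins, not my real ones:

POST backup_index/_update_by_query
{
  "query" : {
    "term" : { "processed" : false }
  },
  "script" : {
    "inline" : "ctx._source.processed = true",
    "lang" : "painless"
  }
}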

Initially I was using a single thread; now I am running 10 threads. Performance improved, but the requests still take a long time to process.

I will try increasing the bulk size and let you know.

Is there perhaps a single monster-size doc that keeps growing?
I remember the story of a user with a large (>1GB !?) document that was continually being added to.

No, all documents are roughly the same size. I do have a few very large documents, but those are never updated.

Can you perhaps show us exactly what you are doing? What does one of your requests look like? Do you have monitoring installed so you can share stats on indexing etc?

Here is a sample bulk fragment (newline-delimited, as sent to _bulk); all other update requests look like this:

{ "update" : { "_id" : "C5DBE5CD49E1425AABC146CDAEE7F3AF", "_index" : "backup_index", "_type" : "documents" } }
{ "doc" : { "cont_length" : 0, "content_ukw2" : [""], "language_description2" : 0, "pagecount" : 1, "wordcount" : 0, "language" : 0, "doctype" : 3, "resolution" : 0, "thumb_place_x" : 90, "thumb_place_y" : 90, "image_orientation" : 0, "frames_count" : 4, "processed" : true } }
{ "update" : { "_id" : "E566319089024F39A53F74E4E1AC6036", "_index" : "backup_index", "_type" : "documents" } }
{ "doc" : { "cont_length" : 0, "content_en" : [""], "language_description2" : 0, "pagecount" : 1, "wordcount" : 0, "title" : "Time Matters", "title_sort_ducet" : "Time Matters", "language" : 5, "doctype" : 3, "resolution" : 0, "thumb_place_x" : 90, "thumb_place_y" : 90, "image_orientation" : 0, "frames_count" : 4, "processed" : true } }

I am now sending 10000 requests per bulk, and the tasks show requests running for over an hour without completing. I am not sure I can wait any longer.

I don't have monitoring enabled. Will install it now and check.

This is the current state of my cluster. Do you see any abnormalities?

I have fiddled with the number of items per bulk request and the number of threads running. Now I am managing about 8000 update requests per 20 minutes. It is still very slow. I do not see the delays in the monitoring graphs.

I converted my code to do GET/INSERT instead of using Update or Bulk Update. I managed to update 1M documents in 10 minutes running 500 threads. This includes pulling the document, updating the fields and inserting it back to ES.
My cluster was running mighty fine during this process, and it would definitely accept more inserts if I wished to increase the number of threads. No issues with CPU, disk or memory.

This leads me to believe that there is something really wrong with the way ES Update works.
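
In REST terms, the new flow is essentially a get followed by a full re-index of the document, something like this (the PUT body is abbreviated here; in practice it carries the full updated source):

GET backup_index/documents/C5DBE5CD49E1425AABC146CDAEE7F3AF

PUT backup_index/documents/C5DBE5CD49E1425AABC146CDAEE7F3AF
{
  "doctype" : 3,
  "processed" : true
}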

This is my latest index info. Since I changed the code, everything is working smoothly and updating the documents only takes milliseconds. The red line indicates the time when I updated the code to use a get/insert combination instead of bulk updates.

I did not notice any change in the index rate, and I am not sure whether ES counts updates in this rate, as it should have increased drastically.

I believe there is an issue with bulk updates and hope it gets fixed soon.

Have you done any profiling of this situation? For example, have you used the hot threads API to see what the shards are doing when executing the bulk requests? Have you attached a profiler to understand where the shard is spending its time when executing the bulk request? You should do this on a node holding a shard executing the bulk request, not the coordinating node receiving the bulk request.
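
For the hot threads part, something along these lines against one of the busy data nodes would be a good start (NODE_02 is just an example taken from your _cat/nodes output):

GET _nodes/NODE_02/hot_threads?threads=5&interval=500ms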

Unfortunately I have migrated the code and since the issue is happening on my production cluster I cannot revert back to using bulk for testing purposes. I would have loved to have known about these sooner so I might help fix the issue.

Hey All,
The same problem:
Elasticsearch 5.2 with X-Pack
AWS EC2: 2 x i3.2xlarge, 61 GB RAM (31 GB heap), SSD
Ubuntu 16.04
A few indices with 64 shards and 1 replica (360 GB total index size)
_bulk updates are very slow, with high CPU
_bulk indexing is fast

Example:

POST _bulk
{"update":{"_index":"cc_3","_type":"job","_id":"2124_cca74860dae5c0f7832d846823873808_228_i"}}
{"doc":{"additional_fields":null}}
{"update":{"_index":"cc_3","_type":"job","_id":"2124_cca74860dae5c0f7832d846823873808_228_i"}}
{"doc":{"additional_fields":{"ats":"none"}}}

The response took is 300-600 ms.

Indexing documents with _bulk has a took of 5-6 ms.

On Elasticsearch 2.4.1 everything works fine.

https://www.evernote.com/l/ApaWh-26Q5FA1JS1frd1ezxkHhdhjRnrWFI

https://www.evernote.com/l/ApZiq2B2IdROzJtTnAWH23y_8-1TGspyBj8