Bulk updates are extremely slow after upgrading to 5.2

Zaid_Amir · March 21, 2017, 3:18pm

I am having an issue after upgrading my cluster from ES 1.7 to 5.2. The reindexing and upgrade were done two days ago and I did not have an issue on 1.7 before upgrading.

The bulk update tasks are taking a long time to finish, sometimes it even reaches an hour. My cluster consists of 5 nodes running on AWS EC2 with 50 shards and 1 replica. All my EBS volumes are IOPS provisioned with 10000 IOPS.

The EC2 instances are m4.2xlarge with 8 vCPUs and 32GB of ram. The heap size is set to 15GB.

I am closely monitoring the nodes and I do not see anything that should cause the slowness. Searching is blazing fast, CPU and memory usage are well within the accepted ranges and the used heap percentage is around the 65% mark. CPU is around 20% average.

Iostat shows very normal behaviour:

Linux 3.13.0-107-generic (ip-10-0-5-168) 03/21/2017 x86_64 (16 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
29.28 0.00 0.26 0.59 0.07 69.80

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
xvdf 222.49 4962.23 3983.79 88110729 70737216

And this is the output for _cat/nodes:

ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.0.5.16 42 94 6 1.04 1.16 0.61 mdi - NODE_01
10.0.4.74 53 99 37 6.37 6.36 5.92 mdi - NODE_04
10.0.5.60 68 99 21 4.11 5.00 4.18 mdi - NODE_02
10.0.5.168 69 99 31 4.99 5.14 4.57 mdi * NODE_03
10.0.4.161 72 99 13 1.08 1.70 1.45 mdi - NODE_05

And this is a segment of the tasks API response:

Each bulk call consists of up to 100 update requests. Each request is about 300 bytes. and I have 10 threads running and sending bulk requests.

I have set the index refresh interval to -1 but there was no noticeable difference.

I do not understand why I am having this issue. What might be the cause of this. And how can I speed things up?

warkolm · March 21, 2017, 8:31pm

There are major changes in 5.2, including how we sync the translog, which may impact this.

A few things;

Why so many shards?
Have you tried increasing the bulk size?

Zaid_Amir · March 21, 2017, 8:45pm

The original 1.7 had the same amount of shards and we did not want to change it. Plus we want to maintain a single index so we created 50 shards to keep them small and to be able to scale horizonally in the future. Currently the index hold 75M documents and is about 200GB

Yes we tried increasing the bulk size and it made matters worse.

Zaid_Amir · March 22, 2017, 8:22am

I have changed the translog durability to async with a 30s interval. And still no change in performance

Christian_Dahlqvist · March 22, 2017, 8:35am

Do you by any chance have a lot of different types or non-uniform mappings that may cause frequent updates to mappings and the cluster state?

Zaid_Amir · March 22, 2017, 8:53am

I have 6 types. All have different mappings. Not sure what you mean by update mapping?

Christian_Dahlqvist · March 22, 2017, 8:58am

That should be fine. Exactly how are you performing the updates? What other bulk sizes did to try?

Zaid_Amir · March 22, 2017, 9:06am

The updates are partial doc updates. I have tried 100 bulk updates per requests which is my default. I tried setting it to 1000 per request and 50. No difference in performance whatsoever.

One extra thing I noticed today, not sure if relevant, is that almost all the bulk workload is dedicated to a single node. The node itself is not the master node:

node_name name active queue rejected
Node_06  bulk      5     0        0
Node_07  bulk     18     0        0
Node_04  bulk      5     0        0
Node_02  bulk     50    82       0
Node_05  bulk      9     0        0
Node_08  bulk      7     0        0
Node_03  bulk     13     0        0
Node_09  bulk      8     0        0
Node_01  bulk      0     0        0

Notice how Node_2 has 50 active bulks and 82 in queue yet the rest of the cluster nodes are working on a very small load.

And as you can see I also increased the number of nodes in the cluster from 5 to 9. Still no performance change.

Christian_Dahlqvist · March 22, 2017, 9:14am

What size are the documents you are updating? Are you using scripted updates? Have you tried increasing the number of threads issuing updates?

As Elasticsearch 5.x syncs to disk much more frequently than earlier versions in order to enhance durability, I would expect larger bulk sizes to improve performance, especially as you are indexing into a large number of shards. Does performance change if you try with an even larger bulk size, e.g. 10000?

Zaid_Amir · March 22, 2017, 9:24am

My average doc size is around 1KB. I do use scripted updates but only with update_by_query and i do not see anything alarming about those as all my other thread polls seems to work blazingly fast.

Initially i was trying with a single thread now i am running 10 threads. Performance improved but still the requests take a long time to process.

I will try increasing the bulk size and let you know.

Mark_Harwood · March 22, 2017, 9:26am

Is there perhaps a single monster-size doc that keeps growing?
I remember the story of a user with a large (>1GB !?) document that was continually being added to.

Zaid_Amir · March 22, 2017, 9:29am

No all documents are relatively the same size. I do have very few large documents but those are never updated

Christian_Dahlqvist · March 22, 2017, 9:44am

Can you perhaps show us exactly what you are doing? What does one of your requests look like? Do you have monitoring installed so you can share stats on indexing etc?

Zaid_Amir · March 22, 2017, 10:57am

Here is a sample bulk fragment, note all other update requests look like this:

{
	"update" : {
		"_id" : "C5DBE5CD49E1425AABC146CDAEE7F3AF",
		"_index" : "backup_index",
		"_type" : "documents"
	}
}{
	"doc" : {
		"cont_length" : 0,
		"content_ukw2" : [""],
		"language_description2" : 0,
		"pagecount" : 1,
		"wordcount" : 0,
		"language" : 0,
		"doctype" : 3,
		"resolution" : 0,
		"thumb_place_x" : 90,
		"thumb_place_y" : 90,
		"image_orientation" : 0,
		"frames_count" : 4,
		"processed" : true
	}
}{
	"update" : {
		"_id" : "E566319089024F39A53F74E4E1AC6036",
		"_index" : "backup_index",
		"_type" : "documents"
	}
}{
	"doc" : {
		"cont_length" : 0,
		"content_en" : [""],
		"language_description2" : 0,
		"pagecount" : 1,
		"wordcount" : 0,
		"title" : "Time Matters",
		"title_sort_ducet" : "Time Matters",
		"language" : 5,
		"doctype" : 3,
		"resolution" : 0,
		"thumb_place_x" : 90,
		"thumb_place_y" : 90,
		"image_orientation" : 0,
		"frames_count" : 4,
		"processed" : true
	}
}

I am now sending 10000 requests per bulk and the tasks show requests running for over an hour without completing not sure I can wait any longer.

I don't have monitoring enabled. Will install it now and check.

Zaid_Amir · March 23, 2017, 10:23am

This is the current state of my cluster. Do you see any abnormalities?

I have fiddled with the number of items per bulk request and the number of threads running. Now I am managing about 8000 update requests per 20 minutes. It is still very slow. I do not see the delays in the monitoring graphs.

Zaid_Amir · March 26, 2017, 3:27pm

I converted my code to do GET/INSERT instead of using Update or Bulk Update. I managed to update 1M documents in 10 minutes running 500 threads. This includes pulling the document, updating the fields and inserting it back to ES.
My Cluster was running mighty fine during this process and it would definitely accept more inserts if I wished to increase the number of threads. No issues with CPU, Disk or Memory

This leads me to believe that there is something really wrong with the way ES Update works.

Zaid_Amir · March 29, 2017, 12:17pm

This is my latest index info. Since I change the code everything is working smoothly and updating the documents only takes milliseconds. The red line indicates the time when I updated the code to use a get/insert combination instead of bulk updates.

I did not notice any changes in the index rate and I am not sure if ES counts updates with this rate as it should increase drastically.

I believe there is an issue with bulk updates and hope it gets fixed soon.

jasontedor · March 29, 2017, 2:57pm

Have you done any profiling of this situation? For example, have you used the hot threads API to see what the shards are doing when executing the bulk requests? Have you attached a profiler to understand where the shard is spending it time when executing the bulk request? You should do this on a node holding a shard executing the bulk request, not the coordinating node receiving the bulk request.

Zaid_Amir · March 29, 2017, 3:40pm

Unfortunately I have migrated the code and since the issue is happening on my production cluster I cannot revert back to using bulk for testing purposes. I would have loved to have known about these sooner so I might help fix the issue.

dizzzyroma · March 31, 2017, 12:44pm

Hey All,
The same problem:
Elasticsearch 5.2 with x-pack
AWS EC2 2 x i3.2xlarge RAM 61GB (31GB heap ), SSD
Ubuntu 16.04
A few indexes with 64 shards and 1 replica (360GB indexes size)
Too slow _bulk update with high CPU
_bulk indexing is fast

Example:

POST _bulk
{"update":{"_index":"cc_3","_type":"job","_id":"2124_cca74860dae5c0f7832d846823873808_228_i"}}
{"doc":{"additional_fields":null}}
{"update":{"_index":"cc_3","_type":"job","_id":"2124_cca74860dae5c0f7832d846823873808_228_i"}}
{"doc":{"additional_fields":{"ats":"none"}}}

Response took 300 - 600

Index document with _bulk took 5-6

On Elasticsearch 2.4.1 All works fine

https://www.evernote.com/l/ApaWh-26Q5FA1JS1frd1ezxkHhdhjRnrWFI

https://www.evernote.com/l/ApZiq2B2IdROzJtTnAWH23y_8-1TGspyBj8

Topic		Replies	Views
Elasticsearch bulk update is extremely slow Elasticsearch	11	11699	April 10, 2017
Elasticsearch 1.7 to 2.4 upgrade and slow indexing Elasticsearch	6	696	December 7, 2017
Bulk update is too slow elasticsearch 6.2 Elasticsearch	25	6828	June 4, 2018
Cluster extremely slow after many bulk indexes Elasticsearch	2	492	July 6, 2017
Significant time changes during index benchmarking Elasticsearch	4	417	July 6, 2017

Bulk updates are extremely slow after upgrading to 5.2

Related topics