Bulk update is too slow in Elasticsearch 6.2

Hi guys,

Here is my configuration:

ES version = 6.2
JVM heap = 30 GB
RAM = 128 GB
CPU = 24 cores
SDK = PHP

I am experiencing very slow bulk updates. I tried the suggested solutions, but they are not working. Any suggestion would be appreciated. Thanks.


How are you updating the documents? What performance are you seeing? What is the size and structure of your documents? How frequently is an individual document updated?

I am doing bulk updates of 500 documents at a time. Each document is approximately 32 KB. I have a queue system, so I wrote a daemon that continuously pops the queue in batches of 500 and sends the updates to ES.
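For context, the daemon loop looks roughly like this. It is a minimal sketch using the official elasticsearch-php client; `popBatch()` is a hypothetical stand-in for my queue, and the index name and document fields are placeholders:

```php
<?php
// Minimal sketch of the update daemon, assuming the official elasticsearch-php
// client. popBatch() is a hypothetical queue helper; names are placeholders.
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;

$client = ClientBuilder::create()->setHosts(['localhost:9200'])->build();

while (true) {
    $jobs = popBatch(500);      // hypothetical: pop up to 500 queued updates
    if (empty($jobs)) {
        usleep(100000);         // queue empty, back off for 100 ms
        continue;
    }

    $body = [];
    foreach ($jobs as $job) {
        // action/metadata line, followed by the partial document
        $body[] = ['update' => ['_index' => $job['index'], '_type' => 'doc', '_id' => $job['id']]];
        $body[] = ['doc' => $job['doc'], 'doc_as_upsert' => true];
    }

    $client->bulk(['body' => $body]);
}
```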


How many parallel update threads are you running? Are you using nested documents or parent-child? How many updates do you get per second?

20 PHP child processes are running. There are some nested docs as well as the main doc being updated. I am also using scripts. Most updates use 'doc_as_upsert'. With a batch size of 50, I am hitting 4 bulk API requests/sec, which means 50 * 4 = 200 docs updated/sec. I would like to update approximately 5k/sec.
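For reference, one entry in the bulk body looks roughly like this (a sketch with made-up index, id, and field names); the scripted form first, then the plain `doc_as_upsert` form:

```php
// Sketch of one bulk entry in each style (index, id, and field names are made up).
$body = [];

// Scripted update with an upsert fallback
$body[] = ['update' => ['_index' => 'my_index', '_type' => 'doc', '_id' => '42']];
$body[] = [
    'script' => [
        'source' => 'ctx._source.counter += params.inc',
        'lang'   => 'painless',
        'params' => ['inc' => 1],
    ],
    'upsert' => ['counter' => 1],
];

// Plain partial-document update with doc_as_upsert
$body[] = ['update' => ['_index' => 'my_index', '_type' => 'doc', '_id' => '43']];
$body[] = ['doc' => ['counter' => 1], 'doc_as_upsert' => true];
```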

What does CPU usage, disk I/O and iowait look like on the Elasticsearch host? How many indices and shards are you actively updating?

With `iostat` it's showing:

```
Linux 3.10.0-693.11.1.el7.centos.plus.x86_64

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           2.71   0.48     0.43     0.02    0.00  96.37
```

Do you see anything in the Elasticsearch logs, e.g. around slow GC or merging falling behind?

How frequently is an individual document updated? => two updates every 2 seconds per doc
How many indices and shards are you actively updating? => 30 indices with 10 shards, 1 replica
I am not seeing any logs related to GC or merging. Should I increase the threads, since CPU idle is 96%? Or is there any other config I am missing?

If you update a document that is still present in the transaction log and has not yet been written to a segment, this will trigger a refresh. If you're frequently updating the same documents, this will hurt performance as a refresh is an expensive operation.
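One way to check whether that is happening (a sketch, assuming an elasticsearch-php `$client` and a placeholder index name) is to watch the refresh stats while your updates are running:

```php
// Sketch: watch refresh stats while updates run; a rapidly climbing count can
// indicate updates of not-yet-flushed documents forcing extra refreshes.
// Assumes $client is an elasticsearch-php client; the index name is a placeholder.
$stats = $client->indices()->stats([
    'index'  => 'my_index',
    'metric' => 'refresh',
]);

$refresh = $stats['indices']['my_index']['total']['refresh'];
printf("refreshes: %d, time spent: %d ms\n",
    $refresh['total'], $refresh['total_time_in_millis']);
```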

How many nodes do you have in the cluster? How much data do you have in total?


Yes, I tried disabling/increasing/decreasing refresh_interval (0/-1/30) but did not get any success. I have a single-node cluster. Right now I have approximately 10 million documents, and the count keeps increasing.
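For reference, I was changing the setting roughly like this (a sketch assuming the elasticsearch-php client; the index name is a placeholder):

```php
// Sketch: how the refresh_interval experiments were applied (placeholder index name).
$client->indices()->putSettings([
    'index' => 'my_index',
    'body'  => [
        'index' => ['refresh_interval' => '30s'],   // also tried '-1' to disable refresh
    ],
]);
```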

As you are updating documents quite frequently, I do not necessarily think a longer refresh interval will help at all. It may actually be better to leave it at the default 1 second.

Hmm, okay. So will adding a node or increasing threads solve this problem? Or is there any important config I am missing?

I do not have a lot of experience with update-intensive use cases, so I am not sure how best to optimise this. Let's see if someone else chimes in.

Hi, I just noticed in my logs that 20 child processes sending bulk requests simultaneously achieve 4 bulk requests/sec in total, and a single process alone also achieves 4 bulk requests/sec. I think ES is not handling multiple connections for bulk. Is there any conf/setting for the bulk API?
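One thing I am checking is whether the bulk thread pool on the node is queuing or rejecting requests, roughly like this (a sketch assuming the elasticsearch-php `$client` from earlier; in 6.2 the pool is still named `bulk`):

```php
// Sketch: look for bulk thread-pool queuing/rejections, which would point at the
// node itself rather than connection handling. Assumes $client as in earlier sketches.
$stats = $client->nodes()->stats(['metric' => 'thread_pool']);

foreach ($stats['nodes'] as $nodeId => $node) {
    $bulk = $node['thread_pool']['bulk'];   // the pool is named 'bulk' in 6.2
    printf("%s bulk: active=%d queue=%d rejected=%d\n",
        $nodeId, $bulk['active'], $bulk['queue'], $bulk['rejected']);
}
```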

Man, I do, and it wasn't a good experience! :wink:

If you are updating the same document multiple times per second, and you have many threads of execution updating documents, you will bury Elasticsearch. Occasional bursts of updates: fine. Constant low-frequency updates: fine. Constant high-frequency updates: not fine. And it gets less fine the larger the doc size.

I would suggest using a different approach to handle the frequent updates. Perhaps an in-memory cache that periodically (i.e., once every minute or two) updates a record in ES.
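Something along these lines, as a very rough sketch (it reuses the hypothetical `popBatch()` queue helper and the `$client` from the daemon sketch above; the index name and merge logic are placeholders):

```php
// Rough sketch of coalescing: keep only the latest state per document in memory
// and flush one bulk request per interval, so each doc hits Elasticsearch at most
// once per minute no matter how often it changes upstream.
$pending  = [];            // doc id => latest partial document
$lastSent = time();

while (true) {
    foreach (popBatch(500) as $job) {      // hypothetical queue helper
        // later updates overwrite / merge over earlier ones for the same id
        $pending[$job['id']] = array_merge($pending[$job['id']] ?? [], $job['doc']);
    }

    if ($pending && (time() - $lastSent) >= 60) {
        $body = [];
        foreach ($pending as $id => $doc) {
            $body[] = ['update' => ['_index' => 'my_index', '_type' => 'doc', '_id' => $id]];
            $body[] = ['doc' => $doc, 'doc_as_upsert' => true];
        }
        $client->bulk(['body' => $body]);
        $pending  = [];
        $lastSent = time();
    }
}
```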


Thanks for the reply, loren. So is there any way I can update in real time? I have 2k to 3k updates per sec. How can I handle this in real time, or is it not good practice to push data into Elasticsearch in real time?

You can index tons of data into Elasticsearch very rapidly. Just not rapid updates, in my experience.


Hi guys, I debugged and found the cause. I was heavily using script conditions as well as update with upsert. So I removed the script conditions, kept the update query with upsert, and performance improved. But I am still facing version conflicts when running multiple threads. Any idea how I can tackle this?
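For context, after removing the script each bulk action now looks roughly like this; `retry_on_conflict` is something I am considering for the version conflicts (a sketch, with placeholder names and values):

```php
// Sketch: plain doc_as_upsert action with retry_on_conflict, which re-fetches and
// re-applies the update on a version conflict ($id and $doc are placeholders).
$body = [];
$body[] = [
    'update' => [
        '_index'            => 'my_index',
        '_type'             => 'doc',
        '_id'               => $id,
        'retry_on_conflict' => 3,
    ],
];
$body[] = ['doc' => $doc, 'doc_as_upsert' => true];
```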

Hi guys, I found a solution which helped me increase my performance. You can check it here: https://gist.github.com/ashishtiwari1993/004a19f4a44efc214403a7fc1ee27cda#challenge-1-
