Hi guys,
Here is my configuration:
ES version = 6.2
JVM = 30 GB
RAM = 128 GB
CPU = 24 cores
SDK = PHP
My bulk updates are too slow. I tried the suggested solutions, but none of them worked. Any suggestions would be appreciated. Thanks.
How are you updating the documents? What performance are you seeing? What is the size and structure of your documents? How frequently is an individual document updated?
I am doing bulk updates of 500 documents at a time. Each document is approximately 32 KB. I have a queue system, so I wrote a daemon that continuously pops the queue in batches of 500 and sends the updates to ES.
How many parallel update threads are you running? Are you using nested documents or parent-child? How many updates do you get per second?
20 PHP child processes are running. Both nested documents and the main document are being updated. I am also using scripts. Most updates use 'doc_as_upsert'. With a batch size of 50, it reaches 4 bulk API requests/sec, which means 50*4 = 200 documents updated per second. I would like to update approximately 5k per second.
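The arithmetic in the post above, written out as a small Python sketch. The numbers come from the post; the function and variable names are only illustrative.

```python
def docs_per_second(batch_size: int, bulk_requests_per_second: float) -> float:
    """Documents updated per second for a given bulk batch size and request rate."""
    return batch_size * bulk_requests_per_second


# Figures reported in the post: batches of 50 at 4 bulk requests/sec.
observed = docs_per_second(batch_size=50, bulk_requests_per_second=4)
target = 5000  # desired updates per second

# How many times faster the pipeline would have to be to reach the target.
shortfall_factor = target / observed

print(f"observed: {observed:.0f} docs/sec")
print(f"target is {shortfall_factor:.0f}x the observed rate")
```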
What does CPU usage, disk I/O and iowait look like on the Elasticsearch host? How many indices and shards are you actively updating?
'iostat' shows:
Linux 3.10.0-693.11.1.el7.centos.plus.x86_64
avg-cpu: %user %nice %system %iowait %steal %idle
2.71 0.48 0.43 0.02 0.00 96.37
Do you see anything in the Elasticsearch logs, e.g. around slow GC or merging falling behind?
How frequently is an individual document updated? => two updates every 2 sec per doc
How many indices and shards are you actively updating? => 30 indices with 10 shards and 1 replica
I am not seeing anything in the logs related to GC or merging. Should I increase the number of threads, since the CPU is 96% idle? Or is there some other configuration I should change?
If you update a document that is still present in the transaction log and has not yet been written to a segment, this will trigger a refresh. If you're frequently updating the same documents, this will hurt performance as a refresh is an expensive operation.
How many nodes do you have in the cluster? How much data do you have in total?
Yes, I tried disabling/increasing/decreasing refresh_interval (0/-1/30), but without any success. I have a single-node cluster. Right now I have approximately 10 million documents, and that number keeps increasing.
As you are updating documents quite frequently, I do not necessarily think a longer refresh interval will help at all. It may actually be better to leave it at the default 1 second.
Hmm, okay. So will adding a node or increasing the number of threads solve this problem? Or is there an important configuration setting that I am missing?
I do not have a lot of experience of update intensive use cases, so am not sure how to best optimise this. Let's see if someone else chimes in.
Hi, I just noticed something in my log: 20 child processes sending simultaneously manage 4 bulk requests/sec, and 1 process on its own also manages 4 bulk requests/sec. I think ES is not handling multiple connections for bulk. Is there any configuration/setting for the bulk API?
Man, I do, and it wasn't a good experience!
If you are updating the same document multiple times per second, and you have many threads of execution updating documents, you will bury Elasticsearch. Occasional bursts of updates: fine. Constant low-frequency updates: fine. Constant high-frequency updates: not fine. And it gets less fine the larger the doc size.
I would suggest using a different approach to handle the frequent updates. Perhaps an in-memory cache that periodically (i.e., once every minute or two) updates a record in ES.
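One way to picture the in-memory cache suggested above is a small buffer that merges all pending changes per document and emits them as a single bulk request body at intervals. The Python below is a minimal sketch under several assumptions: the class name `UpdateBuffer`, the index name `products`, and the `_doc` type are all hypothetical, the merge is shallow (so it would not handle nested documents correctly), and actually sending the body to the `_bulk` endpoint is left out.

```python
import json
import time
from typing import Any, Callable, Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (index name, document id)


class UpdateBuffer:
    """Coalesces partial updates per document and emits them as one bulk body."""

    def __init__(self, flush_interval_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.flush_interval_seconds = flush_interval_seconds
        self._clock = clock
        self._pending: Dict[Key, Dict[str, Any]] = {}
        self._last_flush = clock()

    def add(self, index: str, doc_id: str, partial_doc: Dict[str, Any]) -> None:
        # Shallow merge: later values for a field overwrite earlier ones, so
        # many updates to one document collapse into a single bulk action.
        self._pending.setdefault((index, doc_id), {}).update(partial_doc)

    def due(self) -> bool:
        """True once the flush interval has elapsed since the last flush."""
        return self._clock() - self._last_flush >= self.flush_interval_seconds

    def flush(self) -> Optional[str]:
        """Return an NDJSON bulk body for all pending updates, or None if empty."""
        self._last_flush = self._clock()
        if not self._pending:
            return None
        lines: List[str] = []
        for (index, doc_id), doc in self._pending.items():
            # Assumption: the mapping type is "_doc"; substitute your own type.
            action = {"update": {"_index": index, "_type": "_doc", "_id": doc_id}}
            lines.append(json.dumps(action))
            lines.append(json.dumps({"doc": doc, "doc_as_upsert": True}))
        self._pending.clear()
        return "\n".join(lines) + "\n"  # the bulk body must end with a newline


# Two updates to the same (hypothetical) document become one bulk action.
buffer = UpdateBuffer(flush_interval_seconds=60)
buffer.add("products", "42", {"views": 10})
buffer.add("products", "42", {"views": 11, "stock": 3})
body = buffer.flush()
```

A daemon would call `add` for every message it pops from the queue and call `flush` whenever `due()` returns true, so Elasticsearch sees at most one update per document per interval. The trade-off is that changes become visible later and anything still in the buffer is lost if the process dies.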
Thanks for the reply, loren. So is there any way I can update in real time? I have 2k to 3k updates per sec. How can I handle this in real time? Or is it not good practice to push data into Elasticsearch in real time?
You can index tons of data into Elasticsearch very rapidly. Just not rapid updates, in my experience.
Hi guys, I debugged and found the cause. I was making heavy use of script conditions together with update with upsert. I removed the script conditions, kept the update query with upsert, and performance improved. But I am still facing version conflicts when running multiple threads. Any idea how I can tackle this?
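One possible way to avoid conflicts between workers, offered here as an assumption rather than as what the poster ended up doing, is to route every document ID to a fixed worker so that no two workers ever update the same document concurrently. The Python sketch below uses hypothetical names (`worker_for`, `partition`) and the worker count of 20 from earlier in the thread. Elasticsearch's update API also has a `retry_on_conflict` parameter, which is a separate option not shown here.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Update = Tuple[str, dict]  # (document id, partial document)


def worker_for(doc_id: str, worker_count: int) -> int:
    """Map a document ID to a fixed worker number.

    crc32 gives the same result in every process, unlike Python's built-in
    hash() for strings, which is randomized per process by default.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % worker_count


def partition(updates: Iterable[Update], worker_count: int) -> Dict[int, List[Update]]:
    """Group updates by worker, preserving arrival order within each worker."""
    batches: Dict[int, List[Update]] = defaultdict(list)
    for doc_id, partial_doc in updates:
        batches[worker_for(doc_id, worker_count)].append((doc_id, partial_doc))
    return dict(batches)


# Hypothetical queue contents: both updates to "doc-1" land on the same worker.
queued = [("doc-1", {"views": 1}), ("doc-2", {"views": 5}), ("doc-1", {"views": 2})]
batches = partition(queued, worker_count=20)
```

Because all updates for a given ID are applied in order by a single worker, two bulk requests in flight at the same time never touch the same document. The cost is that load can be uneven if a few documents receive most of the updates.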
Hi guys, I found a solution that helped me improve my performance. You can check it here: https://gist.github.com/ashishtiwari1993/004a19f4a44efc214403a7fc1ee27cda#challenge-1-