I have set up a cluster which indexes ~160GB of data per day into Elasticsearch. I am currently facing a case where I need to update almost all the docs in all indices with a small amount of data (~16GB per index), which is of the format
My update operations start at 16000 lines per second, but within 5 minutes the rate drops to 1000 lines per second and doesn't go up after that. The way it stands now, this update takes longer than my entire indexing process for one day.
My conf file for the update operation currently looks as follows
Setting "indices.store.throttle.type" : "none"
Index "refresh_interval" : "-1"
I am running my cluster on 4 d2.8xlarge EC2 instances and have allocated a 30GB heap to each node. While the update is happening, node CPU is barely used and the load is very low as well.
Is there something very obvious that I am missing that is causing this issue? Looking at the threadpool data, I find that the number of threads working on bulk operations is constantly high.
Any help on this issue would be really appreciated! Please let me know if you need more info.
Updates require Elasticsearch to first find the document and then overwrite it, which tends to get slower the larger the shards get. If you are reindexing and updating all the data in the index, it may make sense to write the updated data to a new index and then delete the original one once this has completed.
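As a rough sketch of the final swap, assuming you replace each daily index with a freshly written one (the index and alias names below are placeholders): once the updated data has been fully indexed into the new index, you can delete the original and point an alias at the replacement so searches keep working against the old name:

# placeholder names: logs-2017.02.01 is the original index, logs-2017.02.01-v2 the rewritten one
curl -XDELETE 'localhost:9200/logs-2017.02.01'
curl -XPOST 'localhost:9200/_aliases' -H 'Content-Type: application/json' -d '
{
  "actions" : [
    { "add" : { "index" : "logs-2017.02.01-v2", "alias" : "logs-2017.02.01" } }
  ]
}'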
I am running Elasticsearch 5.1 on all the nodes. Each index has 8 shards and 1 replica, so the size of one shard comes to ~20GB.
I agree that updates involve first finding the document and then overwriting it, which could potentially make the system very slow. However, the update initially starts at a very fast rate and then slows down tremendously.
I also ran 2 separate Logstash instances to simultaneously update different indices. This causes the writes to happen at 500 lines per second per instance, and my cluster nodes are barely under any load or using any CPU.
That looks quite good. I wonder if Logstash could be the bottleneck? What does the rest of your configuration look like? Which version of Logstash and Elasticsearch are you on?
I am running Logstash 5.2 and Elasticsearch 5.1. I was actually hoping that the bottleneck would be Logstash, as that would be a lot easier to fix.
But then when I run the exact same Logstash configuration with

output {
  null {}
}

the Logstash process works at 36000 lines per second, so it doesn't seem to be the bottleneck here. Or is there something I might be overlooking?
Also, one thing I missed mentioning is that if I run this against an empty index where it upserts documents, the same issue occurs and it runs at almost the same speed. However, if the action is changed from "update" to "index", the speed picks up and it runs at 10000 lines per second.
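To make the difference concrete, the update case corresponds to an elasticsearch output roughly like the sketch below (hosts, index name and the id field are placeholders rather than my exact values); changing action from "update" to "index" is the only thing that makes it fast:

output {
  elasticsearch {
    hosts         => ["localhost:9200"]   # placeholder host
    index         => "logs-2017.02.01"    # placeholder index name
    document_id   => "%{doc_id}"          # placeholder field carrying the document id
    action        => "update"             # switching this to "index" runs at ~10000 lines per second
    doc_as_upsert => true                 # insert the document if it does not exist yet
  }
}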
Have you tried increasing the internal batch size within Logstash (maybe to 1000?) to see if this has an impact? Does performance change if you set the refresh interval to 10s instead of -1?
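For example, a quick way to test the batch size is the -b flag when starting Logstash, and the refresh interval can be changed through the same index settings API you used to disable it (the file and index names here are placeholders):

# -b sets the pipeline batch size; update.conf is a placeholder pipeline file
bin/logstash -b 1000 -f update.conf

# periodic refresh instead of disabling it entirely
curl -XPUT 'localhost:9200/my-index-2017.02.01/_settings' -H 'Content-Type: application/json' -d '
{
  "index" : { "refresh_interval" : "10s" }
}'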
As I said in my initial response, indexing into a new index may be considerably faster than updating the existing one.
I've tried all of those and none of them makes any noticeable difference; I've played around with everything to the maximum.
I'm actually working on doing that right now: I'm merging my data locally and indexing it again. It is definitely much faster than doing the update this way.