We are observing a drop in the ES index rate every few hours. Indices are created on a daily basis and data is pushed into ES from Logstash. In a day we expect around 8 billion documents to be pushed into that day's index. At times we see an index rate close to 600,000/s across all shards and 350,000/s for primary shards, and then after some time it drops drastically to 22,000/s. Any pointers on what I can check for this drop in index rate every few hours?
Which version of Elasticsearch are you using? 7.17.8
How many indices and shards are you actively indexing into? Indexing is currently going into 9 indices created from 3 index templates. The index from template 1 has 30 shards and 1 replica, the index from template 2 has 15 shards and 1 replica, and the index from template 3 has 10 shards and 1 replica.
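For context, a rough sketch of how the shard settings look in one of the templates (the template name and index pattern here are placeholders, not our real ones):

PUT _index_template/daily-template-1
{
  "index_patterns": ["daily-index-1-*"],
  "template": {
    "settings": {
      "index.number_of_shards": 30,
      "index.number_of_replicas": 1
    }
  }
}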
Is there any pattern around when you see the maximum throughput and when it is lower, e.g. does it start out fast and gradually get slower over time? When the index rate drops drastically, I can see I/O utilization touching 100% on the busy nodes.
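For reference, we check this with the standard node stats and cat APIs (plus iostat on the hosts):

# disk and filesystem stats per node
GET _nodes/stats/fs
# write thread pool queue and rejections per node
GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected
# what the busy nodes are actually spending time on
GET _nodes/hot_threads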
Are you just indexing new data or also performing updates and/or deletes? We are only indexing new data.
Are you using a custom ID or allowing Elasticsearch to set the ID? We are using custom IDs.
What type of storage are you using? Local SSDs? HDD
One more pointer: our shard size is greater than the recommended 20-50 GB shard size. Our per-shard data is usually around 100 GB. Not sure whether that is relevant.
How many nodes performing indexing do you have in the cluster?
That seems like a lot of shards to index into. It is possible that your bulk requests result in a lot of small writes per shard, which leads to higher I/O.
If you are using custom IDs, each indexing operation needs to be treated as a potential update, and a search through the shard has to be performed before the data can be written. This results in higher I/O and tends to make indexing throughput slow down as shards grow in size.
I would not be surprised if the busy nodes are busy merging, which can use a lot of I/O for larger shards. Indexing in general is very I/O intensive in Elasticsearch, which is why the docs recommend using SSDs. Writing to a lot of shards with custom IDs does not help either, given that you are on HDDs. If you are not updating data and your data is immutable, I would recommend letting Elasticsearch assign the document IDs, as this will likely improve your throughput and reduce the I/O load.
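To illustrate the difference (the index name and ID below are placeholders): with the bulk API, an action that specifies _id has to be treated as a possible overwrite of that ID, while an action without _id gets an auto-generated ID and can be appended directly. In Logstash this corresponds to removing the document_id option from the elasticsearch output.

POST _bulk
{ "index": { "_index": "daily-index-1-2023.01.01", "_id": "custom-id-1" } }
{ "message": "custom ID: treated as a potential update of that ID" }
{ "index": { "_index": "daily-index-1-2023.01.01" } }
{ "message": "no _id: Elasticsearch generates one and can append directly" }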
When running GET _cat/tasks?detailed I can see that the top tasks are related to forcemerge and bulk writes, with the information below. Does this indicate that the merge operation is taking a long time?
indices:admin/forcemerge transport 1h Force-merge indices max-segments[3] onlyexpungedeletes[false] flush[true]
indices:data/write/bulk transport 19m requests[5000] indices[index name1,indexname2]
indices:data/write/bulk transport 19m requests[84] indices[index name1] [8]
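For reference, we are also looking at the task list filtered to force merge actions and at the merge stats of the current day's index to see how much time is going into merging (the index name below is a placeholder):

GET _tasks?detailed=true&actions=indices:admin/forcemerge*
GET daily-index-1-2023.01.01/_stats/merge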
As our index holds data in TBs for the current day, even with the current number of primary shards our shard size is greater than the recommended 50 GB limit. Using fewer shards would increase the shard size further.
Elasticsearch merges segments in the background. A forcemerge does, as far as I know, have to be invoked through the API. If you have ILM configured, I believe it can be set up to perform a forcemerge, but otherwise it is probably invoked from your application at some point.
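For comparison, this is roughly what an ILM-driven forcemerge would look like, where it only runs once an index has moved into the warm phase and is no longer being written to (the policy name and timing are just an example):

PUT _ilm/policy/daily-logs-policy
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "1d",
        "actions": {
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      }
    }
  }
}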
Just checked, and yes, we have Curator running for forcemerge. As we are lagging, the forcemerge started on an index that write operations are still going to. Do you think cancelling the force merge task using the task cancel API (POST _tasks/aitueURTbdu58VeiohTt8A:12345/_cancel) will have repercussions on the running indexing? It is consuming resources, due to which we are seeing drops in index rate.
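For reference, this is what we would run: first list the force merge tasks to get the task ID and check whether the task reports cancellable as true (the cancel API only has an effect on cancellable tasks), then issue the cancel:

GET _tasks?detailed=true&actions=indices:admin/forcemerge*
POST _tasks/aitueURTbdu58VeiohTt8A:12345/_cancel

A longer-term option might be to have Curator only force merge indices from previous days that are no longer being written to.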