ES index rate drops every few hours

Hi,

We are observing a drop in the ES index rate every few hours. Our indices are created on a daily basis and data is pushed into ES from Logstash. In a day we expect around 8 billion documents to be pushed to that day's index. At times we see an index rate close to 600,000/s across all shards and 350,000/s for primary shards, and then after some time it drops drastically to 22,000/s. Any pointers on what I can check to find the cause of the drop in index rate every few hours?

Thanks

Which version of Elasticsearch are you using?

How many indices and shards are you actively indexing into?

Is there any pattern around when you see the maximum throughput and when it is lower, e.g. starts out fast and gets gradually slower over time?

Are you just indexing new data or also performing updates and/or deletes?

Are you using a custom ID or allowing Elasticsearch to set the ID?

What type of storage are you using? Local SSDs?

Please find the answers below.

Which version of Elasticsearch are you using? 7.17.8

How many indices and shards are you actively indexing into? We are currently indexing into 9 indices created from 3 index templates. Indices from template 1 have 30 shards and 1 replica, indices from template 2 have 15 shards and 1 replica, and indices from template 3 have 10 shards and 1 replica.

Is there any pattern around when you see the maximum throughput and when it is lower, e.g. starts out fast and gets gradually slower over time? When the index rate drops drastically, I can see I/O utilization touching 100% on the busy nodes (see the node stats check at the end of this post).

Are you just indexing new data or also performing updates and/or deletes? We are indexing new data.

Are you using a custom ID or allowing Elasticsearch to set the ID? We are using custom IDs.

What type of storage are you using? Local SSDs? We are using HDDs.

One more pointer: our shard size is greater than the recommended 20-50 GB shard size. Our per-shard data is usually around 100 GB. Not sure whether that helps.
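Regarding the 100% I/O utilization mentioned above, one way to cross-check it from within Elasticsearch (rather than iostat on the hosts) is the node stats API, which on Linux exposes per-device I/O counters. This is just a sketch; node and device names will differ per cluster:

GET _nodes/stats/fs?filter_path=nodes.*.name,nodes.*.fs.io_stats

Comparing the read/write operation counters between the busy and quiet nodes over time should show whether the rate drops line up with disk saturation.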

How many nodes performing indexing do you have in the cluster?

This seems like a lot of shards to index into. It is possible that your bulk requests are being split into a lot of small writes, which will result in higher I/O.

If you are using custom IDs, each indexing operation will need to be treated as a potential update, and a search through the shard has to be performed before the data can be written. This results in higher I/O and tends to lead to indexing throughput slowing down as shards grow in size.

I would not be surprised if the busy nodes are busy merging, which can use a lot of I/O for larger shards. Indexing in general is very I/O intensive in Elasticsearch, which is why the docs recommend using SSDs. Writing to a lot of shards using custom IDs does not help either, especially as you are on HDDs. If you are not updating data and your data is immutable, I would recommend letting Elasticsearch assign the document IDs, as this will likely improve your throughput and reduce I/O load.
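To illustrate the difference, here is a minimal bulk sketch (the index name is a placeholder). The first document supplies its own _id, so Elasticsearch first has to look that ID up in the shard; the second omits the _id and gets an auto-generated one, which is a pure append:

POST _bulk
{ "index": { "_index": "my-daily-index", "_id": "evt-000123" } }
{ "message": "written with a client-supplied ID" }
{ "index": { "_index": "my-daily-index" } }
{ "message": "written with an auto-generated ID" }

If your Logstash elasticsearch output currently sets the document_id option, removing that setting is what switches you to auto-generated IDs (assuming you do not need the IDs for deduplication or for later updates/deletes).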

While running _cat/tasks?detailed I could see that the top tasks are related to forcemerge and bulk writes, with the information shown below. Does the information below indicate that the merge operation is taking a long time?

indices:admin/forcemerge transport 1h Force-merge indices max-segments[3] onlyexpungedeletes[false] flush[true]

indices:data/write/bulk transport 19m requests[5000] indices[index name1,indexname2]

indices:data/write/bulk transport 19m requests[84] indices[index name1] [8]
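To quantify the merge activity on the current day's index, the merge stats and the force_merge/write thread pools can also be checked. A sketch, with the index name as a placeholder:

GET my-daily-index/_stats/merge
GET _cat/thread_pool/force_merge,write?v&h=node_name,name,active,queue,rejected

In the first response, merges.current shows how many merges are in flight and merges.total_time_in_millis how much time has been spent merging; queued or rejected write threads in the second would line up with the drop in index rate.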

Our index holds TBs of data for the current day, and even with the current number of primary shards our shard size is greater than the recommended 50 GB limit. Using fewer shards would increase the shard size further.
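For reference, the actual per-shard sizes can be listed with the _cat shards API (the index pattern is a placeholder), sorted largest first so it is easy to see how far above the 50 GB guideline the primaries are:

GET _cat/shards/my-daily-index-*?v&h=index,shard,prirep,store&s=store:desc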

What is the size and specification of your cluster?

I see that you seem to be performing forcemerges. Is your data truly immutable or do you perform updates and/or deletes?

99% of operations index new data; only in a few scenarios do we update or delete. I believe this forcemerge is done by ES to merge small segments.

Elasticsearch merges segments in the background. A forcemerge does, as far as I know, have to be invoked through the APIs. If you have ILM configured, I believe it can be configured to perform a forcemerge, but otherwise it is probably invoked from your application at some point.
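For reference, this is roughly what the two variants look like. A sketch with placeholder names, and in either case a forcemerge should only target indices that are no longer being written to:

POST my-old-daily-index/_forcemerge?max_num_segments=1

PUT _ilm/policy/daily-logs
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "1d",
        "actions": {
          "forcemerge": { "max_num_segments": 1 }
        }
      }
    }
  }
}

The first is the direct API call that a tool like Curator or your own application would issue; the second is an ILM policy that only force merges an index once it has rolled into the warm phase, i.e. typically after indexing into it has stopped.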

Just checked: yes, we have Curator running forcemerges. As we are lagging behind, the forcemerge started on an index that write operations are still going into. Do you think cancelling the force merge task using this API (POST _tasks/aitueURTbdu58VeiohTt8A:12345/_cancel) will have repercussions on the running index? It is consuming resources, which is causing the drops in index rate.
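For reference, the running forcemerge task can be listed and then cancelled like this (the task ID is the one from our cluster; the cancellable field in the first response shows whether the cancel request will actually take effect):

GET _tasks?detailed=true&actions=indices:admin/forcemerge
POST _tasks/aitueURTbdu58VeiohTt8A:12345/_cancel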
