Allowing long queries to complete even if indexes are shrinking


I have a use case with up to two billion events a day injected into Elasticsearch. I optimized by creating one index per hour (4 primary shards with 1 replica each) and setting an index lifecycle policy that shrinks and force-merges the indices once they are 3 hours old. This seems to work well.
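For reference, the lifecycle described above could look roughly like the following ILM policy. This is a sketch, not my exact configuration; the policy name is made up, and I'm using the `max_primary_shard_size` form of the shrink action to express the 50GB-per-shard target:

```
PUT _ilm/policy/hourly-events-policy
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "3h",
        "actions": {
          "shrink": { "max_primary_shard_size": "50gb" },
          "forcemerge": { "max_num_segments": 1 }
        }
      }
    }
  }
}
```

The shrink action creates a new, smaller index and then swaps it in place of the original, which is why the original's shards disappear from under long-running queries.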

However, some very long queries that span the shrinking indices fail with shard errors. I suppose those shards no longer exist because they were destroyed: when the query began the index existed, but while the query was running the index finished shrinking and the original was deleted.

Is there a way to delay the index deletion until all running queries are complete? Alternatively, is there an automated way to stop the index from being targeted by new queries, so that only the already-running queries still see it and ignore its shrunken counterpart?



Why are you shrinking indices? I would recommend using rollover and writing to the index until the shards reach a suitable size, so you do not need to shrink them at all. Generating an index per hour seems wasteful and could be problematic if you have a long retention period.

Well typically, one hour is enough to hit the 50GB mark. When shrunk (the criterion being 50GB per shard), the indices typically require 1 or 2 shards, which is not much per day considering the gigantic amount of data we have.

I'll take another look at rollover. Last time I checked, I couldn't even delete an index because it was part of a data stream or something like that, and it's a bit of a bummer not to be able to manipulate indices one by one.

If I have a one-shard index, can it handle the same write throughput as a 4-shard one? If so, can I make a rollover rule that says: if the index's shard size is above 50GB, or if it's midnight, it's time to roll?

The benefit of rollover is that you can target a specific shard size. Each index can have multiple primary shards, and it is common to match the shard count to the number of hot nodes. The tradeoff is that rollover will not be aligned with specific times. Why do you need to roll over exactly at midnight?
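To sketch what I mean (policy name and values are illustrative, not a definitive recommendation): rollover conditions are OR'd together, so you can combine a shard-size condition with a time condition, though `max_age` is measured from index creation rather than aligned to midnight:

```
PUT _ilm/policy/events-rollover-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          }
        }
      }
    }
  }
}
```

Whichever condition is met first triggers the rollover, so shards should stay near the 50GB target without any shrink step afterwards.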


So the overall goal is to have one month of retention with as few shards as possible.

Considering the performance of the cluster and the throughput, we need at least three shards being written to in parallel.

If we apply the 50GB limit, each hour ends up as 1 or 2 shards, never 3. So we can reduce the number of shards easily by shrinking.

So at any given time we have 2 to 3 shards being written simultaneously, but only 1 to 2 need to be kept.

About rolling at midnight: there is no hard reason, it's just more practical operationally if we have to maintain the indices (reindex, delete, etc.).

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.