Async delete_by_query in Elasticsearch spawns never-ending tasks

We have a lot of data in our index, tagged with an ownerId that represents a customer. Every customer can define its own retention period. After the retention period has passed, all documents older than that period should be deleted.

There is a daemon process that triggers the deletion of old data on a daily basis. This is done through a delete_by_query containing the ownerId and the time range for which data should be deleted. The query is pushed to our Elastic cluster asynchronously because we do not want to wait for task completion (deletion can take from 1 minute up to a couple of hours, depending on the case).

We now encountered a case where multiple queries were fired with the same ownerId (we cannot guarantee that this doesn't happen). This results in multiple tasks in Elastic, one per query. Because the same query is executed within the same time frame, some of the tasks start producing a lot of version conflicts, probably because the documents are already flagged for deletion by a previous task. At the same time we see our network being congested with API calls between the nodes of the cluster.

So far we have not found a way to stop or cancel these tasks; they just keep hogging bandwidth, causing other services to time out. The only way out was to reindex the impacted index, delete it, and then reindex again.

We have played around with the conflicts and refresh parameters, but nothing resulted in proper deletion of the documents and clean-up of the spawned tasks in Elastic. The only thing that worked was setting the conflicts parameter to abort on conflict, though this aborts the whole task instead of just skipping the conflicting documents. Proceed on conflict also caused the requests to fail and hang the cluster once we bumped up the number of identical queries.

We have tried to use ILM and rollover in the past but eventually reached a situation where we had thousands of shards which heavily impacted performance.

So we are left with 2 questions:

  • How do you properly delete a lot of data by query when you cannot guarantee that the queries don't overlap?
  • Why is Elasticsearch not able to handle multiple delete_by_query tasks without getting stuck in a perpetual loop of spawning new tasks over and over?

The Elasticsearch version used is 6.8.

Simplified code snippet in Go using gopkg.in/olivere/elastic.v6:

// Fires many identical async delete_by_query requests;
// this is what reproduces the runaway tasks described above.
for i := 0; i < 1000; i++ {
    query := elastic.NewBoolQuery().Must(
        elastic.NewTermQuery("ownerId", ownerId),
        elastic.NewRangeQuery("timestamp").Lte(to).Gte(from),
    )
    // DoAsync starts a task and returns immediately
    // (wait_for_completion=false); we never poll or cancel it.
    _, err := elastic.NewDeleteByQueryService(esClient).
        Index("my-index").
        Query(query).
        ProceedOnVersionConflict().
        Refresh("true").
        DoAsync(context.Background())
    if err != nil {
        return err
    }
}

Pretty sure you asked this on SO, and my answer is the same as there unfortunately.

DBQ is really not designed to handle this kind of multiple, similar requests against the same index. If you use ILM to manage your indices+shards, it should mitigate the issues there.

Using DBQ this way is inefficient compared to deleting data by dropping whole indices. If you still need to do this and cannot have an index per user, I would recommend grouping users into different indices to lower the DBQ concurrency. I would also recommend queueing up delete tasks so they are processed in sequence, rather than launching DBQ asynchronously without coordination.


How would you address the sharding issue then? We already have 9 nodes running and everything I found about ES shards says you should try to stay under 500 shards because this can heavily impact performance. We came from over 2k shards in the past before merging the indices.

Grouping customers together by their retention period is not an option either; you probably also know that customers change their minds every week.

Per what, though? The advice you will see here is <700-800 shards per node on a full <32GB heap.

Ultimately you need to pay some kind of price here:

  1. Simplify your DBQ use so that it's only ever single-threaded requests, which means managing that outside Elasticsearch (as it cannot do that for you)
  2. Put customers into their own indices and then manage the overhead there (ILM will help)
  3. Have set retention periods, e.g. weekly, fortnightly, monthly, etc., and then let customers pick a specific period from those, then update ILM based on that