We have a lot of data in our index that is tagged with an ownerId, which represents a customer. Each customer can define its own retention period; once that period has passed, all documents older than it should be deleted.
A daemon process triggers the deletion of old data on a daily basis. It does this through a delete_by_query containing the ownerId and the time range for which data should be deleted. The query is pushed to our Elastic cluster asynchronously because we do not want to wait for task completion (deletion can take from one minute up to a couple of hours, depending on the case). A simplified Go snippet is included at the end of this post.
We have now encountered the case where multiple queries were fired with the same ownerId (we cannot guarantee that this doesn't happen). This results in multiple tasks in Elastic, one per query. Because the same query is executed in the same time frame, some of the tasks start producing a lot of version conflicts, probably because the documents are already flagged for deletion by a previous task. At the same time we see our network being congested with API calls between the nodes within the cluster.
So far we have not found a way to stop or cancel these tasks; they just keep hogging bandwidth, causing other services to time out. The only way out was to reindex the impacted index, delete it, and then reindex again.
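For reference, a cancellation attempt through the tasks API would look roughly like this. This is a minimal sketch using the same olivere client; the function name is ours, and we assume `TasksCancel` accepts the `nodeId:taskNumber` key string returned by the tasks list (please verify against your client version):

```go
import (
	"context"

	elastic "gopkg.in/olivere/elastic.v6"
)

// cancelDeleteByQueryTasks asks the cluster to cancel every running
// delete_by_query task. Assumption: TasksCancel takes the
// "nodeId:taskNumber" key as returned by the tasks list.
func cancelDeleteByQueryTasks(ctx context.Context, esClient *elastic.Client) error {
	list, err := esClient.TasksList().
		Actions("indices:data/write/delete/byquery"). // delete_by_query tasks only
		Do(ctx)
	if err != nil {
		return err
	}
	for _, node := range list.Nodes {
		for id, task := range node.Tasks { // id is "nodeId:taskNumber"
			if !task.Cancellable {
				continue
			}
			if _, err := esClient.TasksCancel().TaskId(id).Do(ctx); err != nil {
				return err
			}
		}
	}
	return nil
}
```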
We have played around with the `conflicts` and `refresh` parameters, but nothing resulted in proper deletion of the documents and clean-up of the spawned tasks in Elastic. The only thing that worked was setting the `conflicts` parameter to `abort`, though this aborts the whole task instead of just skipping the conflicting documents. `conflicts=proceed` also caused the requests to fail and hung the cluster once we bumped up the number of identical queries.
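For reference, the two `conflicts` settings map onto the Go client like this (short sketch; `query` and `esClient` as in the snippet at the end of this post):

```go
// conflicts=abort: the task fails at the first version conflict.
abortSvc := elastic.NewDeleteByQueryService(esClient).
	Index("my-index").
	Query(query).
	AbortOnVersionConflict()

// conflicts=proceed: conflicting documents are skipped and only counted
// in the task's "version_conflicts" statistic.
proceedSvc := elastic.NewDeleteByQueryService(esClient).
	Index("my-index").
	Query(query).
	ProceedOnVersionConflict()

_, _ = abortSvc, proceedSvc // submitted via Do or DoAsync as usual
```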
We have tried ILM and rollover in the past, but eventually reached a situation with thousands of shards, which heavily impacted performance.
So we are left with two questions:
- How do you properly delete a lot of data by query when you cannot guarantee that the queries don't overlap? (A sketch of the kind of client-side guard we have in mind follows after these questions.)
- Why is Elasticsearch not able to handle multiple delete_by_query tasks without getting stuck in a perpetual loop of spawning new tasks over and over?
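To make the first question concrete, the only client-side mitigation we can think of is a guard like the sketch below: skip the submit when a delete_by_query task for the same owner is already running. Matching the owner against the task description is our own heuristic, not a documented contract, and the function name is hypothetical:

```go
import (
	"context"
	"strings"

	elastic "gopkg.in/olivere/elastic.v6"
)

// deleteByQueryRunningFor reports whether a delete_by_query task whose
// description mentions ownerId is currently running on the cluster.
func deleteByQueryRunningFor(ctx context.Context, esClient *elastic.Client, ownerId string) (bool, error) {
	list, err := esClient.TasksList().
		Actions("indices:data/write/delete/byquery"). // delete_by_query tasks only
		Detailed(true).                               // include the request description
		Do(ctx)
	if err != nil {
		return false, err
	}
	for _, node := range list.Nodes {
		for _, task := range node.Tasks {
			// Heuristic (our assumption): the detailed task description
			// embeds the original query, which contains the ownerId term.
			if strings.Contains(task.Description, ownerId) {
				return true, nil
			}
		}
	}
	return false, nil
}
```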
We are on Elasticsearch 6.8.
Simplified code snippet in Go using gopkg.in/olivere/elastic.v6:
```go
for i := 0; i < 1000; i++ {
	// Select everything for this owner inside the expired time range.
	query := elastic.NewBoolQuery().Must(
		elastic.NewTermQuery("ownerId", ownerId),
		elastic.NewRangeQuery("timestamp").Lte(to).Gte(from),
	)
	// DoAsync submits the delete_by_query as a task
	// (wait_for_completion=false) and returns without waiting for it.
	_, err := elastic.NewDeleteByQueryService(esClient).
		Index("my-index").
		Query(query).
		ProceedOnVersionConflict(). // conflicts=proceed
		Refresh("true").
		DoAsync(context.Background())
	if err != nil {
		return err
	}
}
```