We're using deleteByQuery to remove a large number of records from Elasticsearch, with a query similar to this:
let query = {
  "query": {
    "bool": {
      "must_not": {
        "ids": {
          "values": all_ids
        }
      }
    }
  }
}
let result = await es.deleteByQuery({index: config.index, body: query})
Unfortunately, every time this query runs our Elasticsearch cluster gets hammered and query times spike temporarily. We run this query about every 20 minutes.
Other than throwing hardware at the problem (our cluster normally operates at around 20% CPU utilization), is there something we can do to make this query less intensive?
We've tried tweaking the deleteByQuery parameters to make it less intensive, with no real result:
let result = await es.deleteByQuery({index: config.index, scroll_size: 500, requests_per_second:250, body: query})
let result = await es.deleteByQuery({index: config.index, requests_per_second: 500, body: query})
let result = await es.deleteByQuery({index: config.index, slices: 'auto', conflicts: 'abort', body: query})
Disabling the delete definitely stops the spikes. Can anyone offer suggestions on how to make this less bursty, basically slowing down the deletions so they don't spike Elasticsearch response times so badly?
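For completeness, here is roughly what an asynchronous, throttled version would look like. This is just a sketch: it assumes the @elastic/elasticsearch 7.x client, and the requests_per_second numbers are placeholders, not recommendations.

// Sketch only: run the delete-by-query as a background task with a hard throttle.
const { body: started } = await es.deleteByQuery({
  index: config.index,
  wait_for_completion: false,  // return a task id instead of blocking
  requests_per_second: 100,    // throttle the delete batches (placeholder value)
  scroll_size: 500,            // smaller batches per scroll
  conflicts: 'proceed',        // skip version conflicts rather than aborting
  body: query
})

// Check progress, and tighten the throttle if response times still spike.
const { body: status } = await es.tasks.get({ task_id: started.task })
console.log(status.task.status)  // batches, deleted, throttled_millis, ...

await es.deleteByQueryRethrottle({
  task_id: started.task,
  requests_per_second: 50        // placeholder value
})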
Thanks, I figured out that DBQ = deleteByQuery. I'm reluctant to throw more hardware at the problem given that we operate at 20% CPU usage, and memory usage seems fine the majority of the time. It just seems like a band-aid, but I'm not sure if there's a better way to handle bulk deletes. Thanks for the advice.
Why are you deleting the data? Does it need to be deleted?
The data being stored in the index is profile data from a dating site. The data is stored primarily in MSSQL and then replicated from MSSQL to Elasticsearch via a synchronization tool written in NodeJS. Whenever a user on one of our sites deactivates or deletes their profile, that data needs to be removed from Elasticsearch, or deleted profiles will still show up on the website (which would make users very unhappy, so yes, it does need to be deleted).
What determines what needs to be deleted? Value / Time etc...
If I am understanding correctly, the current process obtains a list of primary keys (IDs) from MSSQL and then compares them to the equivalent ID fields in the appropriate index in Elasticsearch. Any Elasticsearch documents whose IDs are not in that list get deleted. Essentially it's syncing MSSQL to Elasticsearch.
Obviously doing a record-by-record comparison between the two is "expensive" in terms of CPU and memory. I suspect a less intensive strategy would be to add a field in MSSQL to track when a profile was deleted, have the NodeJS application read the last time it ran from a file, grab any newly deleted records, and target just those for deletion.
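For example, something along these lines (just a sketch: the getProfilesDeletedSince() helper, the DeletedAt column it would read, and the last-run bookkeeping helpers are all hypothetical, and it assumes the same @elastic/elasticsearch client as above):

// Sketch: delete only the profiles MSSQL says were removed since the last run.
const lastRun = await readLastRunTimestamp()               // e.g. read from a file (hypothetical helper)
const deletedIds = await getProfilesDeletedSince(lastRun)  // query MSSQL's DeletedAt column (hypothetical helper)

if (deletedIds.length > 0) {
  // One bulk request with explicit delete actions, instead of a delete-by-query
  // that has to scan the whole index.
  const body = deletedIds.map(id => ({ delete: { _index: config.index, _id: id } }))
  const { body: result } = await es.bulk({ body })
  if (result.errors) {
    console.warn('some deletes failed', result.items.filter(i => i.delete.error))
  }
}
await writeLastRunTimestamp(new Date())                    // hypothetical helper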
Could you / we perhaps devise an indexing strategy that would segment the data into indices that could be deleted?
I don't think so, because it's just random who deletes what. A user can delete their profile at any point, so I don't know how you'd logically group them into indices where the entire index could be deleted.
What kind of storage is this data on?
Solid-state storage on a Dell Unity SAN connected via iSCSI over 10GbE networking. As best as I can tell, I don't think the storage or network are the bottleneck.
Right now we have three master Elasticsearch nodes set up that are doing dual duty as masters and workers. (By the way, this is all running as Kubernetes pods, if that matters.)
We're considering adjusting our cluster to have standalone masters and dedicated workers. I believe that may be best practice, although from reading through the Helm chart documentation it looks fairly complicated to set up. And except when this query runs, resource usage (CPU/memory) and response times look perfectly reasonable.
I have some questions too.
What is the approximate size of all_ids?
How does the search query perform on its own, without the delete?
How many documents are deleted by a single DBQ query on average?
If the delete is heavy, is it possible to raise the frequency so that fewer documents need to be deleted per query?
How does the search query perform on its own, without the delete?
Actually, looking at our logs, it seems there's a problem and it's not actually doing any deletes. I'm not sure how nobody has screamed about this sooner. It's getting 1.2 million records from MSSQL, comparing them to 1.2 million records in Elasticsearch, and doing zero deletions. That will need to be fixed (but fixing it will likely just exacerbate the problem once it's actually doing deletions).
How many documents are deleted by a single DBQ query on average?
My impression is that this is a per-index setting. Looking at the index settings, I don't see max_terms_count set, so I assume we are using the default. I can try reducing it. Any recommendations about what to set it to?
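In case it helps, here is roughly how I would check (and, if needed, change) that setting with the @elastic/elasticsearch client. This is just a sketch, and the value shown is an example only, not a recommendation:

// Sketch: inspect the current value (include_defaults exposes the 65536 default
// when nothing is set explicitly, which seems to be our case), then override it
// per index if needed (max_terms_count is a dynamic index setting).
const { body: settings } = await es.indices.getSettings({
  index: config.index,
  include_defaults: true
})
console.log(settings[config.index].defaults.index.max_terms_count)

await es.indices.putSettings({
  index: config.index,
  body: { index: { max_terms_count: 100000 } }  // example value only
})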
There is no need to change the default. But there is a contradiction: the size of all_ids is about 1.2M, while the maximum limit is 65,536 terms. The query should raise an exception above that limit. Is 1.2M only the document count, and is the cardinality of the ids much smaller? This difference could have a big impact on this issue.
I tried and got an exception.
"root_cause" : [
{
"type" : "query_shard_exception",
"reason" : "failed to create query: The number of terms [4] used in the Terms Query request has exceeded the allowed maximum of [3]. This maximum can be set by changing the [index.max_terms_count] index level setting.",
"index_uuid" : "dS10vWDtRfmtY7MOHOac8g",
"index" : "test_max_terms_limit"
}
],
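(For context, this is roughly how such a test could be set up; the exact client calls are assumed, but the index name and the limits match the error above: create an index with index.max_terms_count set to 3, then send an ids query with 4 values.)

// Sketch of the reproduction, assuming the @elastic/elasticsearch 7.x client.
await es.indices.create({
  index: 'test_max_terms_limit',
  body: { settings: { index: { max_terms_count: 3 } } }
})

try {
  await es.search({
    index: 'test_max_terms_limit',
    body: { query: { ids: { values: ['1', '2', '3', '4'] } } }
  })
} catch (e) {
  // The client surfaces the query_shard_exception shown above.
  console.log(JSON.stringify(e.meta.body.error, null, 2))
}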
There is no need to change the default. But there is a contradiction: the size of all_ids is about 1.2M, while the maximum limit is 65,536 terms. The query should raise an exception above that limit. Is 1.2M only the document count, and is the cardinality of the ids much smaller? This difference could have a big impact on this issue.
I am struggling to understand your question/statement but maybe some additional information will help.
From the MSSQL side when I issue this command: SELECT count(id) FROM PROFILES;
I get 1219391 records.
Likewise if I look at the Elasticsearch index I see:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open p2_profiles_v002 _vMPWFr9R3-fVJKQ_a-2oA 5 2 1025139 1601054 9.3gb 3.2gb
So in regard to your question about cardinality, I think there are roughly 1 million records being compared?
I see no errors in the Elasticsearch error logs for the pods.
I asked about cardinality because there is a possibility that one id has multiple documents, and you may be able to deduplicate them before querying.
Is it possible to debug on the client and get the actual length of all_ids? If it is truly 1.2M in length and you have not changed the max_terms_count setting from its default of 65,536, ES should return an error for the DBQ query. If there were no errors in ES, something is wrong. We need to clarify this before going further.
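A quick sketch of what that client-side check could look like, right before the deleteByQuery call:

// Sketch: log how many ids are actually sent, and how many are distinct.
console.log('all_ids length:', all_ids.length)
console.log('distinct ids:', new Set(all_ids).size)
// If the length really exceeds index.max_terms_count (default 65536),
// the delete-by-query should fail with an error rather than run silently.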