We're using deleteByQuery to remove a large number of records from Elasticsearch, with a query similar to this:
let query = {
  "query": {
    "bool": {
      "must_not": {
        "ids": {
          "values": all_ids
        }
      }
    }
  }
}
let result = await es.deleteByQuery({index: config.index, body: query})
Unfortunately, every time this query runs our Elasticsearch cluster gets hammered and query times spike temporarily. We run this query about every 20 minutes.
Other than throwing hardware at the problem (our cluster normally operates at around 20% CPU utilization), is there something we can do to make this query less intensive?
We've tried tweaking the deleteByQuery parameters to make it less intensive, with no real result:
let result = await es.deleteByQuery({index: config.index, scroll_size: 500, requests_per_second:250, body: query})
let result = await es.deleteByQuery({index: config.index, requests_per_second: 500, body: query})
let result = await es.deleteByQuery({index: config.index, slices: 'auto', conflicts: 'abort', body: query})
Disabling the delete definitely stops the spikes. Can anyone offer suggestions on how to make this less bursty, basically slowing down the deletions so they don't spike Elasticsearch response times so badly?
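For completeness, here is roughly what an asynchronous, throttled version would look like. This is just a sketch: it assumes the @elastic/elasticsearch 7.x client, and the requests_per_second numbers are placeholders, not recommendations.

// Sketch only: run the delete-by-query as a background task with a hard throttle.
const { body: started } = await es.deleteByQuery({
  index: config.index,
  wait_for_completion: false,  // return a task id instead of blocking
  requests_per_second: 100,    // throttle the delete batches (placeholder value)
  scroll_size: 500,            // smaller batches per scroll
  conflicts: 'proceed',        // skip version conflicts rather than aborting
  body: query
})

// Check progress, and tighten the throttle if response times still spike.
const { body: status } = await es.tasks.get({ task_id: started.task })
console.log(status.task.status)  // batches, deleted, throttled_millis, ...

await es.deleteByQueryRethrottle({
  task_id: started.task,
  requests_per_second: 50        // placeholder value
})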
Thanks, I figured out that DBQ = deleteByQuery. I'm reluctant to throw more hardware at the problem given that we operate at 20% CPU usage, and memory usage seems fine the majority of the time. It just seems like a band-aid, but I'm not sure if there's a better way to handle bulk deletes. Thanks for the advice.
Why are you deleting the data? Does it need to be deleted?
The data being stored in the index is profile data from a dating site. The data is stored primarily in MSSQL and then replicated from MSSQL to Elasticsearch via a synchronization tool written in NodeJS. Whenever a user on one of our sites deactivates or deletes their profile, that data needs to be removed from Elasticsearch, or deleted profiles will still show up on the website (which would make users very unhappy, so yes, it does need to be deleted).
What determines what needs to be deleted? Value / Time etc...
If I am understanding correctly, the current process obtains a list of primary keys (IDs) from MSSQL and then compares them to the equivalent ID fields in the appropriate index in Elasticsearch. Any Elasticsearch documents whose IDs are not in that list get deleted. Essentially it's syncing MSSQL to Elasticsearch.
Obviously doing a record-by-record comparison between the two is "expensive" in terms of CPU and memory. I suspect a less intensive strategy would be to add a field in MSSQL to track when a profile was deleted, have the NodeJS application read the last time it ran from a file, grab any newly deleted records, and target just those for deletion.
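For example, something along these lines (just a sketch: the getProfilesDeletedSince() helper, the DeletedAt column it would read, and the last-run bookkeeping helpers are all hypothetical, and it assumes the same @elastic/elasticsearch client as above):

// Sketch: delete only the profiles MSSQL says were removed since the last run.
const lastRun = await readLastRunTimestamp()               // e.g. read from a file (hypothetical helper)
const deletedIds = await getProfilesDeletedSince(lastRun)  // query MSSQL's DeletedAt column (hypothetical helper)

if (deletedIds.length > 0) {
  // One bulk request with explicit delete actions, instead of a delete-by-query
  // that has to scan the whole index.
  const body = deletedIds.map(id => ({ delete: { _index: config.index, _id: id } }))
  const { body: result } = await es.bulk({ body })
  if (result.errors) {
    console.warn('some deletes failed', result.items.filter(i => i.delete.error))
  }
}
await writeLastRunTimestamp(new Date())                    // hypothetical helper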
Could you / we perhaps devise an indexing strategy that would segment the data into indices that could be deleted?
I don't think so, because it's just random who deletes what. A user can delete their profile at any point, so I don't know how you'd logically group them into indices where the entire index could be deleted.
What kind of storage is this data on?
Solid-state storage on a Dell Unity SAN connected via iSCSI over 10GbE networking. As best as I can tell, I don't think the storage or network are the bottleneck.
Right now we have three master Elasticsearch nodes set up that are doing dual duty as masters and workers. (By the way, this is all running as Kubernetes pods, if that matters.)
We're considering adjusting our cluster to have standalone masters and dedicated workers. I believe that may be best practice, although from reading through the Helm chart documentation it looks fairly complicated to set up. And except when this query runs, resource usage (CPU/memory) and response times look perfectly reasonable.
I have some questions too.
What is the approximate size of all_ids?
How does the search query perform on its own, without the delete?
How many documents are deleted by a single DBQ query on average?
If the delete is heavy, is it possible to raise the frequency so that fewer documents need to be deleted per query?
How does the search query perform on its own, without the delete?
Actually, looking at our logs, it seems there's a problem and it's not actually doing any deletes. I'm not sure how nobody has screamed about this sooner. It's getting 1.2 million records from MSSQL, comparing them to 1.2 million records in Elasticsearch, and doing zero deletions. That will need to be fixed (but fixing it will likely just exacerbate the problem once it's actually doing deletions).
How many documents are deleted by a single DBQ query on average?
My impression is that this is a per-index setting. Looking at the index settings, I don't see max_terms_count set, so I assume we are using the default. I can try reducing it. Any recommendations about what to set it to?
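In case it helps, here is roughly how I would check (and, if needed, change) that setting with the @elastic/elasticsearch client. This is just a sketch, and the value shown is an example only, not a recommendation:

// Sketch: inspect the current value (include_defaults exposes the 65536 default
// when nothing is set explicitly, which seems to be our case), then override it
// per index if needed (max_terms_count is a dynamic index setting).
const { body: settings } = await es.indices.getSettings({
  index: config.index,
  include_defaults: true
})
console.log(settings[config.index].defaults.index.max_terms_count)

await es.indices.putSettings({
  index: config.index,
  body: { index: { max_terms_count: 100000 } }  // example value only
})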
There is no need to change the default. But there is a contradiction: the size of all_ids is about 1.2M, while the maximum limit is 65,536 terms. The query should raise an exception above that limit. Is 1.2M only the document count, and is the cardinality of the ids much smaller? This difference could have a big impact on this issue.
I tried and got an exception.
"root_cause" : [
{
"type" : "query_shard_exception",
"reason" : "failed to create query: The number of terms [4] used in the Terms Query request has exceeded the allowed maximum of [3]. This maximum can be set by changing the [index.max_terms_count] index level setting.",
"index_uuid" : "dS10vWDtRfmtY7MOHOac8g",
"index" : "test_max_terms_limit"
}
],
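(For context, this is roughly how such a test could be set up; the exact client calls are assumed, but the index name and the limits match the error above: create an index with index.max_terms_count set to 3, then send an ids query with 4 values.)

// Sketch of the reproduction, assuming the @elastic/elasticsearch 7.x client.
await es.indices.create({
  index: 'test_max_terms_limit',
  body: { settings: { index: { max_terms_count: 3 } } }
})

try {
  await es.search({
    index: 'test_max_terms_limit',
    body: { query: { ids: { values: ['1', '2', '3', '4'] } } }
  })
} catch (e) {
  // The client surfaces the query_shard_exception shown above.
  console.log(JSON.stringify(e.meta.body.error, null, 2))
}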
There is no need to change the default. But there is a contradiction: the size of all_ids is about 1.2M, while the maximum limit is 65,536 terms. The query should raise an exception above that limit. Is 1.2M only the document count, and is the cardinality of the ids much smaller? This difference could have a big impact on this issue.
I am struggling to understand your question/statement but maybe some additional information will help.
From the MSSQL side when I issue this command: SELECT count(id) FROM PROFILES;
I get 1219391 records.
Likewise if I look at the Elasticsearch index I see:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open p2_profiles_v002 _vMPWFr9R3-fVJKQ_a-2oA 5 2 1025139 1601054 9.3gb 3.2gb
So in regard to your question about cardinality, I think there are roughly 1 million records being compared?
I see no errors in the Elasticsearch error logs for the pods.
I asked about cardinality because there is a possibility that one id has multiple documents, and you may be able to deduplicate them before querying.
Is it possible to debug on the client and get the actual length of all_ids? If it is truly 1.2M in length and you have not changed the max_terms_count setting from its default of 65,536, ES should return an error for the DBQ query. If there were no errors in ES, something is wrong. We need to clarify this before going further.
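A quick sketch of what that client-side check could look like, right before the deleteByQuery call:

// Sketch: log how many ids are actually sent, and how many are distinct.
console.log('all_ids length:', all_ids.length)
console.log('distinct ids:', new Set(all_ids).size)
// If the length really exceeds index.max_terms_count (default 65536),
// the delete-by-query should fail with an error rather than run silently.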