Best way to delete a large number of documents

About 10 months ago we investigated adding Elasticsearch logging to a range of consumer electronics products to track which features were being used. I took a simplistic approach with a single index and wrote to it via the RESTful API. The results were extremely useful and Kibana gave us most of what we wanted out of the box, so we didn't spend much time analyzing whether the approach was appropriate. We also never needed to study the stack in depth, so my knowledge of it is still extremely rudimentary.

We now have around 20,000 units and over 200 million documents, but our Kibana visualizations have become almost unusably slow. Around 60-70% of the documents are superfluous: they record periods of inactivity that could be inferred from the other documents.

Although I realize our design is a little naïve and needs a rethink, the first step seems to be to remove the unnecessary documents. They are easily identifiable by a single value of a single field, so I tried _delete_by_query. It gives the result I need but is far too slow: in experiments deleting a limited number of documents I am finding it takes around 20-30 seconds per 1,000 deletes, which works out at over 30 days to delete them all! I am using default arguments except for "size=10000" ("max_docs" was not accepted) and "wait_for_completion=false". This is on a live server running 7.4.0.
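For reference, the request I'm running looks roughly like this (Kibana Dev Tools format; the index name 'foobar' and the field/value pair that identifies the inactivity documents are placeholders, not the real names):

```
# Delete-by-query as currently run: default settings apart from
# size and wait_for_completion (runs as a background task).
POST /foobar/_delete_by_query?size=10000&wait_for_completion=false
{
  "query": {
    "term": { "activity": "idle" }
  }
}
```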

I did find another thread with a very similar question (https://discuss.elastic.co/t/slow-deletes/45597) which seemed to suggest there was no alternative.

Are there any arguments I could set to the delete that might speed it up?

Are there any admin tools that we could use that would be quicker even if it meant going offline to do it?

The thread linked above suggests a possible technique of performing concurrent deletes while ensuring there are no collisions. The documents I want to delete share a common field value but differ in other fields, so is it viable to construct multiple _delete_by_query tasks that each target a specific subset of the data?
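For example, I see that _delete_by_query documents a "slices" parameter that is supposed to split the job into non-overlapping sub-requests. Would something like this (same placeholder index and field as above) be the right way to parallelize it?

```
# Sliced delete-by-query: with slices=auto Elasticsearch partitions
# the work (typically one slice per shard) so the parallel
# sub-requests do not overlap with each other.
POST /foobar/_delete_by_query?slices=auto&wait_for_completion=false
{
  "query": {
    "term": { "activity": "idle" }
  }
}
```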

Or perhaps a way to reindex it and ignore the unwanted documents?

Thanks.

You could try expanding your cluster size temporarily, to give it more resources to process the deletes, then shrink it back.

As you are deleting the majority of the documents, it may be better to reindex the documents you want to keep into new indices and then simply delete the old indices. This might also allow you to switch to time-based indices if you are not already using them (which generally makes deleting old data much more efficient).
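A rough sketch of that approach, assuming a placeholder field/value for the superfluous documents and 'foobar' / 'foobar-lite' as the old and new index names: reindex with a query that keeps only the documents you want, then delete the old index once you have verified the new one.

```
# Copy only the documents worth keeping into a new index; the old
# index can be deleted once the new one has been checked.
POST /_reindex?wait_for_completion=false
{
  "source": {
    "index": "foobar",
    "query": {
      "bool": {
        "must_not": { "term": { "activity": "idle" } }
      }
    }
  },
  "dest": { "index": "foobar-lite" }
}
```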


Thanks for the replies.

Creating a new index was really what I was getting at with my last comment, but I hadn't appreciated that you can effectively do a 'reindex by query', which I have now found in the docs.

I have started a reindex that excludes the superfluous documents; it seems to be working and should complete within 2-3 hours at the current rate. Thanks.

How do I switch my clients from the old index (e.g. 'foobar') to the new one (e.g. 'foobar-lite'), given that they are actively writing to 'foobar' via the RESTful API? I have experimented with a local Docker installation of ELK, and it looks like creating an alias so that 'foobar' refers to 'foobar-lite' does what I want (of course this requires deleting the old index first, as an alias cannot have the same name as an existing index, which will be a nervous moment on the real server).
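Concretely, I am thinking of something like this (same example index names as above): delete the old physical index, then add an alias with the old name pointing at the new index, so the clients can keep writing to 'foobar' unchanged.

```
# Remove the old physical index, then point an alias with the old
# name at the new index so existing clients need no changes.
DELETE /foobar

POST /_aliases
{
  "actions": [
    { "add": { "index": "foobar-lite", "alias": "foobar" } }
  ]
}
```

One thing I am wary of is the brief window between the DELETE and the alias creation: during that gap, client writes to 'foobar' will either fail or auto-create a fresh 'foobar' index (depending on whether automatic index creation is enabled), so the two calls would need to run back to back.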

Is this a reasonable strategy?
