About 10 months ago we investigated adding Elasticsearch logging to a range of consumer electronics products to track which features were being used. I used a simplistic approach with a single index and wrote to it via the RESTFUL API. The results were extremely useful and Kibana gave us most of what we wanted out of the box so we didn’t take too much time analyzing whether the approach we had taken was appropriate and we didn’t have the need to study the stack in depth so my knowledge of it is extremely rudimentary.
We now have around 20,000 units and over 200 million documents but our Kibana visualizations have become almost unusably slow. Around 60-70% of the documents are superfluous as they record periods of inactivity which could be inferred from the other documents.
Although I realize our design is a little naïve and needs a rethink, the first step seems to be to remove the unnecessary documents. They are easily identifiable by a single value of a single field so I tried ‘_delete_by_query’. This is giving the result I need but is far too slow and doing experiments deleting a limited number of documents, I am finding that it takes around 20-30 secs per 1000 deletes which is over 30 days to delete them all! This is using default arguments except “size=10000” (“max_docs” was not accepted) and “wait_for_completion=false”). This is on a live server running 7.4.0
I did find another thread with a very similar question (https://discuss.elastic.co/t/slow-deletes/45597) which seemed to suggest there was no alternative.
Are there any arguments I could set to the delete that might speed it up?
Are there any admin tools that we could use that would be quicker even if it meant going offline to do it?
The thread linked above includes a possible technique of performing concurrent deletes but ensuring there are no collisions. The documents I want to delete have a common field value but will differ in other fields so is it viable to construct multiple _delete_by_query tasks that target specific sub-sets of the data?
Or perhaps a way to reindex it and ignore the unwanted documents?
Thanks.