We manage a small cluster that has been collecting logs from Kubernetes for a few months. It holds typical application JSON logs, but also Kubernetes health metrics. The health metrics make up the vast majority of our records by count, and I'd be willing to bet by on-disk space as well.
I would like to remove all of these unwanted logs but leave the others untouched. I was experimenting with something like
But this might not do what I expect (it certainly seems limited to 1000 documents at a time, which won't suffice for my 1 billion+ records). While we research how to avoid ingesting these logs in the first place, what is the right strategy for removing the ones that match the above query?
Using the Delete By Query API to delete 1 billion+ documents is definitely a costly operation.
It may be worth considering reindexing the other documents into a new index instead.
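To make the reindex suggestion concrete, here is a minimal sketch of a `_reindex` request body that copies everything except the unwanted documents. The index names (`logs-old`, `logs-clean`), the namespace value, and the field name `kubernetes.namespace_name` are assumptions for illustration only; substitute whatever your own mapping uses, and note that a `terms` query expects an exact-match (keyword) field.

```python
import json


def build_reindex_body(source_index, dest_index, unwanted_namespaces,
                       namespace_field="kubernetes.namespace_name"):
    """Build a _reindex body that copies every document EXCEPT those whose
    namespace field matches one of the unwanted values.

    All names here are hypothetical; adjust them to your own indices and mapping.
    """
    return {
        "source": {
            "index": source_index,
            "query": {
                "bool": {
                    # must_not inverts the match: keep only the logs we want
                    "must_not": [
                        {"terms": {namespace_field: list(unwanted_namespaces)}}
                    ]
                }
            },
        },
        "dest": {"index": dest_index},
    }


body = build_reindex_body("logs-old", "logs-clean", ["kube-system"])
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of `POST _reindex?wait_for_completion=false`, which runs the copy as a background task. Once the new index has been checked, deleting the old index frees its disk space in one step, which is usually far cheaper than deleting documents one by one.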
I'm not at all experienced with ES, so that's possible. I'm not well versed in this system, or in what I need to do to reduce the space consumed by unneeded logs that are intermixed in an index with ones I do want. Can you point me to an example or documentation that is relevant to my case? Even some pseudo-code of the actions I need to take, so I have the proper terminology when learning more about it myself, would be much appreciated!
For anyone else, I essentially did this. My query was a bit more involved since I had a couple of different namespaces. I used the Python library to make it easier, but this did the trick.
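For readers who want to try the same thing with the Python client, a rough sketch might look like the following. It is not the exact code used above: the function name, index names, namespace values, and the `kubernetes.namespace_name` field are all hypothetical, and it assumes a recent `elasticsearch` Python client where `reindex` accepts `source`, `dest`, and `wait_for_completion` as keyword arguments (older clients take a single `body=` argument instead).

```python
def start_filtered_reindex(client, source_index, dest_index, unwanted_namespaces,
                           namespace_field="kubernetes.namespace_name"):
    """Start a background reindex that skips documents from unwanted namespaces.

    `client` is assumed to be an elasticsearch.Elasticsearch instance.
    The field and index names are placeholders, not taken from the thread.
    """
    source = {
        "index": source_index,
        "query": {
            "bool": {
                # Exclude every document whose namespace is in the unwanted set
                "must_not": [
                    {"terms": {namespace_field: sorted(unwanted_namespaces)}}
                ]
            }
        },
    }
    # wait_for_completion=False asks Elasticsearch to run the copy as a task
    return client.reindex(source=source, dest={"index": dest_index},
                          wait_for_completion=False)


# Example usage (requires a reachable cluster and the elasticsearch package):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# resp = start_filtered_reindex(es, "logs-old", "logs-clean",
#                               {"kube-system", "monitoring"})
# print(resp["task"])  # task id to poll via the Tasks API
```

With over a billion records the copy can take a long time, so running it as a task and polling for completion tends to be safer than holding one HTTP request open.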