Removing unneeded documents

jkaoiadjmdna · March 18, 2019, 4:21pm

We manage a small cluster that has been collecting logs from Kubernetes for a few months. It holds typical application JSON logs, but also Kubernetes health metrics. The latter make up the vast majority of our records by count, and I'd be willing to bet by on-disk space as well.

I would like to remove all of these unwanted logs but leave the others untouched. I was experimenting with something like

POST /_all/_delete_by_query
{
  "query": {
    "match": {
      "kubernetes.namespace": {
        "query": "kube-system"
      }
    }
  }
}

But this might not do what I expect (it certainly seems limited to 1000 documents at a time, which won't suffice for my 1 billion+ records). While we research how best to avoid ingesting these logs in the first place, what is the right strategy for removing the ones that match the above query?

dadoonet · March 18, 2019, 4:57pm

Using the delete by query API to delete 1bln+ documents is definitely a costly operation.
It may be worth reindexing the other documents into a new index instead.

jkaoiadjmdna · March 18, 2019, 6:22pm

I'm not at all experienced with ES, so that's possible. I don't yet know what I need to do to reduce the space consumed by unneeded logs that are intermixed in an index with ones I do want. Can you point me to an example or documentation relevant to my case? Even some pseudo-code of the steps I need to take, so I have the proper terminology when learning more about it myself, would be much appreciated!

dadoonet · March 18, 2019, 7:30pm

I was thinking of something like this:

POST _reindex
{
  "source": {
    "index": "*",
    "query": {
      "bool": {
        "must_not": [
          {
            "match": {
              "kubernetes.namespace": {
                "query": "kube-system"
              }
            }
          }
        ]
      }
    }
  },
  "dest": {
    "index": "tmp_index_name"
  },
  "script": {
    "source": "ctx._index=\"new_\"+ctx._index",
    "lang": "painless"
  }
}

If you are using daily indices with a short retention period, you could also just wait for the oldest indices to be removed, maybe?
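
To illustrate the retention idea, here is a minimal sketch that picks out daily indices older than a retention window. The `logstash-YYYY.MM.DD` naming scheme, the 7-day window, and the helper name `expired_daily_indices` are assumptions for illustration; the thread does not say how the indices are actually named.

```python
from datetime import date, datetime, timedelta


def expired_daily_indices(index_names, today, retention_days, prefix="logstash-"):
    """Return daily indices whose date suffix falls before the retention cutoff.

    Assumes names look like '<prefix>YYYY.MM.DD' (hypothetical naming scheme).
    """
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue  # not one of the daily log indices
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix is not a date, so leave this index alone
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


# Example index names (made up for this sketch).
names = [
    "logstash-2019.03.01",
    "logstash-2019.03.17",
    "logstash-2019.03.18",
    ".kibana",
]
print(expired_daily_indices(names, today=date(2019, 3, 18), retention_days=7))
```

Dropping a whole index is much cheaper than deleting documents inside it, which is why this route is attractive when retention is short. With the Python client, each returned name could then be passed to `client.indices.delete(index=name)`.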

jkaoiadjmdna · April 10, 2019, 4:15pm

For anyone else: I essentially did this. My query was a bit more involved since I had a couple of different namespaces. I used the Python library to make it easier, but this did the trick.
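
The exact script was not posted, so the following is only a sketch of what a multi-namespace version of the reindex request might look like in Python. The second namespace (`monitoring`), the `logstash-*` source pattern, and the helper names are hypothetical. `start_reindex` assumes a client object from the `elasticsearch` package of that era (6.x/7.x), whose `reindex` method accepts a `body` argument.

```python
def build_reindex_body(excluded_namespaces, source_index="logstash-*", dest_prefix="new_"):
    """Build a _reindex body that copies every document NOT in the given namespaces."""
    # One must_not clause per namespace: a document matching any of them is skipped.
    must_not = [
        {"match": {"kubernetes.namespace": {"query": ns}}}
        for ns in excluded_namespaces
    ]
    return {
        "source": {
            "index": source_index,
            # The filter has to live inside "source" to take effect.
            "query": {"bool": {"must_not": must_not}},
        },
        # Placeholder destination; the script below overrides it per document.
        "dest": {"index": "tmp_index_name"},
        "script": {
            "lang": "painless",
            "source": 'ctx._index = "{}" + ctx._index'.format(dest_prefix),
        },
    }


def start_reindex(client, body):
    """Submit the reindex as a background task instead of blocking on the request."""
    return client.reindex(body=body, wait_for_completion=False)


# "monitoring" is a made-up second namespace for illustration.
body = build_reindex_body(["kube-system", "monitoring"])
print(body["source"]["query"])
```

Running the request with `wait_for_completion=False` returns a task reference rather than holding the HTTP connection open, which seems the safer choice for a copy of this size. Once the new indices are verified, the originals can be deleted to reclaim the disk space.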
