Drop grok failures

I use Logstash to filter data from big files like mysqld.log and mongos.log. It receives thousands of entries from these logs, and I use grok to parse the data and display it in Kibana.

There are many entries that Logstash receives that I don't really care about, which is why these entries fail to be parsed by grok and are tagged with _grokparsefailure. Those entries take up a huge amount of space. I thought about having Logstash delete them right away, but that would make debugging impossible.

Is it possible to have entries tagged with _grokparsefailure deleted after one day? Thanks ahead!

I think the easiest way to delete them afterwards would be to create a cron job that calls the Delete by query API, filtering the data with a term and a range query.
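A minimal sketch of such a cron job (untested; assumes Elasticsearch is reachable on localhost:9200 and the indices match filebeat-*):

```shell
# crontab entry: run once a day at 03:00 and delete parse failures older than a day
0 3 * * * curl -s -X POST "localhost:9200/filebeat-*/_delete_by_query" \
  -H 'Content-Type: application/json' \
  -d '{"query":{"bool":{"filter":[{"range":{"@timestamp":{"lt":"now-1d"}}},{"term":{"tags":"_grokparsefailure"}}]}}}'
```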

Thanks for the response!

That requires the doc_id. Is it possible to generate a list of the doc_ids of documents that are tagged _grokparsefailure and are older than a day?

You don't need a list of IDs for that API. Why do you think so?

Edit: Not tested (!), but your query would probably have to look similar to this.

{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "lt": "now-1d"
            }
          }
        },
        {
          "term": {
            "tags": "_grokparsefailure"
          }
        }
      ]
    }
  }
}

Thanks for the response! If I understand right, it should be something like this?

curl -X POST "localhost:9200/filebeat-*/_delete_by_query?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "tags": "_grokparsefailure"
    }
  }
}
'

If that's right, I guess the only parameter I couldn't find is the one that says whether it's older than X days.

Woah, that's incredible. Thank you so much!

Would it still be a curl POST request with _delete_by_query? Would I have to provide it a certificate? :

curl -X POST "localhost:9200/filebeat-7.9.0/_delete_by_query?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "lt": "now-1d"
            }
          }
        },
        {
          "term": {
            "tags": "_grokparsefailure"
          }
        }
      ]
    }
  }
}'

Thanks ahead

Before testing I found a document that is recent and has _grokparsefailure. I ran this command:

curl -k -X POST "https://elastic:password@localhost:9200/cleandata/_delete_by_query?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "lt": "now-2d"
            }
          }
        },
        {
          "term": {
            "tags": "_grokparsefailure"
          }
        }
      ]
    }
  }
}'

And got this in response:

{
  "took" : 9382,
  "timed_out" : false,
  "total" : 76957,
  "deleted" : 76957,
  "batches" : 77,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}

When I refreshed the document I saw it's still there : https://i.imgur.com/5PwNIGa.png

Did I do something wrong? Did I miss anything?

You're deleting documents that are older than two days and a document from today survived. Sounds like everything went according to plan?

May the lord have mercy on my stupid soul. You're right. My bad. It worked!

Thank you SO MUCH Jenni!!

Do you know if the syntax accepts wildcards? I tested it on a demo environment, and in production the index is different every day. For example, the index for today is filebeat-7.9.0-2020.09.01.

Do you know if it accepts things like:

@localhost:9200/filebeat* <- wildcard
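Untested against your cluster, but _delete_by_query accepts the same multi-target index syntax as search, so a wildcard pattern should work. Something like this (reusing the credentials and query from the earlier post):

```shell
# The filebeat-* pattern matches every daily index,
# e.g. filebeat-7.9.0-2020.09.01, filebeat-7.9.0-2020.09.02, ...
curl -k -X POST "https://elastic:password@localhost:9200/filebeat-*/_delete_by_query?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"query":{"bool":{"filter":[{"range":{"@timestamp":{"lt":"now-1d"}}},{"term":{"tags":"_grokparsefailure"}}]}}}'
```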