Drop grok failures

I use Logstash to filter data from large log files like mysqld.log and mongos.log. It receives thousands of entries from these logs, and I use grok to parse the data and display it in Kibana.

There are many entries that Logstash receives that I don't really care about, which is why grok fails to parse them and tags them with _grokparsefailure. Those entries take up a huge amount of space. I thought about having Logstash delete them right away, but that would make debugging impossible.

Is it possible to have entries that are tagged with _grokparsefailure deleted after one day? Thanks ahead!

I think the easiest way to delete them afterwards would be to create a cron job that calls the Delete by query API filtering the data with a term and a range query.
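For the scheduling part, a crontab entry could simply invoke a small script that sends the Delete by query request. The script path, schedule, and log file below are placeholder assumptions, not anything your setup already has:

```shell
# Crontab entry (placeholder paths/schedule): run the cleanup daily at 03:00.
# /usr/local/bin/delete-grok-failures.sh would contain the curl call to the
# _delete_by_query endpoint with the term + range query.
0 3 * * * /usr/local/bin/delete-grok-failures.sh >> /var/log/grok-cleanup.log 2>&1
```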


Thanks for the response!

That requires the doc_id. Is it possible to generate a list of the doc_ids of documents that are tagged _grokparsefailure and are older than a day?

You don't need a list of IDs for that API. Why do you think so?

Edit: Not tested (!), but your query would probably have to look similar to this.

{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "lt": "now-1d"
            }
          }
        },
        {
          "term": {
            "tags": "_grokparsefailure"
          }
        }
      ]
    }
  }
}

Thanks for the response! If I understand right, it should be something like this?

curl -X POST "localhost:9200/filebeat-*/_delete_by_query?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "tags": "_grokparsefailure"
    }
  }
}
'

If that's right, I guess the only parameter I couldn't find is the one that says whether it's older than X days.

Woah, that's incredible. Thank you so much!

Would it still be a curl POST request with _delete_by_query? Would I have to provide it a certificate? :

curl -X POST "localhost:9200/filebeat-7.9.0/_delete_by_query?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "lt": "now-1d"
            }
          }
        },
        {
          "term": {
            "tags": "_grokparsefailure"
          }
        }
      ]
    }
  }
}'

Thanks ahead

Before testing I found a document that is recent and has _grokparsefailure. I ran this command:

curl -k -X POST "https://elastic:password@localhost:9200/cleandata/_delete_by_query?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "lt": "now-2d"
            }
          }
        },
        {
          "term": {
            "tags": "_grokparsefailure"
          }
        }
      ]
    }
  }
}'

And got this in response:

{
  "took" : 9382,
  "timed_out" : false,
  "total" : 76957,
  "deleted" : 76957,
  "batches" : 77,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}

When I refreshed the document I saw it's still there : https://i.imgur.com/5PwNIGa.png

Did I do something wrong? Did I miss anything?

You're deleting documents that are older than two days and a document from today survived. Sounds like everything went according to plan?


May the lord have mercy on my stupid soul. You're right. My bad. It worked!

Thank you SO MUCH Jenni!!


Do you know if the syntax accepts wildcards? I tested it on a demo environment, and in production the index is different every day. For example, the index for today is filebeat-7.9.0-2020.09.01.

Do you know if it accepts things like:

@localhost:9200/filebeat* <- wildcard
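i.e., something like this (untested on my side; the credentials are the same placeholders as before):

```shell
# Untested sketch: the same delete-by-query as above, but against a wildcard
# index pattern so it covers the daily filebeat-7.9.0-YYYY.MM.DD indices.
curl -k -X POST "https://elastic:password@localhost:9200/filebeat-*/_delete_by_query?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "lt": "now-1d" } } },
        { "term": { "tags": "_grokparsefailure" } }
      ]
    }
  }
}'
```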