Error: "tags" => [ [0] "_grokparsefailure" ] (grok filter error)

Hello,

I am using grok filters for parsing messages and I can see parsing failures but can't figure out what's causing the issue.

Grok Filter:

filter {
      # Create a copy of original message
      mutate {
        add_field => {
          "[@metadata][copyOfMessage]" => "%{[message]}"
        }
      }
      # split message
      mutate {
        split => {
          "[@metadata][copyOfMessage]" => "|"
        }
      }
      if [@metadata][copyOfMessage][6] =~ /^\/api\// {
        grok {
          break_on_match => false
          match => { "message" => "%{DATA:timestamp_local}\|%{NUMBER:duration}\|%{WORD:requesttype}\|%{IP:clientip}\|%{DATA:username}\|%{WORD:method}\|%{DATA:resource}\|%{DATA:protocol}\|%{NUMBER:statuscode}\|%{NUMBER:bytes}" }
        }
        grok {
          break_on_match => false
          match => { "resource" => "/%{DATA:dropFirstValue}/%{DATA:dropSecondValue}/%{DATA:repo}/%{GREEDYDATA:resource_path}" }
        }

        # Drop the first two values when the resource starts with "/api/"
        mutate {
          remove_field => ["dropFirstValue", "dropSecondValue"]
        }
      }
      else{
        grok {
            # Enable multiple matchers
          break_on_match => false

          match => { "message" => "%{DATA:timestamp_local}\|%{NUMBER:duration}\|%{WORD:requesttype}\|%{IP:clientip}\|%{DATA:username}\|%{WORD:method}\|%{DATA:resource}\|%{DATA:protocol}\|%{NUMBER:statuscode}\|%{NUMBER:bytes}" }

            # Extract repo and path
          match => { "resource" => "/%{DATA:repo}/%{GREEDYDATA:resource_path}"}

        }
      }
# Extract resource name
      grok {
        break_on_match => false
        match => { "resource_path" => "(?<resource_name>[^/]+$)" }
      }
# Extract file extension
      grok {
        break_on_match => false
        match => { "resource_path" => "(?<resource_type>[^.]+$)" }
      }
# Parse date field
      date {
        timezone => "UTC"
        match => [ "timestamp_local" , "yyyyMMddHHmmss" ]
        target => "timestamp_object"
      }
      mutate {
        add_field => { "time" => "%{time}"}
      }
      ruby {
        code => "event.set('timestamp', event.get('timestamp_object').to_i * 1000);event.set('time',event.get('timestamp_object').to_i*1000000000 + rand(100000000))"
      }
    }

Message:
20190815175019|9599|REQUEST|14.56.55.120|anonymous|POST|/api/test/Lighter-test-group|HTTP/1.1|200|452

Output:

{
            "duration" => "9599",
           "timestamp" => 1565891419000,
                 "env" => "test",
                "time" => 1565891419078802209,
          "@timestamp" => 2019-08-15T19:08:04.339Z,
            "@version" => "1",
                "path" => "/testing/test.log",
            "username" => "anonymous",
                "host" => "myself-cg45",
            "resource" => "/api/test/Lighter-group",
     "timestamp_local" => "20190815175019",
              "method" => "POST",
          "statuscode" => "200",
             "message" => "20190815175019|9599|REQUEST|14.56.55.120|anonymous|POST|/api/test/Lighter-test-group|HTTP/1.1|200|452",
            "protocol" => "HTTP/1.1",
               "bytes" => "452",
                "site" => "XYZ",
                "tags" => [
        [0] "_grokparsefailure"
    ],
            "clientip" => "14.56.55.120",
         "requesttype" => "REQUEST",
    "timestamp_object" => 2019-08-15T17:50:19.000Z
}

I would not expect a split of a field in [@metadata] to work due to this issue.

Instead of /%{DATA:dropFirstValue}/%{DATA:dropSecondValue}/ and then removing the fields you can just use /%{DATA}/%{DATA}/. Personally I would use /[^/]+/[^/]+
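
A minimal sketch of that suggestion applied to the /api/ branch, assuming the rest of the filter stays unchanged. Because the first two path segments are matched without being captured, the mutate with remove_field is no longer needed:

```
grok {
  break_on_match => false
  # The first two path segments are matched but not stored in any field.
  match => { "resource" => "/[^/]+/[^/]+/%{DATA:repo}/%{GREEDYDATA:resource_path}" }
}
```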

Thanks for the info. What other options do we have to extract the fields from a message such as
20190815175019|9599|REQUEST|14.56.55.120|anonymous|POST|/api/test/Lighter-test-group|HTTP/1.1|200|452
and check whether the 7th field starts with /api?
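
One possibility, sketched here under the assumption that every line has the same ten pipe-delimited fields, is to run the common grok once and then test the extracted [resource] field, so that no split copy of the message is needed. The api_request and other_request tags are hypothetical placeholders for whatever each branch should really do:

```
filter {
  # Parse the pipe-delimited line once, before any branching.
  grok {
    match => { "message" => "%{DATA:timestamp_local}\|%{NUMBER:duration}\|%{WORD:requesttype}\|%{IP:clientip}\|%{DATA:username}\|%{WORD:method}\|%{DATA:resource}\|%{DATA:protocol}\|%{NUMBER:statuscode}\|%{NUMBER:bytes}" }
  }
  # Branch on the extracted 7th field rather than on a split copy of the message.
  if [resource] =~ /^\/api\// {
    mutate { add_tag => ["api_request"] }    # placeholder: /api/ handling goes here
  } else {
    mutate { add_tag => ["other_request"] }  # placeholder: handling for everything else
  }
}
```

This also removes the need to repeat the same message pattern in every branch.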

Actually the split works just fine. Your grok in the case where [resource] starts with /api assumes 4 parts to the path, but in that message it only has three. Perhaps

grok { match => { "resource" => "(/[^\/]+)?/[^\/]+/%{DATA:repo}/%{GREEDYDATA:resource_path}" } }

would work?

Actually, the message format varies, and the resource can have anywhere from 0 to 10 parts.

My understanding of the filter was that it drops the first two parts no matter how many there are, and that if the resource has 3 parts, the 3rd would be parsed as repo and the 4th would be an empty string.

Is it possible to add a string length check here, so that an empty resource_path is shown as an empty string rather than causing a grok parse failure? Or could we give the field a default value, or remove it?

Hello,

I tried different regexes, but none of them seems to solve the issue. Is there any way to fix this in the grok filter?

You haven't really explained what the issue is. You gave an example, for which I gave you a possible solution, then said there are other possible inputs. That is not enough information to offer a solution.

I apologize for not providing the required info. I've used the examples below for testing:

20190815175019|9599|REQUEST|14.56.55.120|anonymous|POST|/api/test/Lighter-test-group|HTTP/1.1|200|452
20190815175019|9599|REQUEST|14.56.55.120|anonymous|POST|/list/Lighter-test-group|HTTP/1.1|200|452
20190815175019|9599|REQUEST|14.56.55.120|anonymous|POST|/api/test/Lighter-test-group/2.0|HTTP/1.1|200|452
20190815175019|9599|REQUEST|14.56.55.120|anonymous|POST|/list/Lighter-test-group/xyz/123|HTTP/1.1|200|452
20190815175019|9599|REQUEST|14.56.55.120|anonymous|POST|/api/test/Lighter-test-group/2.0/1.2|HTTP/1.1|200|452

and the grok filter is

filter {
      # Create a copy of original message
      mutate {
        add_field => {
          "[@metadata][copyOfMessage]" => "%{[message]}"
        }
      }
      # split message
      mutate {
        split => {
          "[@metadata][copyOfMessage]" => "|"
        }
      }
      if [@metadata][copyOfMessage][6] =~ /^\/api\// {
        grok {
          break_on_match => false
          match => { "message" => "%{DATA:timestamp_local}\|%{NUMBER:duration}\|%{WORD:requesttype}\|%{IP:clientip}\|%{DATA:username}\|%{WORD:method}\|%{DATA:resource}\|%{DATA:protocol}\|%{NUMBER:statuscode}\|%{NUMBER:bytes}" }
        }
        grok {
          break_on_match => false
          match => { "resource" => "(/[^\/]+)?/[^\/]+/%{DATA:repo}/%{GREEDYDATA:resource_path}" }
        }

      }
      else if [@metadata][copyOfMessage][6] =~ /^\/list\// or [@metadata][copyOfMessage][6] =~ /^\/simple\// {
        grok {
          break_on_match => false
          match => { "message" => "%{DATA:timestamp_local}\|%{NUMBER:duration}\|%{WORD:requesttype}\|%{IP:clientip}\|%{DATA:username}\|%{WORD:method}\|%{DATA:resource}\|%{DATA:protocol}\|%{NUMBER:statuscode}\|%{NUMBER:bytes}" }
        }
        grok {
          break_on_match => false
          match => { "resource" => "/%{DATA:dropFirstValue}/%{DATA:repo}/%{GREEDYDATA:resource_path}" }
        }

        # Drop the first value when the resource starts with "/list/" or "/simple/"
        mutate {
           remove_field => ["dropFirstValue"]
        }
      }
      else{
        grok {
            # Enable multiple matchers
          break_on_match => false

          match => { "message" => "%{DATA:timestamp_local}\|%{NUMBER:duration}\|%{WORD:requesttype}\|%{IP:clientip}\|%{DATA:username}\|%{WORD:method}\|%{DATA:resource}\|%{DATA:protocol}\|%{NUMBER:statuscode}\|%{NUMBER:bytes}" }

            # Extract repo and path
          match => { "resource" => "/%{DATA:repo}/%{GREEDYDATA:resource_path}"}

        }
      }
# Extract resource name
      grok {
        break_on_match => false
        match => { "resource_path" => "(?<resource_name>[^/]+$)" }
      }
# Extract file extension
      grok {
        break_on_match => false
        match => { "resource_path" => "(?<resource_type>[^.]+$)" }
      }
# Parse date field
      date {
        timezone => "UTC"
        match => [ "timestamp_local" , "yyyyMMddHHmmss" ]
        target => "timestamp_object"
      }
      mutate {
        add_field => { "time" => "%{time}"}
      }
      ruby {
        code => "event.set('timestamp', event.get('timestamp_object').to_i * 1000);event.set('time',event.get('timestamp_object').to_i*1000000000 + rand(100000000))"
      }
    }

From what I see the value of repo is getting parsed differently for different messages.

For resources starting with /api/, the 3rd part has to be the repo, but

20190815175019|9599|REQUEST|14.56.55.120|anonymous|POST|/api/test/Lighter-test-group|HTTP/1.1|200|452
has a value of "test" as repo

and 20190815175019|9599|REQUEST|14.56.55.120|anonymous|POST|/api/test/Lighter-test-group/2.0|HTTP/1.1|200|452
has the right repo value which is "Lighter-test-group"

If I don't use the regex provided, a resource with three parts gives a grok parse failure, but anything with more than 3 parts works fine.

I've tried using

      match => { "resource" => "^(\/)+[^\/]+(\/)[^\/]+/%{DATA:repo}/%{GREEDYDATA:resource_path}" }

which would match /{firstword}/{secondword} but the same issue persists.

If the resource has three parts, we get a grok parse failure, but anything with more than 3 parts works fine.

I've looked at GreedyData and Data grok syntax which are

DATA .*?
GREEDYDATA .*

If they accept an empty string, I don't see why we are getting a grok parse failure. Is there any way to debug this further?
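
One way to debug this, as a rough sketch: run a throwaway pipeline that feeds a single sample line through only the groks under test and prints the resulting event. The generator input and the rubydebug codec are standard Logstash plugins; the sample line and the patterns are copied from this thread.

```
input {
  generator {
    # Emit one copy of a sample line whose resource has only three parts.
    lines => [ "20190815175019|9599|REQUEST|14.56.55.120|anonymous|POST|/api/test/Lighter-test-group|HTTP/1.1|200|452" ]
    count => 1
  }
}
filter {
  grok {
    match => { "message" => "%{DATA:timestamp_local}\|%{NUMBER:duration}\|%{WORD:requesttype}\|%{IP:clientip}\|%{DATA:username}\|%{WORD:method}\|%{DATA:resource}\|%{DATA:protocol}\|%{NUMBER:statuscode}\|%{NUMBER:bytes}" }
  }
  grok {
    # DATA and GREEDYDATA may each match an empty string, but every literal "/"
    # written between them must still be present in the field for a match.
    match => { "resource" => "/%{DATA:dropFirstValue}/%{DATA:dropSecondValue}/%{DATA:repo}/%{GREEDYDATA:resource_path}" }
  }
}
output {
  # Print the full event, including any tags, to the console.
  stdout { codec => rubydebug }
}
```

Saving this to a file, running it with bin/logstash -f and the file name, and then editing the resource in the sample line makes it easy to see which shapes of resource produce the tag.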

So there is no grok parse failure if the resource has a trailing "/",
i.e. after changing
/api/test/Lighter-test-group to /api/test/Lighter-test-group/
and
/list/Lighter-test-group to /list/Lighter-test-group/

From this I understand that an empty string or a space works fine with DATA/GREEDYDATA, but is there a workaround for resources that do not end with a slash?

The only one of your examples that gets a _grokparsefailure is

        "resource" => "/list/Lighter-test-group",

To fix that, change the second grok in the branch that handles /list/ to be

match => { "resource" => "(/%{DATA})?/%{DATA:repo}/%{GREEDYDATA:resource_path}" }

Grok filter:

  if [@metadata][copyOfMessage][6] =~ /^\/api\// {
    grok {
      break_on_match => false
      match => { "message" => "%{DATA:timestamp_local}\|%{NUMBER:duration}\|%{WORD:requesttype}\|%{IP:clientip}\|%{DATA:username}\|%{WORD:method}\|%{DATA:resource}\|%{DATA:protocol}\|%{NUMBER:statuscode}\|%{NUMBER:bytes}" }
    }
    grok {
      break_on_match => false
      match => { "resource" => "(/[^\/]+)?/[^\/]+/%{DATA:repo}/%{GREEDYDATA:resource_path}" }
    }

  }
  else if [@metadata][copyOfMessage][6] =~ /^\/list\// or [@metadata][copyOfMessage][6] =~ /^\/simple\// {
    grok {
      break_on_match => false
      match => { "message" => "%{DATA:timestamp_local}\|%{NUMBER:duration}\|%{WORD:requesttype}\|%{IP:clientip}\|%{DATA:username}\|%{WORD:method}\|%{DATA:resource}\|%{DATA:protocol}\|%{NUMBER:statuscode}\|%{NUMBER:bytes}" }
    }
    grok {
      break_on_match => false
      match => { "resource" => "(/%{DATA})?/%{DATA:repo}/%{GREEDYDATA:resource_path}" }
    }
  }

Test messages:

20190815175019|9599|REQUEST|14.56.55.120|anonymous|POST|/api/test/Lighter-test-group|HTTP/1.1|200|452
20190815175019|9599|REQUEST|14.56.55.120|anonymous|POST|/list/Lighter-test-group|HTTP/1.1|200|452
20190815175019|9599|REQUEST|14.56.55.120|anonymous|POST|/api/test/Lighter-test-group/2.0|HTTP/1.1|200|452
20190815175019|9599|REQUEST|14.56.55.120|anonymous|POST|/list/Lighter-test-group/xyz/123|HTTP/1.1|200|452
20190815175019|9599|REQUEST|14.56.55.120|anonymous|POST|/api/test/Lighter-test-group/2.0/1.2|HTTP/1.1|200|452

The output:

For /api/test/Lighter-test-group/2.0, repo = Lighter-test-group
For /api/test/Lighter-test-group, repo = test
For /list/Lighter-test-group, repo = list
For /api/test/Lighter-test-group/2.0/1.2, repo = Lighter-test-group
For /list/Lighter-test-group/xyz/123, repo = Lighter-test-group

There are no grokparsefailures, but the messages are parsed inconsistently: for resources starting with list or api, the repo value comes out differently from one message to the next. All of these messages need to be parsed as repo = Lighter-test-group.

In that case, make the resource_path optional rather than the first element:

 match => { "resource" => "/%{DATA}/%{DATA:repo}(/%{GREEDYDATA:resource_path})?" }

I tried this, but I'm getting a grok parse failure.

match => { "resource" => "/[^/]+/(?<repo>[^/]+)(/%{GREEDYDATA:resource_path})?" }

That extracts repo correctly. The later groks give a _grokparsefailure because resource_path does not exist in one case.
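
A sketch of one way to avoid that, assuming the two later groks are only meaningful when a path follows the repo: wrap them in a conditional so that they are skipped whenever resource_path was not captured.

```
# Only look inside resource_path when the earlier grok actually captured it.
if [resource_path] {
  # Extract resource name
  grok {
    break_on_match => false
    match => { "resource_path" => "(?<resource_name>[^/]+$)" }
  }
  # Extract file extension
  grok {
    break_on_match => false
    match => { "resource_path" => "(?<resource_type>[^.]+$)" }
  }
}
```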

Grok filters:

For list:

      match => { "resource" => "/[^/]+/(?<repo>[^/]+)(/%{GREEDYDATA:resource_path})?" }

For api:

      match => { "resource" => "(/[^\/]+)?/[^/]+/(?<repo>[^/]+)(/%{GREEDYDATA:resource_path})?" }

I've used these filters with the same input requests, but for /list/Lighter-test-group and /api/test/Lighter-test-group the resource and repo are parsed correctly and I still see a grok parse failure.

Sample output:

{
            "username" => "anonymous",
          "@timestamp" => 2019-08-21T13:29:26.860Z,
              "method" => "POST",
            "resource" => "/api/test/Lighter-test-group",
           "timestamp" => 1565891419000,
          "statuscode" => "200",
                "host" => "user",
            "protocol" => "HTTP/1.1",
    "timestamp_object" => 2019-08-15T17:50:19.000Z,
                "path" => "/Users/hack/test-arti.log",
                "repo" => "Lighter-test-group",
             "message" => "20190815175019|9599|REQUEST|14.56.55.120|anonymous|POST|/api/test/Lighter-test-group|HTTP/1.1|200|452",
     "timestamp_local" => "20190815175019",
                "time" => 1565891419093647427,
         "requesttype" => "REQUEST",
            "clientip" => "14.56.55.120",
               "bytes" => "452",
            "@version" => "1",
            "duration" => "9599",
                "tags" => [
        [0] "_grokparsefailure"
    ]
}

And I explained why in my previous post: the later groks fail because resource_path does not exist for that message.

Since we made the resource_path optional, is it expected to throw an error if it does not exist? Is there any way we can parse the repos and get rid of the error tag?
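
The optional group only lets the grok on [resource] succeed; the later groks that read resource_path still add the tag when that field is absent. If tagging those events is not wanted, one option (a sketch, not the only approach) is to set tag_on_failure to an empty array on the groks that are allowed to miss:

```
# Extract resource name
grok {
  break_on_match => false
  match => { "resource_path" => "(?<resource_name>[^/]+$)" }
  # An empty list means no tag is added when this grok does not match.
  tag_on_failure => []
}
# Extract file extension
grok {
  break_on_match => false
  match => { "resource_path" => "(?<resource_type>[^.]+$)" }
  tag_on_failure => []
}
```

The trade-off is that genuine mismatches in these two groks are no longer flagged, so it is worth leaving the default tag on the groks that parse message and resource.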