Grok parse error on custom apache log file

Hello,

I'm trying to set up a filter for a custom Apache log. The log has two different types of lines. The first looks like this:

www.bbb.aaa.mil 33.44.112.28 - - [13/Sep/2015:04:02:37 -0400] "HEAD /www/default.htm HTTP/1.1" 302 - "http://www.fullerton.edu/ord/resources/federal-agencies-list.asp" "Mozilla/4.0 (compatiable; MSIE 7.0; Windows NT 5.1) HiScan"

and the other type of line is like this, which has two ips following the hostname:

www.bbb.aaa.mil  11.22.33.44,33.44.112.28 - - [13/Sep/2015:04:02:37 -0400] "HEAD /www/default.htm HTTP/1.1" 302 - "http://www.fullerton.edu/ord/resources/thelist.asp" "Mozilla/4.0 (compatiable; MSIE 7.0; Windows NT 5.1) HiScan"

I used a grok debugger to help me figure out how to parse the line. It suggested this combination:

%{URIHOST} %{IP}, %{COMBINEDAPACHELOG}

To troubleshoot, I've been running Logstash with stdin and this conf file:

input { stdin { } }

filter {
  grok {
    match => { "message" => "%{URIHOST} %{IP}, %{COMBINEDAPACHELOG}" }
  }
  date {
    match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}

output {
  elasticsearch { host => localhost }
  stdout { codec => rubydebug }
}

But I still get a grok parse failure. This is what it actually returns:

{
       "message" => "www.bbb.aaa.mil 33.44.112.28 - - [13/Sep/2015:04:02:37 -0400] \"HEAD /www/default.htm HTTP/1.1\" 302 - \"http://www.fullerton.edu/ord/resources/federal-agencies-list.asp\" \"Mozilla/4.0 (compatiable; MSIE 7.0; Windows NT 5.1) HiScan\"",
      "@version" => "1",
    "@timestamp" => "2015-09-21T13:15:11.896Z",
          "host" => "happy.cnn.abc.nz",
          "tags" => [
        [0] "_grokparsefailure"
    ]
}

Anyone have any ideas on how to remedy the parse error?

Thank you

Logstash Version: 1.5.4

Dave

Your pattern only works if there are two IP addresses present, but the input line that fails only has one. To solve this you can e.g. specify two patterns and grok will try them in order and use the first that matches.

The following works and captures the virtual host and both client IP addresses:

filter {
  grok {
    match => {
      "message" => [
        "%{URIHOST:vhost} +%{COMBINEDAPACHELOG}",
        "%{URIHOST:vhost} +%{IP:clientip},%{COMBINEDAPACHELOG}"
      ]
    }
  }
}
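
As a rough illustration of the "try each pattern in order, use the first that matches" behaviour, here is a small Python sketch. The regular expressions are simplified stand-ins, not the real URIHOST, IP or COMBINEDAPACHELOG definitions, and the names first_match, PATTERNS and proxyip are made up for this example (the grok config above stores both addresses under clientip).

```python
import re

# Simplified stand-ins for the grok patterns (assumptions for illustration only).
IP = r"\d{1,3}(?:\.\d{1,3}){3}"
REST = r"(?P<clientip>" + IP + r") - - \[.*"  # crude stand-in for COMBINEDAPACHELOG

# Patterns are tried in list order, like the array given to grok's match option.
PATTERNS = [
    ("single_ip", re.compile(r"(?P<vhost>\S+) +" + REST)),
    ("double_ip", re.compile(r"(?P<vhost>\S+) +(?P<proxyip>" + IP + r")," + REST)),
]


def first_match(line):
    """Return (pattern_name, captures) for the first pattern that matches, else None."""
    for name, regex in PATTERNS:
        m = regex.match(line)
        if m:
            return name, m.groupdict()
    return None


single = 'www.bbb.aaa.mil 33.44.112.28 - - [13/Sep/2015:04:02:37 -0400] "HEAD / HTTP/1.1" 302 -'
double = 'www.bbb.aaa.mil  11.22.33.44,33.44.112.28 - - [13/Sep/2015:04:02:37 -0400] "HEAD / HTTP/1.1" 302 -'

print(first_match(single))
print(first_match(double))
```

A line with one address is claimed by the first pattern; a line with two addresses fails the first pattern at the comma and falls through to the second.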

Magnus,

Thank you for responding. I've been trying to get a handle on the grok pattern format. With this in mind, instead of using the built-in grok pattern %{COMBINEDAPACHELOG} I'm trying to create my own, just to verify my understanding. With my own patterns I still get grok parse failures in the config test I'm running. Could you look at my patterns and tell me where I'm going wrong?

Here they are:

DoubleIP
"%{HOST:ServerName} %{IP}, %{IP} %{HTTPDATE} %{QUOTEDSTRING:RequestFirstLine} %{POSINT:HTTPStatus} %{URI:Referrer} %{QUOTEDSTRING:UserAgent}"

SingleIP
"%{HOST:ServerName} %{IP} %{HTTPDATE} %{QUOTEDSTRING:RequestFirstLine} %{POSINT:HTTPStatus} %{URI:Referrer} %{QUOTEDSTRING:UserAgent}"

Pay attention to the whitespace. Your example log entries have no space after the comma that separates the two IP addresses, but there are two spaces after the hostname (i.e. before the IP addresses).
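
To make the whitespace point concrete, here is a minimal Python sketch. The IP expression is only a rough stand-in for grok's %{IP}, and the variable names are invented for this example. It compares the spacing used in the DoubleIP attempt with the spacing in the sample line.

```python
import re

IP = r"\d{1,3}(?:\.\d{1,3}){3}"  # rough stand-in for grok's %{IP} (assumption)

# Start of the two-address sample line: two spaces after the host, no space after the comma.
line = "www.bbb.aaa.mil  11.22.33.44,33.44.112.28 - -"

# Whitespace as written in the DoubleIP attempt: one space after the host,
# and a comma followed by a space between the addresses.
strict = re.compile(r"\S+ " + IP + ", " + IP)

# Same fields, but with the whitespace the sample line actually contains.
adjusted = re.compile(r"\S+  " + IP + "," + IP)

strict_ok = bool(strict.match(line))
adjusted_ok = bool(adjusted.match(line))
print(strict_ok, adjusted_ok)
```

Literal spaces in a pattern have to line up exactly with the input, so a single extra or missing space is enough to make the whole match fail.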

I feel like I'm close to understanding the syntax. But I don't get why you have the + sign in front of %{COMBINEDAPACHELOG} in the first pattern and in front of %{IP:clientip},%{COMBINEDAPACHELOG} in the second. Do you need the plus sign to handle spaces in the expression?

In regular expressions plus signs mean "one or more occurrences of the preceding token". In this case the preceding token is a space, so it's a way to be more lax about the number of spaces and handle both one and two (and ten) occurrences of them.
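
A quick way to see this is with Python's re module, which treats + the same way for this simple case. The helper name matches_with_spaces and the "host"/"ip" placeholders are made up for this sketch.

```python
import re

# "host", then one or more spaces, then "ip". The + applies to the space before it.
pattern = re.compile(r"host +ip")


def matches_with_spaces(count):
    """Check whether 'host' and 'ip' separated by `count` spaces match the pattern."""
    text = "host" + " " * count + "ip"
    return pattern.fullmatch(text) is not None


for count in (0, 1, 2, 10):
    print(count, matches_with_spaces(count))
```

Zero spaces does not match because + requires at least one occurrence; one, two and ten spaces all match.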