Difference between filebeat parsing & stdin parsing of the same log format

Hi there,

I am having some difficulty with a custom log format when sending it into Logstash via the beats input. If I send the same log line into Logstash via stdin (with the same filters as the beats input), the event is parsed perfectly.

Both pipelines get the event into Elasticsearch, but the filebeat pipeline fails to parse the log properly and adds a _grokparsefailure tag to the event, whereas the stdin pipeline parses it perfectly.

This is the log line in question:

2016-06-11T16:28:33+10:00 60.241.43.104 200 GET /test.gif?v=1&_v=j44&a=1565401534&t=pageview&_s=1&dl=https%3A%2F%2Fwww.example.com%2F&ul=en-gb&de=UTF-8&dt=Title&sd=24-bit&sr=1280x720&vp=1263x614&je=0&fl=21.0%20r0&_u=QCCAAEABI~&jid=1451351253&cid=855412598.1465456935&tid=UA-28513095-1&_r=1&z=1190582191 HTTP/1.1 https://www.example.com/ "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36"

This is my filebeat.yml config:

filebeat:
  prospectors:
    -
      paths:
        - /var/log/nginx/custom.log
      input_type: log
      document_type: cx-gacm-log
output:
  logstash:
    hosts: ["127.0.0.1:31312"]
  console:
    pretty: true
shipper:
logging:
  files:
    path: /var/log/gacm-filebeat
    rotateeverybytes: 10485760 # = 10MB
  level: debug

Here is the filebeat logstash pipeline config:

input {
    beats {
        port => 31312
    }
}
filter {

  if [type] == "cx-gacm-log" {
    grok {
      match => {
          "message" => "%{TIMESTAMP_ISO8601:timestamp} %{IPORHOST:remote_ip} %{NUMBER:status_code} %{WORD:request_action} %{URIPATH:collector_path}(?:%{URIPARAM:request_params})? HTTP/%{NUMBER:http_version} %{URIPROTO:referrer_proto}://(?:%{USER}(?::[^@]*)?@)?(?:%{URIHOST:referrer_host})?(?:%{URIPATHPARAM})? \"%{DATA:agent}\""
        }
    }
    date {
      match => [ "timestamp", "ISO8601" ]
    }
    geoip {
      source => "remote_ip"
      target => "geoip"
    }
    useragent {
      source => "agent"
      target => "user_agent"
    }
    # key value split for the parameters
    kv {
      field_split => "&?"
      source => "request_params"
    }
    mutate {
        #remove the fields we don't want
        remove_field => ["message", "request_params", "agent"]
    }
    urldecode {
      all_fields => true
    }
  }
}

output {
    elasticsearch {
        hosts => "localhost"
        index => "%{[@metadata][type]}-%{+YYYY.MM.dd}"
        document_type => "%{[@metadata][type]}"
    }
}

I use a different config file for the stdin pipeline just to make things a little easier, but it is essentially the same.

Here is the stdin pipeline config:

input {
    stdin {
        type => "cx-gacm-stdin"
    }
}
filter {

  if [type] == "cx-gacm-stdin" {
    grok {
      match => {
        "message" => "%{TIMESTAMP_ISO8601:timestamp} %{IPORHOST:remote_ip} %{NUMBER:status_code} %{WORD:request_action} %{URIPATH:collector_path}(?:%{URIPARAM:request_params})? HTTP/%{NUMBER:http_version} %{URIPROTO:referrer_proto}://(?:%{USER}(?::[^@]*)?@)?(?:%{URIHOST:referrer_host})?(?:%{URIPATHPARAM})? \"%{DATA:agent}\""
      }
    }

    date {
      match => [ "timestamp", "ISO8601" ]
    }

    geoip {
      source => "remote_ip"
      target => "geoip"
    }

    useragent {
      source => "agent"
      target => "user_agent"
    }

    # key value split for the parameters
    kv {
      field_split => "&?"
      source => "request_params"
    }

    mutate {
        #remove the fields we don't want
        remove_field => ["message", "request_params", "agent"]
    }
    urldecode {
      all_fields => true
    }
  }
}

output {
    elasticsearch {
        hosts => "localhost"
        index => "%{[@metadata][type]}-%{+YYYY.MM.dd}"
        document_type => "%{[@metadata][type]}"
    }
    if [type] == "cx-gacm-stdin" {
        stdout {
            codec => rubydebug
        }
    }
}

Grokdebug also validates and parses the log line fine, so I am thinking it is something to do with filebeat, though I did not think there was anything special in the setup.

Just to fill out the question: I am using the latest release versions of the whole stack (not 5.x), running on an Ubuntu 14.04 box with plenty of resources. The log is a custom nginx log format, and it is just logging a GET on a single gif file.

You haven't defined the type on the beats input from what I can see.

Hi Mark,

Thanks for the answer. :slight_smile:

However, this does not solve the problem - I updated the beats input to have the correct type, and I still get the issue.
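For reference, this is roughly what the beats input looks like now (a sketch of the change, not the full file):

    input {
        beats {
            port => 31312
            # Added per the suggestion above. Note that filebeat's document_type
            # already sets [type] on the event, and an input will not override an
            # existing type, so this may well be redundant.
            type => "cx-gacm-log"
        }
    }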

Here is a single result from a sense query (sorting the timestamp value desc):

{
        "_index": "cx-gacm-log-2016.06.13",
        "_type": "cx-gacm-log",
        "_id": "AVVHgKgmNSUF38Jp_YCb",
        "_score": null,
        "_source": {
          "message": "2016-06-13T11:57:44+10:00 59.100.20.201 200 GET /gacm.gif?v=1&_v=j44&a=290976217&t=pageview&_s=1&dl=https%3A%2F%2Fwww.example.com.au%2F&ul=en-gb&de=UTF-8&dt=Musings%20of%20a%20tired%20man&sd=24-bit&sr=1280x720&vp=1263x658&je=0&fl=21.0%20r0&_u=QCCAAEABI~&jid=&cid=855412598.1465456935&tid=UA-28513095-1&z=1470696299 HTTP/1.1 https://www.example.com.au/ \"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36\"",
          "@version": "1",
          "@timestamp": "2016-06-13T02:04:18.102Z",
          "source": "/var/log/nginx/collector-analytics.log",
          "type": "cx-gacm-log",
          "count": 1,
          "beat": {
            "hostname": "continuity",
            "name": "continuity"
          },
          "offset": 15941,
          "input_type": "log",
          "fields": null,
          "host": "continuity",
          "tags": [
            "beats_input_codec_plain_applied",
            "_grokparsefailure"
          ]
        },
        "sort": [
          1465783458102
        ]
      }

Now, the message does have escaped quotes around the user agent portion of the log line. Would that cause an issue, and if so, how do I prevent it from occurring? Is it a character encoding thing? Should I swap to single quotes? And if the latter, how does one manage this in a scenario where they do not control the log format?

The beats protocol encodes events in JSON, and I suspect this is what caused the escaped quotes. I agree that it could be problematic if this changes the log line.

One could naturally try to url-decode the message string, but that has the potential to further change the log line, e.g. if parts of the original string were already url-encoded.
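One way to check whether the quotes are really escaped in the event itself (rather than just being JSON string escaping in the Sense output) would be to temporarily add a rubydebug stdout output to the beats pipeline, mirroring what your stdin pipeline already does, for example:

    output {
        # Temporary debugging output: prints events from the beats input exactly as
        # Logstash sees them, so any literal backslashes in [message] will be visible.
        if [type] == "cx-gacm-log" {
            stdout {
                codec => rubydebug
            }
        }
    }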

Yeah, OK, interesting. I am not sure URL decoding will be the way to go, as I already have portions of the log line that are URL-encoded (the URI parameters).

So, from filebeat, how is this handled with standard Apache logs (and indeed in the grok examples for nginx log parsing on GitHub)? I have a very similar use case, where a portion of the log is wrapped in double quotes; it seems strange to me that this would be a problem, considering that particular use case.
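For comparison, my understanding is that the stock patterns capture the quoted user agent with %{QS} (the quoted-string pattern, as used by COMBINEDAPACHELOG) rather than with literal escaped quotes, so presumably the tail of my pattern could be written something like this instead (untested on my side):

    grok {
      match => {
        # Same pattern as above, but the quoted agent is captured with %{QS:agent}.
        # Note that %{QS} keeps the surrounding quotes in the captured field,
        # unlike \"%{DATA:agent}\", so downstream filters would see them.
        "message" => "%{TIMESTAMP_ISO8601:timestamp} %{IPORHOST:remote_ip} %{NUMBER:status_code} %{WORD:request_action} %{URIPATH:collector_path}(?:%{URIPARAM:request_params})? HTTP/%{NUMBER:http_version} %{URIPROTO:referrer_proto}://(?:%{USER}(?::[^@]*)?@)?(?:%{URIHOST:referrer_host})?(?:%{URIPATHPARAM})? %{QS:agent}"
      }
    }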

Further to my earlier reply, Mark: the document_type is set in filebeat.yml, and my understanding is that this becomes the type field used for the filter conditional (i.e. it is the type set within the actual event data). Is this correct?
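In other words, my assumption is that it is this pairing that drives the conditional:

    # filebeat.yml (filebeat 1.x): document_type sets the "type" field on each event
    document_type: cx-gacm-log

    # logstash filter: the conditional tests that same field
    if [type] == "cx-gacm-log" {
      ...
    }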

OK, so after investigating this issue all evening, I was tailing the syslog for any messages from filebeat and discovered that it had a problem sending a log line, reporting that there was an invalid token.

That got me thinking about how I am forming the log line in nginx, and there seems to have been a character in the log format that was causing the issue; I am not sure what it was, maybe a tab somewhere. Presumably, when I was copying and pasting the line around, that character got dropped or converted, which is why the grok parser had no problems when the same line went through the stdin pipeline.
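For anyone hitting the same thing, double-checking the nginx log_format definition for stray characters (tabs and the like) between fields is where I would look. Something along these lines (illustrative only; the format name and exact variables are my reconstruction, not my exact config) produces a line matching the grok pattern above:

    # Illustrative log_format roughly matching the sample line at the top of this post,
    # with fields separated by plain single spaces.
    log_format gacm '$time_iso8601 $remote_addr $status $request '
                    '$http_referer "$http_user_agent"';

    access_log /var/log/nginx/custom.log gacm;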

Thanks for everyone's time on this.