Problem: Grok error in custom Filebeat module parsing a modified Nginx log

Hello All,

Long story made short: I've created a DNS black hole on my network for ad-server DNS resolution. It works perfectly; my DNS serves up a dummy internal address for every ad host in a list.

Now I want to start analysing these requests, so I created an Nginx logging configuration that looks like this, and it works perfectly:

log_format adblock '[$time_local] $remote_addr $host "$request"';
server {
        ...
        access_log /var/log/nginx/adblock_access.log adblock;
        ...
}

The log sample is:

[02/Jul/2018:13:45:51 +1000] 192.168.0.55 s.webtrends.com "GET /js/webtrends.replicate.js HTTP/1.1"
[02/Jul/2018:13:45:52 +1000] 192.168.0.55 cdn-gl.imrworldwide.com "GET /novms/js/2/ggcmb510.js HTTP/1.1"
[02/Jul/2018:13:45:52 +1000] 192.168.0.55 secure-au.imrworldwide.com "GET /v60.js HTTP/1.1"

So all good here.

Now I've cloned the nginx module and renamed most variables from nginx to adblock. All good: I'm seeing documents hit Elasticsearch, so all comms are working. I am NOT using Logstash; Filebeat ships straight to Elasticsearch.
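For reference, the cloned layout looks roughly like this (a sketch from memory of the stock 6.x nginx module structure; the config file name is an assumption):

module/adblock/
  access/
    manifest.yml               # declares the fileset vars and references ingest/default.json via ingest_pipeline
    config/adblock-access.yml  # the log input template (paths, exclude patterns, etc.)
    ingest/default.json        # the Elasticsearch ingest pipeline holding the grok patterns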

But I have an issue: I get a grok error from Filebeat, which shows up in Elasticsearch. When I test my grok pattern in Kibana Dev Tools, it all works, but when I move it over to the default.json file for ingestion, it fails.

OK, so here is the Kibana Dev Tools test.

Everything works in Dev Tools.
FYI, here is the grok pattern as text:

\[%{HTTPDATE:adblock.access.time}\] %{IP:adblock.access.requester} %{DATA:adblock.access.host} "%{GREEDYDATA:adblock.access.info}"
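For completeness, the Dev Tools test was along these lines (a minimal sketch using the standard _simulate endpoint, with one of the sample log lines above as the test document):

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [{
      "grok": {
        "field": "message",
        "patterns": [
          "\\[%{HTTPDATE:adblock.access.time}\\] %{IP:adblock.access.requester} %{DATA:adblock.access.host} \"%{GREEDYDATA:adblock.access.info}\""
        ]
      }
    }]
  },
  "docs": [{
    "_source": {
      "message": "[02/Jul/2018:13:45:51 +1000] 192.168.0.55 s.webtrends.com \"GET /js/webtrends.replicate.js HTTP/1.1\""
    }
  }]
}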

But when I transpose this to the default.json file, adding the appropriate escapes as required, it becomes this:

  "description": "Pipeline for parsing Tailored Nginx access logs used for adblocking.",
  "processors": [{
    "grok": {
      "field": "message",
      "patterns":[
        "\\[%{HTTPDATE:adblock.access.time}\\] %{IP:adblock.access.requester} %{DATA:adblock.access.host} \"%{GREEDYDATA:adblock.access.info}\""
        ],

I end up with a grok error on every document, but I can't track down what I did wrong.
error.message Provided Grok expressions do not match field value: [[01/Jul/2018:21:48:53 +1000] 192.168.0.62 t.dailymail.co.uk \"POST /s/sa HTTP/1.1\"]

I can put the same message into Dev Tools and it groks just fine.

Any Ideas?

Side note: here is a sample query in Dev Tools that shows the issue.

GET /filebeat-6.3.0-*/_search
{
  "query": {
    "match": { "fileset.module": "adblock" }
  },
  "size": 1
}

Result:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 661,
    "max_score": 0.0111235855,
    "hits": [
      {
        "_index": "filebeat-6.3.0-2018.07.02",
        "_type": "doc",
        "_id": "ksLRWWQBU2zibu4UUVJD",
        "_score": 0.0111235855,
        "_source": {
          "offset": 205,
          "prospector": {
            "type": "log"
          },
          "source": "/var/log/nginx/adblock_access.log",
          "message": """[02/Jul/2018:10:24:08 +1000] 192.168.0.55 cdn-gl.imrworldwide.com "GET /novms/js/2/ggcmb510.js HTTP/1.1"""",
          "fileset": {
            "module": "adblock",
            "name": "access"
          },
          "error": {
            "message": """Provided Grok expressions do not match field value: [[02/Jul/2018:10:24:08 +1000] 192.168.0.55 cdn-gl.imrworldwide.com \"GET /novms/js/2/ggcmb510.js HTTP/1.1\"]"""
          },
          "input": {
            "type": "log"
          },
          "@timestamp": "2018-07-02T07:07:26.495Z",
          "beat": {
            "hostname": "dns",
            "name": "dns",
            "version": "6.3.0"
          },
          "host": {
            "name": "dns"
          }
        }
      }
    ]
  }
}

OK, I've pared the grok down to a simple:

  "processors": [{
    "grok": {
      "field": "message",
      "patterns":[
        "(?<adblock.access.requester>[0-9]+)"
        ],
      "ignore_missing": true
    }
  }]

This fails as well. It's pure regexp, no grok patterns at all; I'm only looking for a single number anywhere in the string, and even that fails.

I'm clearly missing something when it comes to grok in Filebeat modules. Or it's broken.

Hi, using the log you've provided, and also the line from the error message, I've composed this log file:

[02/Jul/2018:13:45:51 +1000] 192.168.0.55 s.webtrends.com "GET /js/webtrends.replicate.js HTTP/1.1"
[02/Jul/2018:13:45:52 +1000] 192.168.0.55 cdn-gl.imrworldwide.com "GET /novms/js/2/ggcmb510.js HTTP/1.1"
[02/Jul/2018:13:45:52 +1000] 192.168.0.55 secure-au.imrworldwide.com "GET /v60.js HTTP/1.1"
[01/Jul/2018:21:48:53 +1000] 192.168.0.62 t.dailymail.co.uk "POST /s/sa HTTP/1.1"

I've registered the following ingest pipeline in Elasticsearch:

PUT _ingest/pipeline/adblock
{
  "description": "Pipeline for parsing Tailored Nginx access logs used for adblocking.",
  "processors": [{
    "grok": {
      "field": "message",
      "patterns":[
        "\\[%{HTTPDATE:adblock.access.time}\\] %{IP:adblock.access.requester} %{DATA:adblock.access.host} \"%{GREEDYDATA:adblock.access.info}\""
        ]
    }
  }]
}

And, with pipeline: adblock set in filebeat.yml, all the events are correctly parsed by Elasticsearch:
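For reference, that setting goes under the Elasticsearch output in filebeat.yml (a minimal sketch; the hosts value is an assumption):

output.elasticsearch:
  hosts: ["localhost:9200"]  # assumption: point this at your cluster
  pipeline: adblock          # run every event through the adblock ingest pipeline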

{
  "_index": "filebeat-7.0.0-alpha1-2018.07.02",
  "_type": "doc",
  "_id": "LzQFW2QBqVyKDDXa_GOA",
  "_version": 1,
  "_score": null,
  "_source": {
    "offset": 0,
    "prospector": {
      "type": "log"
    },
    "source": "/Users/adrian/go/src/github.com/elastic/beats/filebeat/sample.log",
    "message": "[02/Jul/2018:13:45:51 +1000] 192.168.0.55 s.webtrends.com \"GET /js/webtrends.replicate.js HTTP/1.1\"",
    "input": {
      "type": "log"
    },
    "adblock": {
      "access": {
        "requester": "192.168.0.55",
        "host": "s.webtrends.com",
        "time": "02/Jul/2018:13:45:51 +1000",
        "info": "GET /js/webtrends.replicate.js HTTP/1.1"
      }
    },
    "@timestamp": "2018-07-02T12:44:35.258Z",
    "host": {
      "name": "Adrian-Elastic.local"
    },
    "beat": {
      "hostname": "Adrian-Elastic.local",
      "name": "Adrian-Elastic.local",
      "version": "7.0.0-alpha1"
    }
  },
  "fields": {
    "@timestamp": [
      "2018-07-02T12:44:35.258Z"
    ]
  },
  "sort": [
    1530535475258
  ]
}
{
  "_index": "filebeat-7.0.0-alpha1-2018.07.02",
  "_type": "doc",
  "_id": "MTQFW2QBqVyKDDXa_GOA",
  "_version": 1,
  "_score": null,
  "_source": {
    "offset": 205,
    "prospector": {
      "type": "log"
    },
    "source": "/Users/adrian/go/src/github.com/elastic/beats/filebeat/sample.log",
    "message": "[02/Jul/2018:13:45:52 +1000] 192.168.0.55 secure-au.imrworldwide.com \"GET /v60.js HTTP/1.1\"",
    "input": {
      "type": "log"
    },
    "adblock": {
      "access": {
        "requester": "192.168.0.55",
        "host": "secure-au.imrworldwide.com",
        "time": "02/Jul/2018:13:45:52 +1000",
        "info": "GET /v60.js HTTP/1.1"
      }
    },
    "@timestamp": "2018-07-02T12:44:35.258Z",
    "beat": {
      "hostname": "Adrian-Elastic.local",
      "name": "Adrian-Elastic.local",
      "version": "7.0.0-alpha1"
    },
    "host": {
      "name": "Adrian-Elastic.local"
    }
  },
  "fields": {
    "@timestamp": [
      "2018-07-02T12:44:35.258Z"
    ]
  },
  "sort": [
    1530535475258
  ]
}
{
  "_index": "filebeat-7.0.0-alpha1-2018.07.02",
  "_type": "doc",
  "_id": "MjQFW2QBqVyKDDXa_GOA",
  "_version": 1,
  "_score": null,
  "_source": {
    "offset": 297,
    "prospector": {
      "type": "log"
    },
    "source": "/Users/adrian/go/src/github.com/elastic/beats/filebeat/sample.log",
    "message": "[01/Jul/2018:21:48:53 +1000] 192.168.0.62 t.dailymail.co.uk \"POST /s/sa HTTP/1.1\"",
    "input": {
      "type": "log"
    },
    "adblock": {
      "access": {
        "requester": "192.168.0.62",
        "host": "t.dailymail.co.uk",
        "time": "01/Jul/2018:21:48:53 +1000",
        "info": "POST /s/sa HTTP/1.1"
      }
    },
    "@timestamp": "2018-07-02T12:44:35.258Z",
    "beat": {
      "hostname": "Adrian-Elastic.local",
      "name": "Adrian-Elastic.local",
      "version": "7.0.0-alpha1"
    },
    "host": {
      "name": "Adrian-Elastic.local"
    }
  },
  "fields": {
    "@timestamp": [
      "2018-07-02T12:44:35.258Z"
    ]
  },
  "sort": [
    1530535475258
  ]
}

I think there is a problem with your modifications to the nginx module; it must be using a different pipeline. Can you paste the output of GET /_ingest/pipeline?

That's a huge dump in my environment, over 7000 characters, which exceeds this forum's limit.

Is there a way I can limit it to the relevant information?
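If it helps, the get-pipeline API accepts wildcards in the pipeline ID, so something like this should narrow it down (the ID pattern here is a guess based on the module name):

GET _ingest/pipeline/filebeat-6.3.0-adblock*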

There is no pipeline statement in my /etc/filebeat/filebeat.yml at present, so I'm not sure where this would go.

I'm running 6.3 and I see nothing in the docs pointing me to the proper section for it.

OK, there seems to be documentation for putting this into the output section.

However, I still need to process regular Filebeat events from the other modules, and there also seems to be an if/else sort of construct.

Is it possible you can give me an example of the syntax that would enable both the normal modules and my new adblock?
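Something like this is what I'm imagining (a sketch based on the conditional pipelines documentation; the hosts value is an assumption, and my understanding of the fallback behaviour may be off):

output.elasticsearch:
  hosts: ["localhost:9200"]  # assumption: point this at your cluster
  pipelines:
    - pipeline: adblock      # route my adblock module's events to the adblock pipeline
      when.equals:
        fileset.module: adblock
  # as I understand it, events matching no condition fall through to the modules' normal handling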

No, don't mess with filebeat.yml. I did it that way to avoid having to create a module myself.

Can you paste your full ingest/default.json?

Here:

{
  "description": "Pipeline for parsing Tailored Nginx access logs used for adblocking.",
  "processors": [{
    "grok": {
      "field": "message",
      "patterns":[
        "\\[%{HTTPDATE:adblock.access.time}\\] %{IP:adblock.access.requester} %{DATA:adblock.access.host} \"%{GREEDYDATA:adblock.access.info}\""
        ],
      "pattern_definitions": {
        "IP_LIST": "%{IP}(\"?,?\\s*%{IP})*"
      },
      "ignore_missing": true
    }
  }, {
    "grok": {
      "field": "adblock.access.info",
      "patterns": [
          "%{WORD:adblock.access.method} %{DATA:adblock.access.url} HTTP/%{NUMBER:adblock.access.http_version}"      ],
      "ignore_missing": true
    }
  }, {
    "remove": {
      "field": "adblock.access.info"
    }
  }, {
    "script": {
      "lang": "painless",
      "inline": "boolean isPrivate(def ip) { try { StringTokenizer tok = new StringTokenizer(ip, '.'); int firstByte = Integer.parseInt(tok.nextToken());       int secondByte = Integer.parseInt(tok.nextToken());       if (firstByte == 10) {         return true;       }       if (firstByte == 192 && secondByte == 168) {         return true;       }       if (firstByte == 172 && secondByte >= 16 && secondByte <= 31) {         return true;       }       if (firstByte == 127) {         return true;       }       return false;     } catch (Exception e) {       return false;     }   }   def found = false;   for (def item : ctx.adblock.access.remote_ip_list) {     if (!isPrivate(item)) {       ctx.adblock.access.remote_ip = item;       found = true;       break;     }   }   if (!found) {     ctx.adblock.access.remote_ip = ctx.adblock.access.remote_ip_list[0];   }"
      }
  }, {
    "remove":{
      "field": "message"
    }
  }, {
    "rename": {
      "field": "@timestamp",
      "target_field": "read_timestamp"
    }
  }, {
    "date": {
      "field": "adblock.access.time",
      "target_field": "@timestamp",
      "formats": ["dd/MMM/YYYY:H:m:s Z"]
    }
  }, {
    "remove": {
      "field": "adblock.access.time"
    }
  }],
  "on_failure" : [{
    "set" : {
      "field" : "error.message",
      "value" : "{{ _ingest.on_failure_message }}"
    }
  }]
}

The IP_LIST is pointless; I will nuke it.

I reproduced your steps, copying the nginx module into a new adblock module, and everything is working fine: the ingest pipeline is loaded automatically and there are no grok parsing errors.

Is the module enabled?

Is your module's manifest.yml pointing to the right default.json?

Yep, the module is enabled.

And yes, the correct default.json is referenced.

The issue gets odder. I tried a version of the grok pattern that was simply a regexp looking for any number, with no grok patterns at all, and that failed as well.

I'm developing a suspicion that the Filebeat install is somehow corrupt. This host has had Filebeat on it since a very early version, upgraded over time. I'm going to tar up the files, purge it from the system, and re-install Filebeat with clean configs.

OK, I have done a clean install of Filebeat.

I've configured it to talk to my elasticsearch node.

I cloned the nginx module, replacing all nginx references with adblock.

I enabled the new module in /etc/filebeat/modules.d

The ONLY module enabled is my new one. I know it is working because the logs clearly show it's ingesting /var/log/nginx/adblock_access.log, and I see the documents in Kibana.

I modified default.json to include my pattern and double-checked that adblock has replaced nginx everywhere.

But I still get a grok error, and again the error does not occur when I test the pattern in Dev Tools.

Can you send me your version of default.json so I can compare? I'm seriously at a loss as to what the hell is going on.

My default.json with a very simple grok pattern still doesn't work. Same error as above:

{
  "description": "Pipeline for parsing Nginx access logs. Requires the geoip and user_agent plugins.",
  "processors": [{
    "grok": {
      "field": "message",
      "patterns":[
        "%{GREEDYDATA:adblock.access.info}"
        ],
      "ignore_missing": true
    }
  }, {
    "grok": {
      "field": "adblock.access.info",
      "patterns": [
          "%{WORD:adblock.access.method} %{DATA:adblock.access.url} HTTP/%{NUMBER:adblock.access.http_version}",
          ""
      ],
      "ignore_missing": true
    }
  }, {
    "remove": {
      "field": "adblock.access.info"
    }
  }, {
    "split": {
      "field": "adblock.access.remote_ip_list",
      "separator": "\"?,?\\s+"
    }
  }, {
    "script": {
      "lang": "painless",
      "inline": "boolean isPrivate(def ip) { try { StringTokenizer tok = new StringTokenizer(ip, '.'); int firstByte = Integer.parseInt(tok.nextToken());       int secondByte = Integer.parseInt(tok.nextToken());       if (firstByte == 10) {         return true;       }       if (firstByte == 192 && secondByte == 168) {         return true;       }       if (firstByte == 172 && secondByte >= 16 && secondByte <= 31) {         return true;       }       if (firstByte == 127) {         return true;       }       return false;     } catch (Exception e) {       return false;     }   }   def found = false;   for (def item : ctx.adblock.access.remote_ip_list) {     if (!isPrivate(item)) {       ctx.adblock.access.remote_ip = item;       found = true;       break;     }   }   if (!found) {     ctx.adblock.access.remote_ip = ctx.adblock.access.remote_ip_list[0];   }"
      }
  }, {
    "remove":{
      "field": "message"
    }
  }, {
    "rename": {
      "field": "@timestamp",
      "target_field": "read_timestamp"
    }
  }, {
    "date": {
      "field": "adblock.access.time",
      "target_field": "@timestamp",
      "formats": ["dd/MMM/YYYY:H:m:s Z"]
    }
  }, {
    "remove": {
      "field": "adblock.access.time"
    }
  }, {
    "user_agent": {
      "field": "adblock.access.agent",
      "target_field": "adblock.access.user_agent"
    }
  }, {
    "remove": {
      "field": "adblock.access.agent"
    }
  }, {
    "geoip": {
      "field": "adblock.access.remote_ip",
      "target_field": "adblock.access.geoip"
    }
  }],
  "on_failure" : [{
    "set" : {
      "field" : "error.message",
      "value" : "{{ _ingest.on_failure_message }}"
    }
  }]
}

SOLVED

The secret sauce is... drum roll...

You have to delete the ingest pipeline in Elasticsearch in order for a modified one to be loaded. I had the right grok pattern the whole time; the issue was that the very first time Filebeat loaded default.json, I still had the old nginx pattern in there.

Wow, this really needs better documentation, because it was NOT AT ALL obvious.

What I should have been doing from the start was running this command in Dev Tools each and every time I modified the ingest pipeline:
DELETE _ingest/pipeline/filebeat-6.3.0-adblock-access-default
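(To confirm the freshly loaded pipeline actually contains the new pattern, it can be fetched back by the same ID:)

GET _ingest/pipeline/filebeat-6.3.0-adblock-access-default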

I would like to thank everyone who helped here and over chat to narrow this down. I finally figured it out when I saw a hint at the issue in a video.
