Parsing HTTP logs


(Amarender Kasireddy) #1

Hello All,
I'm trying to parse HTTP logs using filebeat and am running into some issues parsing the IP information.
We have the X-FORWARDED-FOR header enabled, which sends two IPs for some of the calls we get (below is a sample call).

12.234.21.234, 123.34.567.32 - - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond + "POST /hubcosmosint/DealerAdmin HTTP/1.1" 200 492 d-dummy-101:1234 "-" "-"

Below is the JSON pipeline we use to parse the data:
{
"description": "Parse HTTP Access Logs",
"processors": [
{
"grok" : {
"field" : "message",
"patterns" : [
"%{IPORHOST:client} -.*- \[%{HTTPDATE:ts}\] (?:RspTime\= %{NUMBER:timetaken} microsecond) %{NOTSPACE:connstatus} "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-) (?:%{HOSTNAME:server})(?:\:%{NUMBER:portnumber}) "%{DATA:referer}" "(?:%{DATA:UserAgent})""
]
}
}
]
}

This works fine for logs whose calls have a single IP, but for the logs whose calls have two IPs it is not able to parse the data.

I'm using the Grok debugger and see that the line below can be used to parse this data with two IPs:

(?%{IP}(, %{IP})*)

But I am unable to add this line to the JSON file without exceptions.

Can someone please help me with this?

Thanks,
Amar


(Steffen Siering) #2

Are these nginx logs?

Have a look at the nginx filebeat module: https://github.com/elastic/beats/blob/master/filebeat/module/nginx/access/ingest/default.json#L4

It uses grok to collect the list of IPs into nginx.access.remote_ip_list, plus the split processor to create an array. The painless script then iterates over the IP list to find the first non-local IP, which is stored as nginx.access.remote_ip and also used for GeoIP.
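A minimal sketch of that approach as a standalone ingest pipeline follows. The field names remote_ip_list, remote_ip, and rest are illustrative, not the actual nginx module field names, and the script here only skips "-" entries, whereas the real module also skips private/loopback ranges (on 5.x the script processor takes inline instead of source):

```json
{
  "description": "Sketch: collect comma-separated IPs, split into an array, pick the first usable one",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["^%{IP_LIST:remote_ip_list} %{GREEDYDATA:rest}"],
        "pattern_definitions": {
          "IP_LIST": "%{IPORHOST}(?:, %{IPORHOST})*"
        }
      }
    },
    {
      "split": {
        "field": "remote_ip_list",
        "separator": ", "
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "for (def ip : ctx.remote_ip_list) { if (ip != '-') { ctx.remote_ip = ip; break; } }"
      }
    }
  ]
}
```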


(Amarender Kasireddy) #3

Thanks for the response, Steffen.

These are custom http logs.

We were able to parse the log below using a custom grok pattern ("%{IPORHOST:client} -.*- [%{HTTPDATE:ts}] (?:RspTime= %{NUMBER:timetaken} microsecond) %{NOTSPACE:connstatus} "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-) (?:%{HOSTNAME:server})(?::%{NUMBER:portnumber}) "%{DATA:referer}" "(?:%{DATA:UserAgent})"").

123.34.567.32 - - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond + "POST /hubcosmosint/DealerAdmin HTTP/1.1" 200 492 d-dummy-101:1234 "-" "-"

Recently there was a change and multiple IP addresses started coming in the first field, separated by commas, and filebeat stopped parsing it.

Below is an example of the new log :

12.234.21.234, 123.34.567.32 - - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond + "POST /hubcosmosint/DealerAdmin HTTP/1.1" 200 492 d-dummy-101:1234 "-" "-"

I'm looking for a grok pattern to parse the above log.


(Steffen Siering) #4

Please properly format JSON, logs and configs using the </>-Button.

Have a look at the nginx module. Nginx logging can have multiple IPs, just as in your case. You just need to copy part of the nginx grok pattern (plus, optionally, the follow-up processing).


(Abhilash Usha) #5

Thanks Steffens

Following is what I did

Log the grok pattern parses:

172.27.81.113, 192.34.56.67 - - [07/Jan/2018:19:00:30 -0500] RspTime= 555 microsecond + "GET / HTTP/1.1" 200 3493 - "-" "-"

Log the grok pattern fails on:

- [07/Jan/2018:19:00:30 -0500] RspTime= 666 microsecond + "GET / HTTP/1.1" 600 6493 - "-" "-"

The grok pattern I have:

{
"description": "Parse HTTP Access Logs",
"processors": [
{
"grok" : {
"field" : "message",
"patterns" : [
"%{NOTSPACE:client} %{NOTSPACE:ident} %{NOTSPACE:auth} [%{HTTPDATE:ts}] (?:RspTime= %{NUMBER:timetaken} microsecond) %{NOTSPACE:connstatus} "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-) (?(?:%{HOSTNAME:server})(?::%{NUMBER:portnumber})|-) "(?:%{DATA:referer}|-)" "(?:%{DATA:UserAgent}|-)""
]
}
}
]
}

I tried (?:%{IPORHOST:client}|-) and %{NOTSPACE:client}, but I still face issues parsing the log whose first field is -.


(Steffen Siering) #6

@abhilashusha I'm not sure your grok pattern/problem matches the original discussion. The OP filtered out all fields before the HTTP timestamp, while you seem to want to parse those as well as deal with the IPs. It's a subtle difference, which warrants another discussion.


(Abhilash Usha) #7

The issue I'm having here is when the first field (the IP address) comes in empty (-). Everything works fine if the first field is a valid IP address or a comma-separated string of IPs.


(Abhilash Usha) #8

@steffens, we changed the pattern a little; please refer to the latest one.


(Steffen Siering) #9

I see.

Can you include the complete grok definition? Please format it using the </>-Button at the top of the text box (JSON/configs are easier to read when formatted).

You can have anything between 0 and N IPs. The IPORHOST pattern does not cover all of these cases.

I guess the logs always have a space character after the IP list, which can be either -, a single IP, or IP1, IP2, IP3, ...?

Some sample logs + grok (formatted for use with the simulate API) would be helpful; without the full picture it's somewhat hard to make any recommendation. Anyway, I would define OPT_ADDRESS_LIST to be (-|IPORHOST(, IPORHOST)*). This regular expression parses either - (the empty list) or a sequence of IPORHOST entries separated by ", ".

Also anchor the grok pattern with ^%{OPT_ADDRESS_LIST:client}.

There might be other errors in the grok pattern. When I'm unsure, I clear my grok pattern and construct it incrementally. E.g. start with patterns: ["^%{OPT_ADDRESS_LIST:client} .*"]. Once this works, add the next item to be parsed.
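The suggested OPT_ADDRESS_LIST definition can also be dry-run outside Elasticsearch. Here is a rough Python sketch, with a simplified dotted-quad stand-in for IPORHOST (grok actually compiles to Oniguruma regular expressions, but the matching logic is the same for these samples):

```python
import re

# Simplified stand-in for grok's IPORHOST: a dotted quad.
IP = r"\d{1,3}(?:\.\d{1,3}){3}"

# OPT_ADDRESS_LIST = (-|IPORHOST(, IPORHOST)*), anchored at the start of the
# line and followed by a space, as in "^%{OPT_ADDRESS_LIST:client} .*".
OPT_ADDRESS_LIST = re.compile(rf"^(-|{IP}(?:, {IP})*) ")

samples = [
    '- - - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond',
    '12.234.21.234 - - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond',
    '12.234.21.234, 123.34.567.32 - - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond',
]

for line in samples:
    m = OPT_ADDRESS_LIST.match(line)
    # Captures the client field: "-", one IP, or a comma-separated list.
    print(m.group(1))
```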


(Abhilash Usha) #10

Thanks for looking into the issue.

{
"description": "Parse HTTP Access Logs",
"processors": [
{
"grok" : {
"field" : "message",
"patterns" : [
"%{NOTSPACE:client} %{NOTSPACE:ident} %{NOTSPACE:auth} [%{HTTPDATE:ts}] (?:RspTime= %{NUMBER:timetaken} microsecond) %{NOTSPACE:connstatus} "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-) (?(?:%{HOSTNAME:server})(?::%{NUMBER:portnumber})|-) "(?:%{DATA:referer}|-)" "(?:%{DATA:UserAgent}|-)""
],
"pattern_definitions": {
"IP_LIST": "(%{IP}(, %{IP})*)"
},
"ignore_missing": true
}
}
]
}


(Abhilash Usha) #11

The logs to be parsed would be:

12.234.21.234, 123.34.567.32 - - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond + "POST /url1 HTTP/1.1" 200 492 d-dummy-101:1234 "-" "-"

12.234.21.234 - - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond + "POST /url2 HTTP/1.1" 200 492 d-dummy-101:1234 "-" "-"

- [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond + "POST /hubcosmosint/DealerAdmin HTTP/1.1" 200 492 d-dummy-101:1234 "-" "-"

(Abhilash Usha) #12

@steffens, I did try the pattern as you suggested, but that didn't help:

{
"description": "Parse HTTP Access Logs",
"processors": [
{
"grok" : {
"field" : "message",
"patterns" : [
"^%{IP_LIST:client} %{NOTSPACE:ident} %{NOTSPACE:auth} [%{HTTPDATE:ts}] (?:RspTime= %{NUMBER:timetaken} microsecond) %{NOTSPACE:connstatus} "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-) (?(?:%{HOSTNAME:server})(?::%{NUMBER:portnumber})|-) "(?:%{DATA:referer}|-)" "(?:%{DATA:UserAgent}|-)""
],
"pattern_definitions": {
"IP_LIST": "-|IPORHOST(, IPORHOST)*)"
},
"ignore_missing": true

}
}
]
}


(Steffen Siering) #13

Please use the </>-button to format your text, or a markdown fenced code block.
Indenting your code/logs doesn't help if no formatting is applied.

Have you had a look at the simulate API to run some tests and share code-snippets other users can actually run and modify?

While grok is based on regular expressions, you should not try to mix it with raw regular expressions (and create too much grouping). The way you configured your pattern, I'm not even sure a correct regular expression will be generated...

Testing with the simulate API, I found that using %{IP} multiple times doesn't work. A workaround is to capture each entry after the first as any non-space text instead of another %{IP}.

See below (can be run in the Kibana developer console):

POST _ingest/pipeline/_simulate
{
  "pipeline" : {
    "description": "Parse HTTP Access Logs",
    "processors": [
      {
        "grok": {
          "field" : "message",
          "patterns" : [
            "^%{IP_LIST:ips} %{ANY:rest}"
          ],
          "pattern_definitions": {
            "IP_LIST": "(?:-|(?:%{IPORHOST}(?:, [^\\s]+)*))",
            "ANY": "(?:.*)"
           }
        }
      }
    ]
  },
  "docs" : [
    { "_source": { "message": "12.234.21.234, 123.34.567.32 - - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond"} },
    { "_source": { "message": "12.234.21.234 - - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond"} },
    { "_source": { "message": "- - - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond"} }
  ]
}

This gets you:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_type": "_type",
        "_id": "_id",
        "_source": {
          "rest": "- - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond",
          "message": "12.234.21.234, 123.34.567.32 - - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond",
          "ips": "12.234.21.234, 123.34.567.32"
        },
        "_ingest": {
          "timestamp": "2018-01-18T15:18:45.967Z"
        }
      }
    },
    {
      "doc": {
        "_index": "_index",
        "_type": "_type",
        "_id": "_id",
        "_source": {
          "rest": "- - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond",
          "message": "12.234.21.234 - - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond",
          "ips": "12.234.21.234"
        },
        "_ingest": {
          "timestamp": "2018-01-18T15:18:45.967Z"
        }
      }
    },
    {
      "doc": {
        "_index": "_index",
        "_type": "_type",
        "_id": "_id",
        "_source": {
          "rest": "- - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond",
          "message": "- - - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond",
          "ips": "-"
        },
        "_ingest": {
          "timestamp": "2018-01-18T15:18:45.967Z"
        }
      }
    }
  ]
}
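Following the nginx module approach mentioned earlier, a split processor appended to the same pipeline would then turn the captured ips string into an array. Note that the "-" case becomes the single-element list ["-"], which a later script processor can skip:

```json
{
  "split": {
    "field": "ips",
    "separator": ", "
  }
}
```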

(Amarender Kasireddy) #14

Hello @steffens,
We tried the above pattern to list IPs but still couldn't parse the calls successfully.

We are running Elasticsearch 5.3.0 and using the same version of filebeat.
Since most of the updates online relate to v6.1, we tried upgrading our filebeat to 6.1.2.

After upgrading to 6.1.2, we are not able to parse any calls (even calls with a single IP).

Below is the error I see when parsing:
2018-02-01T11:38:20-05:00 WARN Can not index event (status=400): {"type":"mapper_parsing_exception","reason":"Failed to parse mapping [default]: Mapping definition for [error] has unsupported parameters: [properties : {code={type=long}, message={norms=false, type=text}, type={ignore_above=1024, type=keyword}}]","caused_by":{"type":"mapper_parsing_exception","reason":"Mapping definition for [error] has unsupported parameters: [properties : {code={type=long}, message={norms=false, type=text}, type={ignore_above=1024, type=keyword}}]"}}

I tried the same grok pattern we used for 5.3.0; it works on 5.3.0, but with 6.1.2 it is not working.

Am I missing something here? Do we need to do any additional configurations on 6.1.2?

Please let us know if you see something is missing here.

Thanks,
Amar


(Amarender Kasireddy) #15

I enabled the apache2 module and installed the ingest-geoip and ingest-user-agent plugins as mentioned in the user guide.


(system) #16

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.