Apache HTTPD logs with (comma-delimited) X-Forwarded-For IPs

Hi Folks,

I thought I'd be able to use the Filebeat Apache module out of the box, but then I noticed how it handles X-Forwarded-For: it only seems to capture the last (inner-most) IP in the list, which is a proxy address, not the end-user IP.

HTTPD Log Format

LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""

Sample:

2.2.2.2, 10.100.49.203 - - [21/May/2019:21:20:41 +0000] "GET /resource/reportmanagement/published/ESD_900000010053582_05112019_900000010189257_1526056201037.docx HTTP/1.1" 301 353 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
2.2.2.2, 10.100.48.62 - - [21/May/2019:21:20:41 +0000] "GET /onecpd/includes/themes/hudexchange/images/favicon.ico HTTP/1.1" 200 3638 "-" "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
2.2.2.2, 10.100.49.203 - - [21/May/2019:21:20:41 +0000] "GET /trainings/courses HTTP/1.1" 200 7175 "-" "PiplBot (+http://www.pipl.com/bot/)"
2.2.2.2, 10.100.49.203 - - [21/May/2019:21:20:41 +0000] "GET /s3redirect/?ref=/resource/reportmanagement/published/ESD_900000010053582_05112019_900000010189257_1526056201037.docx HTTP/1.1" 301 86 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"	

I want the 2.2.2.2 addresses, but I get the 10.100.* addresses:

{
  "_index": "filebeat-7.1.0-2019.05.20-000001",
  "_type": "_doc",
  "_id": "1JRv2GoBZOU7qiPlQNcD",
  "_version": 1,
  "_score": null,
  "_source": {
    "agent": {
      "hostname": "f64aab1e2619",
      "id": "38dfc846-4fec-47df-a33b-cde479c27f11",
      "type": "filebeat",
      "ephemeral_id": "6fe6e7e5-36d1-4d04-a9a4-ae263c6a345b",
      "version": "7.1.0"
    },
    "log": {
      "file": {
        "path": "/var/log/apache2/access_log"
      },
      "offset": 0
    },
    "source": {
      "address": "10.100.49.203",
      "ip": "10.100.49.203"
    },
    "fileset": {
      "name": "access"
    },
    "url": {
      "original": "/resource/reportmanagement/published/ESD_900000010053582_05112019_900000010189257_1526056201037.docx"
    },
    "input": {
      "type": "log"
    },
    "apache": {
      "access": {}
    },
    "@timestamp": "2019-05-21T21:20:41.000Z",
    "ecs": {
      "version": "1.0.0"
    },
    "service": {
      "type": "apache"
    },
    "host": {
      "name": "f64aab1e2619"
    },
    "http": {
      "request": {
        "referrer": "-",
        "method": "GET"
      },
      "response": {
        "status_code": 301,
        "body": {
          "bytes": 353
        }
      },
      "version": "1.1"
    },
    "event": {
      "created": "2019-05-21T03:28:09.661Z",
      "module": "apache",
      "dataset": "apache.access"
    },
    "user": {
      "name": "-"
    },
    "user_agent": {
      "original": "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
      "os": {
        "name": "Android",
        "version": "6.0.1",
        "full": "Android 6.0.1"
      },
      "name": "Googlebot",
      "device": {
        "name": "Spider"
      },
      "version": "2.1"
    }
  },
  "fields": {
    "suricata.eve.timestamp": [
      "2019-05-21T21:20:41.000Z"
    ],
    "@timestamp": [
      "2019-05-21T21:20:41.000Z"
    ],
    "event.created": [
      "2019-05-21T03:28:09.661Z"
    ]
  },
  "sort": [
    1558473641000
  ]
}

I'd like to leverage as much of the auto fanciness as possible. Is there any tweak I can make to keep this in Filebeat instead of cobbling something together with Logstash?

I saw some similar topics from a few years back, but I wasn't sure if there had been any Beats developments in the meantime.

Thanks,
Jamie

Hi @jamiejackson,

The Filebeat Apache module only parses the common log format that Apache uses by default. If you need to parse a custom log format, you need to define your own pipeline. Note that even with a custom pipeline you don't need Logstash just for that; you can do it with an Elasticsearch ingest node. You can read more about this here: https://www.elastic.co/guide/en/beats/filebeat/7.0/configuring-ingest-node.html
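
To give a rough idea of the shape of such a pipeline, here is a minimal, untested sketch (the pipeline name is a placeholder, and the grok only captures the first X-Forwarded-For address into source.address):

// hypothetical pipeline name; extracts only the first IP of the comma-delimited list
PUT _ingest/pipeline/my_xff_pipeline
{
  "description": "Example only: capture the first X-Forwarded-For IP as source.address",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "^%{IPORHOST:source.address}(?:, %{IPORHOST})* %{GREEDYDATA}"
        ]
      }
    }
  ]
}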

Alternatively, if you don't want the 10.100.* addresses, you could consider removing that part from the log format and replacing it with the X-Forwarded-For part, so the original IP would be captured as the source IP.
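
A common way to achieve that, named here only as a suggestion, is Apache's mod_remoteip, which resolves X-Forwarded-For to the real client address before logging, so the stock combined format (and hence the stock module pipeline) keeps working. A sketch, assuming your proxies live in 10.100.0.0/16 (inferred from your samples, so adjust to your network):

# assumption: proxies are in 10.100.0.0/16; change to match your infrastructure
<IfModule remoteip_module>
  RemoteIPHeader X-Forwarded-For
  RemoteIPInternalProxy 10.100.0.0/16
</IfModule>
# %a is the client IP as resolved by mod_remoteip
LogFormat "%a %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined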

Thanks for the response.

A few questions:

I hadn't looked into ingest pipelines before. Am I on the right track with this? (FYI, the ip_trail field is my attempt at preserving the original list of IPs from X-Forwarded-For.)

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Parse HTTP Access Logs",
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": [
            "^%{IP_LIST:ip_trail}"
          ],
          "pattern_definitions": {
            "IP_LIST": "(?:-|(?:%{IPORHOST}(?:, [^\\s]+)*))"
          }
        }
      },
      {
        "split": {
          "field": "ip_trail",
          "separator": "(,\\s+)"
        }
      },
      {
        "gsub": {
          "field": "message",
          "pattern": "^(\\d+\\.\\d+\\.\\d+\\.\\d+)(, \\d+\\.\\d+\\.\\d+\\.\\d+)*",
          "replacement": "$1"
        }
      }
    ]
  },
  "docs": [
    { "_source": { "message": "2.2.2.2, 10.100.49.203 - - [21/May/2019:21:20:41 +0000] \"GET /resource/reportmanagement/published/ESD_900000010053582_05112019_900000010189257_1526056201037.docx HTTP/1.1\" 301 353 \"-\" \"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)\"" } },
    { "_source": { "message": "2.2.2.2, 10.100.49.203 - - [21/May/2019:21:20:41 +0000] \"GET /resource/reportmanagement/published/ESD_900000010053582_05112019_900000010189257_1526056201037.docx HTTP/1.1\" 301 353 \"-\" \"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)\"" } },
    { "_source": { "message": "12.234.21.234 - - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond" } },
    { "_source": { "message": "- - - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond" } }
  ]
}

Which yields:

{
    "docs": [
        {
            "doc": {
                "_index": "_index",
                "_type": "_doc",
                "_id": "_id",
                "_source": {
                    "message": "2.2.2.2 - - [21/May/2019:21:20:41 +0000] \"GET /resource/reportmanagement/published/ESD_900000010053582_05112019_900000010189257_1526056201037.docx HTTP/1.1\" 301 353 \"-\" \"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)\"",
                    "ip_trail": [
                        "2.2.2.2",
                        "10.100.49.203"
                    ]
                },
                "_ingest": {
                    "timestamp": "2019-05-21T16:46:10.92033Z"
                }
            }
        },
        {
            "doc": {
                "_index": "_index",
                "_type": "_doc",
                "_id": "_id",
                "_source": {
                    "message": "2.2.2.2 - - [21/May/2019:21:20:41 +0000] \"GET /resource/reportmanagement/published/ESD_900000010053582_05112019_900000010189257_1526056201037.docx HTTP/1.1\" 301 353 \"-\" \"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)\"",
                    "ip_trail": [
                        "2.2.2.2",
                        "10.100.49.203"
                    ]
                },
                "_ingest": {
                    "timestamp": "2019-05-21T16:46:10.920343Z"
                }
            }
        },
        {
            "doc": {
                "_index": "_index",
                "_type": "_doc",
                "_id": "_id",
                "_source": {
                    "message": "12.234.21.234 - - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond",
                    "ip_trail": [
                        "12.234.21.234"
                    ]
                },
                "_ingest": {
                    "timestamp": "2019-05-21T16:46:10.920348Z"
                }
            }
        },
        {
            "doc": {
                "_index": "_index",
                "_type": "_doc",
                "_id": "_id",
                "_source": {
                    "message": "- - - [08/Jan/2018:00:00:26 -0500] RspTime= 75484 microsecond",
                    "ip_trail": [
                        "-"
                    ]
                },
                "_ingest": {
                    "timestamp": "2019-05-21T16:46:10.920353Z"
                }
            }
        }
    ]
}

Also, this will need to be automated. Is there a way for Filebeat to create the pipeline on Elasticsearch, or must that be a separate step?

Finally, let me know if there are any refinements I should consider. (I'm new to all of this, so tips are appreciated.)

Alternatively, if you don't want the 10.100.* addresses, you could consider removing that part from the log format and replacing it with the X-Forwarded-For part, so the original IP would be captured as the source IP.

Thanks, I might consider that, down the line, if those intermediate addresses don't prove useful.


My latest problem is that the pipeline I'm specifying in filebeat.yml doesn't seem to be honored. (It seems to use the module's default.)

filebeat.modules:
  - module: apache
    access:
      enabled: true
      var.paths: ["/var/log/apache2/access*"]
    error:
      enabled: true
      var.paths: ["/var/log/apache2/error*"]

filebeat.config.modules:
  reload.enabled: true
  reload.period: 5s
  
setup.kibana:
  host: "kibana-monitoring:5601"

output.elasticsearch:
  hosts: ["elasticsearch-monitoring:9200"]
  pipeline: apache_with_x_forwarded_for

Are modules incompatible with output.elasticsearch.pipeline specifications?

Yes, the idea would be to define your own pipeline, as you are doing. You could start with the one used by the module, which you can find here: https://github.com/elastic/beats/blob/v7.0.0/filebeat/module/apache/access/ingest/default.json
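
If Filebeat has already loaded the module assets into the cluster, you can also pull the installed pipeline straight from Elasticsearch. The exact pipeline ID is version-dependent (the one below is a guess based on the default.json file name), so listing all pipelines first is safest:

// list everything, then fetch the module pipeline by ID
// (the ID below is an assumption; check the listing for the real one)
GET _ingest/pipeline
GET _ingest/pipeline/filebeat-7.1.0-apache-access-default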

You have to install the custom pipelines that you create; to automate this you can use the ingest pipeline API.
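
For example, something like this (an untested sketch that just reuses the processors from your simulate call, under the pipeline name already referenced in your filebeat.yml):

// installs the pipeline under the name used in output.elasticsearch.pipeline
PUT _ingest/pipeline/apache_with_x_forwarded_for
{
  "description": "Parse Apache access logs with a comma-delimited X-Forwarded-For list",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["^%{IP_LIST:ip_trail}"],
        "pattern_definitions": {
          "IP_LIST": "(?:-|(?:%{IPORHOST}(?:, [^\\s]+)*))"
        }
      }
    },
    {
      "split": {
        "field": "ip_trail",
        "separator": "(,\\s+)"
      }
    },
    {
      "gsub": {
        "field": "message",
        "pattern": "^(\\d+\\.\\d+\\.\\d+\\.\\d+)(, \\d+\\.\\d+\\.\\d+\\.\\d+)*",
        "replacement": "$1"
      }
    }
  ]
}

In practice you would append the rest of the module's processors after these, since output.elasticsearch.pipeline replaces the module's pipeline rather than chaining with it.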

It can be tricky to make modules work with custom pipelines: you need to override the settings of the input defined by the module. In your case, something like this could work (not tested):

filebeat.modules:
  - module: apache
    access:
      input:
        pipeline: apache_with_x_forwarded_for
        paths: ["/var/log/apache2/access*"]
    error:
      var.paths: ["/var/log/apache2/error*"]

filebeat.config.modules:
  reload.enabled: true
  reload.period: 5s
  
setup.kibana:
  host: "kibana-monitoring:5601"

output.elasticsearch:
  hosts: ["elasticsearch-monitoring:9200"]

You can find more information in the documentation about advanced module configuration; the options you can use under input are those of the log input.
