Elastic APM GeoIP Pipeline

Hi, we want to enrich our Elastic-APM RUM data with GeoIP information and are wondering where the best place to do that is:

  1. Elasticsearch Ingest Pipeline
  2. Logstash (but it means having to route from Elastic-APM to Logstash)
  3. Wait for https://github.com/elastic/apm-server/issues/1283 (but we are not sure if this will cover all our use cases, and it is not clear to us if this is already available)

Basically, we'd like the ability to set granular GeoIP data, maybe even add more custom GeoIP pipelines.

Another use case of ours is to load test the GeoIP processing, using generators that send X-Forwarded-For headers with simulated IP addresses.

Any recommendations?

Many thanks!

The Elasticsearch Ingest Pipeline approach will be your best bet here. I expect the issue you linked to will be resolved by defining a default ingest pipeline.

The documentation can fill in the details, but briefly, you can process your RUM data using these steps:

  1. Create a pipeline definition. This will put the GeoIP results under a top-level user.geo field:
PUT _ingest/pipeline/apm_user_geoip
{
  "description": "Resolve GeoIP information for APM events",
  "processors": [
    {
      "geoip": {
        "field": "context.user.ip",
        "target_field": "user.geo",
        "ignore_missing": true
      }
    }
  ]
}

apm-server can register this for you if you prefer (see the docs).

  2. Direct apm-server to use this pipeline when indexing by updating apm-server.yml:
output.elasticsearch:
  pipelines:
  - pipeline: apm_user_geoip

  3. Verify the pipeline is configured correctly:
GET /_ingest/pipeline/apm_user_geoip/_simulate
{
 "docs": [
    {
      "_source": {
        "context": {
          "user": {
            "ip": "108.2.12.80"
          }
        }
      }
    }
  ] 
}

returns:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_type",
        "_id" : "_id",
        "_source" : {
          "context" : {
            "user" : {
              "ip" : "108.2.12.80"
            }
          },
          "user" : {
            "geo" : {
              "continent_name" : "North America",
              "region_iso_code" : "US-PA",
              "city_name" : "Philadelphia",
              "region_name" : "Pennsylvania",
              "location" : {
                "lon" : -75.1968,
                "lat" : 39.9597
              },
              "country_iso_code" : "US"
            }
          }
        },
        "_ingest" : {
          "timestamp" : "2019-01-02T19:18:18.570566Z"
        }
      }
    }
  ]
}

Events produced by the RUM (and other) agents will get the same treatment. Note that these fields are not indexed by default - you'll have to update your mapping manually to achieve that until https://github.com/elastic/apm-server/issues/1283 is resolved.
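
As a rough sketch of that manual mapping update, assuming Elasticsearch 6.x, APM indices matching apm-* with the doc mapping type, and the user.geo target field from the pipeline above, something like the following could add the geo_point field. It only succeeds on indices where user.geo.location has not already been mapped as another type, and it does not affect indices created later:

```
PUT apm-*/_mapping/doc
{
  "properties": {
    "user": {
      "properties": {
        "geo": {
          "properties": {
            "location": { "type": "geo_point" }
          }
        }
      }
    }
  }
}
```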

Thank you very much, Gil, for that information. I did everything you mentioned and was indeed getting the correct pipeline result for the context IP, but I had missed the need to map the resulting lon/lat as a geo_point. That led me to the appropriate reference, and we will be testing this today.

Kind regards,

Ronald

PS There is a reference to the geo_point mapping in 6.x but not in 6.6/6.5/lower. Is this new?

@digitalron That's great news. We'd love to hear how your testing went if you're able to report back.

The geo_point mapping itself is not new, just the documentation is - https://github.com/elastic/elasticsearch/pull/29114.

Hi Gil,

So I am now able to successfully integrate the geoip pipeline and mapping and generate a visualisation like this one:

However, I am encountering a bit of a problem.
I created a definition.json for the pipeline and configured apm-server.yml so that the pipeline is registered with Elasticsearch automatically. Without an explicit mapping, the lon and lat values are mapped as floats, and I can't change the mapping once the APM index has been created.

What I was able to do was create a geo_point mapping and apply it to the APM index BEFORE enabling APM Server. However, once APM Server onboards, this results in new fields such as context.service.name.keyword being created, which breaks searching in the Kibana UI. For example, I can't see the APM services because the service names end up in context.service.name.keyword instead of context.service.name (please see the filter in the screenshot).

As I am new to Painless and to writing pipeline definitions, I couldn't figure out a way to force the geo_point type for location in the definition.json file. I've also tried to look for clear documentation on this but couldn't find any.

Here is my definition.json:

[{
  "id": "apm_geoip",
  "body": {
    "description" : "Add geoip information for APM events",
    "processors" : [
      {
        "geoip" : {
          "field": "context.system.ip",
          "target_field": "user.geoip",
          "ignore_missing": true
        }
      },
      {
        "geoip" : {
          "field": "context.user.ip",
          "target_field": "user.geoip",
          "ignore_missing": true
        }
      }
    ]
  }
}]

and my mapping

PUT apm-6.5.4-transaction-2019.01.04
{
  "mappings": {
    "doc": {
      "properties": {
        "user.geoip": {
          "properties": {
            "location": { "type": "geo_point" }
          }
        }
      }
    }
  }
}

I know I'm doing something wrong here. I could reindex the default APM index into a new one so that the location field is converted to a geo_point, but I don't think that is right either, as no manual step should be needed once configuration and setup are done correctly.

I would greatly appreciate a pointer in the right direction, please.

Thanks and warm regards,

Ronald

I just realised I should be able to address this issue with index templates. :grin:
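
A minimal sketch of that index template approach, assuming Elasticsearch 6.x, the doc mapping type, and the user.geoip target field from the definition.json above. The template name apm_geoip_location and the order value are illustrative choices, not something APM Server defines; the order only matters if this template ever conflicts with the one APM Server installs:

```
PUT _template/apm_geoip_location
{
  "index_patterns": ["apm-*"],
  "order": 10,
  "mappings": {
    "doc": {
      "properties": {
        "user": {
          "properties": {
            "geoip": {
              "properties": {
                "location": { "type": "geo_point" }
              }
            }
          }
        }
      }
    }
  }
}
```

Because matching templates are merged when an index is created, this should apply to each new APM index alongside the template APM Server loads, so no index has to be created by hand. Existing indices keep their current mapping.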
