New GeoIP ingest processor is missing databases, but existing GeoIP processors work fine

Elasticsearch 8.4.1 and 8.5.2.

When creating a new ingest pipeline with the GeoIP processor the MaxMind databases are not found. Yet, the existing managed pipelines (logs-system.security*, logs-iptables.log*, etc) don't have any issues finding the databases.

The pipeline:

PUT _ingest/pipeline/mypipeline
{
  "processors": [
    {
      "geoip": {
        "field": "source.ip",
        "target_field": "source.geo"
      }
    }
  ]
}

The simulation:

POST _ingest/pipeline/mypipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "source": {
          "ip": "8.8.8.8"
        }
      }
    }
  ]
}

The result:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "source": {
            "ip": "8.8.8.8"
          },
          "tags": [
            "_geoip_database_unavailable_GeoLite2-City.mmdb"
          ]
        },
        "_ingest": {
          "timestamp": "2022-12-02T14:49:20.010961208Z"
        }
      }
    }
  ]
}

I've verified the MaxMind databases exist on the ingest nodes using the GET _ingest/geoip/stats?pretty API call. I've also followed all the suggestions from another post to get the databases to download.
https://discuss.elastic.co/t/geoip-processor-setup/298227

There's nothing in the elasticsearch logs to indicate a problem, and the logs do show the databases are successfully downloaded and used.

Again, the existing managed pipelines that use GeoIP are still working just fine. This error only appears on new pipelines.

Hi @twilson

Hmm, a brand-new / default 8.5.2 seems to work fine... see below.

Is this a multi-node cluster?

Was it recently upgraded?

You could try toggling this setting (you can also set it in elasticsearch.yml):

PUT _cluster/settings
{
  "persistent": {
    "ingest.geoip.downloader.enabled": false
  }
}
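For reference, the per-node equivalent of that cluster setting looks like this in elasticsearch.yml (a minimal config sketch; only the setting name is from this thread, the rest of the file is left untouched):

```yaml
# elasticsearch.yml -- per-node equivalent of the cluster setting above
ingest.geoip.downloader.enabled: false
```
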

Give it a moment to clean up.... BTW, when I set this to false I get your exact error (see further below).

Then set it back:

PUT _cluster/settings
{
  "persistent": {
    "ingest.geoip.downloader.enabled": true
  }
}

Give it a few moments to propagate and then it works...

What exactly is the output of

GET _ingest/geoip/stats?pretty

On my Brand New cluster

PUT _ingest/pipeline/discuss-geoip
{
  "processors": [
      {
        "geoip": {
          "field": "source.ip",
          "target_field": "source.geo"
        }
      }
  ]
}

POST _ingest/pipeline/discuss-geoip/_simulate
{
  "docs": [
    {
      "_source": {
        "source": {
          "ip": "8.8.8.8"
        }
      }
    }
  ]
}

# Result

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "source": {
            "geo": {
              "continent_name": "North America",
              "country_iso_code": "US",
              "country_name": "United States",
              "location": {
                "lon": -97.822,
                "lat": 37.751
              }
            },
            "ip": "8.8.8.8"
          }
        },
        "_ingest": {
          "timestamp": "2022-12-03T17:56:37.998520152Z"
        }
      }
    }
  ]
}

Now, disabling the GeoIP downloader:

PUT _cluster/settings
{
  "persistent": {
    "ingest.geoip.downloader.enabled": false
  }
}

Now run the same simulate:

POST _ingest/pipeline/discuss-geoip/_simulate
{
  "docs": [
    {
      "_source": {
        "source": {
          "ip": "8.8.8.8"
        }
      }
    }
  ]
}

# Results... Your Exact Error

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "source": {
            "ip": "8.8.8.8"
          },
          "tags": [
            "_geoip_database_unavailable_GeoLite2-City.mmdb"
          ]
        },
        "_ingest": {
          "timestamp": "2022-12-03T18:13:02.442400704Z"
        }
      }
    }
  ]
}

Then I re-enable and it works as above

PUT _cluster/settings
{
  "persistent": {
    "ingest.geoip.downloader.enabled": true
  }
}

Trust me, I've been through the entire process of disabling the downloader, restarting nodes, enabling the downloader, etc.

The databases are available on the two ingest nodes. They are not available on the non-ingest data nodes, which I believe is the proper behavior.

{
  "stats": {
    "successful_downloads": 3,
    "failed_downloads": 0,
    "total_download_time": 4767,
    "databases_count": 3,
    "skipped_updates": 3,
    "expired_databases": 0
  },
  "nodes": {
    "ICFrLIFLR9iL7yACaLbx6g": {
      "databases": [
        {
          "name": "GeoLite2-Country.mmdb"
        },
        {
          "name": "GeoLite2-ASN.mmdb"
        },
        {
          "name": "GeoLite2-City.mmdb"
        }
      ],
      "files_in_temp": [
        "GeoLite2-ASN.mmdb_elastic-geoip-database-service-agreement-LICENSE.txt",
        "GeoLite2-ASN.mmdb_LICENSE.txt",
        "GeoLite2-City.mmdb_LICENSE.txt",
        "GeoLite2-Country.mmdb_elastic-geoip-database-service-agreement-LICENSE.txt",
        "GeoLite2-ASN.mmdb",
        "GeoLite2-City.mmdb_COPYRIGHT.txt",
        "GeoLite2-City.mmdb",
        "GeoLite2-City.mmdb_elastic-geoip-database-service-agreement-LICENSE.txt",
        "GeoLite2-Country.mmdb_LICENSE.txt",
        "GeoLite2-Country.mmdb",
        "GeoLite2-ASN.mmdb_COPYRIGHT.txt",
        "GeoLite2-Country.mmdb_COPYRIGHT.txt",
        "GeoLite2-City.mmdb_README.txt"
      ]
    },
    "niiwjGDqTEWN_Bt7IyrAkQ": {
      "databases": [
        {
          "name": "GeoLite2-Country.mmdb"
        },
        {
          "name": "GeoLite2-ASN.mmdb"
        },
        {
          "name": "GeoLite2-City.mmdb"
        }
      ],
      "files_in_temp": [
        "GeoLite2-ASN.mmdb_elastic-geoip-database-service-agreement-LICENSE.txt",
        "GeoLite2-ASN.mmdb_LICENSE.txt",
        "GeoLite2-City.mmdb_LICENSE.txt",
        "GeoLite2-Country.mmdb_elastic-geoip-database-service-agreement-LICENSE.txt",
        "GeoLite2-ASN.mmdb",
        "GeoLite2-City.mmdb_COPYRIGHT.txt",
        "GeoLite2-City.mmdb",
        "GeoLite2-City.mmdb_elastic-geoip-database-service-agreement-LICENSE.txt",
        "GeoLite2-Country.mmdb_LICENSE.txt",
        "GeoLite2-Country.mmdb",
        "GeoLite2-ASN.mmdb_COPYRIGHT.txt",
        "GeoLite2-Country.mmdb_COPYRIGHT.txt",
        "GeoLite2-City.mmdb_README.txt"
      ]
    }
  }
}

I've even tried copying the mmdb files from the directory on /tmp to /etc/elasticsearch/ingest-geoip and I end up with the same error.

The only thing I can think of is this is somehow related to testing a custom mmdb in an ingest pipeline. I have an mmdb that I use in logstash to geolocate our internal IP range. As we move towards using the elastic agent more I wanted to incorporate this into a pipeline. Well, elasticsearch didn't like the database and kept throwing errors about field type mismatches, even though logstash had no problem with the database. I removed the pipeline and the associated mmdb files from /etc/elasticsearch/ingest-geoip (and even deleted that directory since it was no longer needed). This is when I noticed that new GeoIP processors aren't working but the existing GeoIP processors are working just fine.

I think this is the issue... try putting the DBs etc. on the data nodes as well... if it works I will attempt to explain why. tl;dr: my understanding (and I just learned this) is that GeoIP is actually an enrich operation under the covers... enrich data actually lives on the data nodes, not the ingest nodes... yes, that does not really make sense, but that is my understanding...

Well how about that, copying the mmdb files to /etc/elasticsearch/ingest-geoip on the data nodes and the coordinator node ended up getting things to work.

So this brings up a couple questions

  1. Will these databases update automatically or do I need to use the directions (https://www.elastic.co/guide/en/elasticsearch/reference/master/geoip-processor.html#manually-update-geoip-databases) for manually updating the databases on the non-ingest and coordinator nodes? I guess I'll check the logs on Thursday to see if they are updating automatically.
  2. If the GeoIP processor is actually an enrichment processor, why isn't elasticsearch smart enough to figure out that the databases need to be available on all the data, ingest, and coordinator nodes? I have used the enrichment processor for other purposes and the data is automatically stored on all the nodes.
  3. Why were the existing/managed GeoIP ingest processors still working without any problems and this only impacting new GeoIP processors? This is really the biggest question I have.

They will get updated automatically, assuming you have connectivity and have set ingest.geoip.downloader.enabled: true in your elasticsearch.yml (actually that is the default, so just don't set it to false).

I do not have the exact schedule / process at my fingertips... not sure why you are thinking Thursday. They are updated initially when a node is started, then I think the MaxMind databases are updated bi-weekly or something... and there is a 30-day window... this all happens automatically as the default behavior.

By default, the processor uses the GeoLite2 City, GeoLite2 Country, and GeoLite2 ASN GeoIP2 databases from MaxMind, shared under the CC BY-SA 4.0 license. Elasticsearch automatically downloads updates for these databases from the Elastic GeoIP endpoint: https://geoip.elastic.co/v1/database. To get download statistics for these updates, use the GeoIP stats API.

If your cluster can’t connect to the Elastic GeoIP endpoint or you want to manage your own updates, see Manage your own GeoIP2 database updates.

If Elasticsearch can’t connect to the endpoint for 30 days all updated databases will become invalid. Elasticsearch will stop enriching documents with geoip data and will add tags: ["_geoip_expired_database"] field instead.

That is a looooong discussion and it is recognized as an issue but it goes something like this.

  • Data nodes hold data / indices
  • Ingest nodes, by role / definition, do not hold data
  • Enrich policies are data / indices (GeoIP included)

So the enrich policies, which are data / indices, live on the data nodes (not the ingest nodes) and are accessed by the ingest nodes / pipelines, which reach out to the data nodes.

The Engineering / Product team recognizes this and I think is working on a solution as part of a larger ingest processing theme, the current behavior is a byproduct of strong node roles....

No Clue... perhaps those indices were still on the Data Nodes...

I do not have the exact schedule / process at my fingertips... not sure why you are thinking Thursday. They are updated initially when a node is started, then I think the MaxMind databases are updated bi-weekly or something... and there is a 30-day window... this all happens automatically as the default behavior.

According to the documentation the databases are checked every 3 days for updates. I can confirm that on my ingest nodes as I see log entries like this every 3 days

evicted [500] entries from cache after reloading database [/tmp/elasticsearch-16375421549390025677/geoip-databases/H-evuPMGSjG8Bg0_s9Njpw/GeoLite2-City.mmdb]
successfully loaded geoip database file [GeoLite2-City.mmdb]

However, I'm not seeing these messages on the data nodes, which would indicate the database files on the data nodes need to be updated manually (cron job), ideally on the same schedule as the ingest nodes.

Cool, I just saw the 3 days ...

By default, Elasticsearch checks the endpoint for updates every three days. To use another polling interval, use the cluster update settings API to set ingest.geoip.downloader.poll.interval.
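For example, the ingest.geoip.downloader.poll.interval setting mentioned above can be changed like the other cluster settings in this thread (the one-day value here is just illustrative):

```console
PUT _cluster/settings
{
  "persistent": {
    "ingest.geoip.downloader.poll.interval": "1d"
  }
}
```
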

Not sure why... I have never needed to load them manually; that is the whole point of the system.

When did you fix the data nodes?

Did you put the download enabled true setting in the elasticsearch.yml on each of the nodes?

Did you restart the nodes... are there other nodes where you set that to false?

Do the data nodes have connectivity to the download server?

If properly configured you should not need to manually update.

You should be able to take out all the settings you have been playing with ... all the defaults should do all the work. Other than some testing I always use the defaults, and the GeoIP DBs are always kept up to date.

When did you fix the data nodes?

I copied the mmdb files to /etc/elasticsearch/ingest-geoip on the data nodes this morning, so they would not have updated yet based on the 3-day interval. But now I have removed them from the ingest-geoip directory and restarted the nodes, hoping that the files will be automatically downloaded, as you say they should be.

Did you put the download enabled true setting in the elasticsearch.yml on each of the nodes?

I have not set this in the configuration of any nodes, since I assumed the default cluster setting would be to enable the download for all nodes. With the cluster setting unset (the default is enabled), I see the updated mmdb files being recognized on the ingest nodes, but not on the data nodes. Explicitly enabling the setting in the cluster settings results in the same behavior.

I've added this to one of the data nodes and restarted it. The databases were not downloaded at the startup of the service (the /tmp/systemd-private-* path was mentioned in the post I referenced above where you were assisting another person with a similar problem).

ls -l /tmp/systemd-private-0802e64137d84063b3c6395fd7b63f1a-elasticsearch.service-opNH7t/tmp/elasticsearch-9805782874506981636/geoip-databases/k6NjbiaIQK2CzxOB-18ICg/

total 0

Debug logging seems to indicate the data nodes want to use GeoIP databases, as the logs show a directory being specified:

[2022-12-05T15:27:31,133][DEBUG][o.e.i.g.DatabaseNodeService] [data4] initialized database node service, using geoip-databases directory [/tmp/elasticsearch-18183017223020425866/geoip-databases/PCRsWJH2SeKXD5qlKY0-iw]

But that directory doesn't exist. The directory mentioned in other posts about missing GeoIP databases is also empty, which tells me the data nodes are not downloading the databases as they should be.

Did you restart the nodes...

Yes, a rolling restart was performed of all the elasticsearch nodes.

are there other nodes where you set that to false?

There are no nodes where the download setting is set to false.

Do the data nodes have connectivity to the download server?

Yes, all nodes have full internet connectivity.

Not sure what your issue with /tmp is.

You can purposely set enabled to false and start the node; it should clean everything up.

Then set it to true and it should pull down and load.... In a cluster I am not sure if it pulls to every node or not; what is important is that the data gets loaded into the cluster. I will have to ask whether every node pulls the DBs or only one selected node... not sure. (Checked: it appears to download to each node, at my first look.)

green open .geoip_databases ytTW2O08Rp6gIVOwfjgblw 1 0 41 0 38.9mb 38.9mb

It is available ... but is it available and working for you?

Are we just debugging the downloading at this point?

Sorry to be repetitious, and you also cleared the cluster setting... we just want to make sure there is no inconsistency...

OK @twilson, after consulting the internal folks, here is how it all works....

So for this discussion I have a 3 node Cluster. For this Exercise the masters are moot.
1 Ingest Only (es01)
2 Data Only (es02, es03)

note the roles... really important

GET _cat/nodes?v
ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.16.4           33         100   4    0.35    0.25     0.65 dm        *      es02
192.168.16.5           38         100   4    0.35    0.25     0.65 dm        -      es03
192.168.16.3           59         100   4    0.35    0.25     0.65 im        -      es01

With respect to downloading the MaxMind databases, here is what happens: the MaxMind mmdb databases are ONLY downloaded to ingest nodes, NOT to nodes without the ingest role. And, to slightly confuse the matter, on the data-only nodes the /tmp directory will be created but the actual mmdbs will not be downloaded (I think this is what you are seeing).

So on my cluster
es01 has the mmdb files
es02, es03 have the directory but no actual files

With respect to the .geoip_databases index: it only lives on data nodes since, as we discussed, it is actually data / an index.

Now here is the really, really interesting part... perhaps this is what we were running into. I spoke internally; this is an edge case, but we should probably document what is going on.

Follow along, let's see if this is it.

Let's actually index a document using the discuss-geoip pipeline.

First we will index via the ingest node es01, and it works as expected.

https://es01:9200/discuss-geoip-index/_doc/?pipeline=discuss-geoip
# Body
{
  "source": {
    "ip": "8.8.8.8"
  }
}

# Result
{
    "_index": "discuss-geoip-index",
    "_id": "g3Q-5YQBl5l69aM7WMi1",
    "_version": 1,
    "result": "created",
    "_shards": {
        "total": 2,
        "successful": 2,
        "failed": 0
    },
    "_seq_no": 5,
    "_primary_term": 1
}

Next we will index via the data node es02, and it also works as expected. The data node receives the indexing request and, since the request specifies an ingest pipeline, routes it to the ingest node es01; then it works just as if we had sent it to es01 in the first place.

https://es02:9200/discuss-geoip-index/_doc/?pipeline=discuss-geoip
# Body
{
  "source": {
    "ip": "8.8.8.8"
  }
}

# Result 
{
    "_index": "discuss-geoip-index",
    "_id": "hHRK5YQBl5l69aM7Vchv",
    "_version": 1,
    "result": "created",
    "_shards": {
        "total": 2,
        "successful": 2,
        "failed": 0
    },
    "_seq_no": 6,
    "_primary_term": 1
}

Now here is where the confusion comes in....

Let's _simulate.

First we will simulate against the ingest node es01, and we will get the expected results.

https://es01:9200/_ingest/pipeline/discuss-geoip/_simulate
# Body
{
  "docs": [
    {
      "_source": {
        "source": {
          "ip": "8.8.8.8"
        }
      }
    }
  ]
}

# Results
{
    "docs": [
        {
            "doc": {
                "_index": "_index",
                "_id": "_id",
                "_version": "-3",
                "_source": {
                    "source": {
                        "geo": {
                            "continent_name": "North America",
                            "country_iso_code": "US",
                            "country_name": "United States",
                            "location": {
                                "lon": -97.822,
                                "lat": 37.751
                            }
                        },
                        "ip": "8.8.8.8"
                    }
                },
                "_ingest": {
                    "timestamp": "2022-12-06T02:38:11.958718312Z"
                }
            }
        }
    ]
}

Now, the case I think you are running into, and the one that we should probably document.

Let's now _simulate against the data-only node es02:

https://es02:9200/_ingest/pipeline/discuss-geoip/_simulate
# Body
{
  "docs": [
    {
      "_source": {
        "source": {
          "ip": "8.8.8.8"
        }
      }
    }
  ]
}

# Results .... Should look familiar

{
    "docs": [
        {
            "doc": {
                "_index": "_index",
                "_id": "_id",
                "_version": "-3",
                "_source": {
                    "source": {
                        "ip": "8.8.8.8"
                    },
                    "tags": [
                        "_geoip_database_unavailable_GeoLite2-City.mmdb"
                    ]
                },
                "_ingest": {
                    "timestamp": "2022-12-06T02:42:39.943966396Z"
                }
            }
        }
    ]
}

So it appears that you cannot run a _simulate that relies on the actual mmdbs against a data node that does not have them. Or, more correctly, you can run the _simulate, but if it relies on the GeoIP database it will fail.

Actually indexing a document will get properly routed, but it appears that _simulate does not automatically get routed to an ingest node where the mmdbs are available; it is executed on whichever node it is directed to, regardless of whether that is an ingest node or not.

I did test that you can _simulate a pipeline on a data-only node if it does not have that dependency, and it works. I tested and validated that.

I am thinking that this hits all the points... perhaps not.

So summary

  • The GeoIP mmdbs are only loaded onto ingest nodes (even though other nodes create empty directories)
  • You can index a document with the geoip processor / pipeline via either a data node or an ingest node, and it will get properly routed to the ingest node.
  • _simulate with the geoip processor will only work properly on an ingest node and will fail on a data-only node.
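Until _simulate requests are routed automatically, a client-side workaround is to pick an ingest-capable node yourself before issuing the _simulate call. A minimal Python sketch (the helper name and node list are hypothetical; the role letters are in the format returned by GET _cat/nodes above):

```python
def ingest_capable(cat_nodes_lines):
    """Given `name roles` lines as from GET _cat/nodes?h=name,node.role,
    return the names of nodes whose role string contains 'i' (ingest)."""
    names = []
    for line in cat_nodes_lines:
        name, roles = line.split()
        if "i" in roles:
            names.append(name)
    return names

# Role strings from the three-node cluster in this thread:
nodes = ["es02 dm", "es03 dm", "es01 im"]
print(ingest_capable(nodes))  # ['es01'] -- send _simulate to this host
```
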

I am posting these findings internally... I will keep an eye out for the response.

Your explanation makes perfect sense. I have Kibana pointed at a non-ingest, non-data node so the mmdb files wouldn't exist on that node, hence the error message. Using curl I was able to simulate against an ingest node and the pipeline worked, and simulating against a data node failed. That's the behavior you outlined, and is what is to be expected.

I guess I was under the impression that using _simulate on a pipeline through the Dev Tools console would route the request to an appropriate (i.e. ingest) node. I'm going to add this to my notes.

Now I just need to figure out the type mismatch errors I get when using a custom mmdb, but that's a discussion for a new thread at some point.

Thank you very much for the help.


You are not the only one on this one ... including myself until last night.

Internally the team is discussing ... thinking we should consider it a bug. Thanks for bringing it to our attention... very subtle and hard to figure out!

Thanks for finding this! We have a PR up to fix it in the way that you describe -- Forwarding simulate calls to ingest nodes by masseyke · Pull Request #92171 · elastic/elasticsearch · GitHub.

