OK @twilson, after consulting the internal folks, here is how it all works.
For this discussion I have a 3-node cluster. For this exercise the masters are moot.
1 Ingest Only (es01)
2 Data Only (es02, es03)
Note the roles... they are really important.
GET _cat/nodes?v
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.16.4 33 100 4 0.35 0.25 0.65 dm * es02
192.168.16.5 38 100 4 0.35 0.25 0.65 dm - es03
192.168.16.3 59 100 4 0.35 0.25 0.65 im - es01
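If you want to check the roles from a script, here is a minimal Python sketch that reads the _cat/nodes output above and splits the nodes by whether the node.role string contains i (ingest). The function name nodes_by_role is only illustrative, and the output is pasted in as a string rather than fetched from the cluster.

```python
# _cat/nodes?v output copied from the example cluster above.
CAT_NODES = """\
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.16.4 33 100 4 0.35 0.25 0.65 dm * es02
192.168.16.5 38 100 4 0.35 0.25 0.65 dm - es03
192.168.16.3 59 100 4 0.35 0.25 0.65 im - es01
"""


def nodes_by_role(cat_output):
    """Return (ingest_nodes, other_nodes) based on the node.role column."""
    lines = cat_output.strip().splitlines()
    header = lines[0].split()
    role_col = header.index("node.role")
    name_col = header.index("name")
    ingest, other = [], []
    for line in lines[1:]:
        cols = line.split()
        # "i" in the role string means the node has the ingest role.
        if "i" in cols[role_col]:
            ingest.append(cols[name_col])
        else:
            other.append(cols[name_col])
    return ingest, other


ingest_nodes, other_nodes = nodes_by_role(CAT_NODES)
print("ingest:", ingest_nodes)
print("not ingest:", other_nodes)
```

On my cluster only es01 lands in the ingest list, which is what matters for everything below.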
With respect to downloading the MaxMind databases, here is what happens.
The MaxMind mmdb databases are downloaded ONLY to ingest nodes, NOT to nodes without the ingest role. To slightly confuse the matter, on the data-only nodes the /tmp directory will still be created, but the actual mmdb files will not be downloaded (I think this is what you are seeing).
So on my cluster:
es01 has the mmdb files
es02 and es03 have the directory but no actual files
With respect to the .geoip_databases index:
It lives only on data nodes since, as we discussed, it is actually data (an index).
Now here is the really interesting part, and quite possibly what we were running into. I spoke internally: this is an edge case, but we should probably document what is going on.
Follow along and let's see if this is it.
First, let's actually index a document using the discuss-geoip pipeline.
We will start by indexing via the ingest node es01, and it works as expected.
https://es01:9200/discuss-geoip-index/_doc/?pipeline=discuss-geoip
# Body
{
"source": {
"ip": "8.8.8.8"
}
}
# Result
{
"_index": "discuss-geoip-index",
"_id": "g3Q-5YQBl5l69aM7WMi1",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"_seq_no": 5,
"_primary_term": 1
}
Next we will index via the data node es02, and it works as expected. The data node receives the indexing request and, since the request specifies an ingest pipeline, routes it to the ingest node es01. From there it works just as if we had sent it to es01 in the first place.
https://es02:9200/discuss-geoip-index/_doc/?pipeline=discuss-geoip
# Body
{
"source": {
"ip": "8.8.8.8"
}
}
# Result
{
"_index": "discuss-geoip-index",
"_id": "hHRK5YQBl5l69aM7Vchv",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"_seq_no": 6,
"_primary_term": 1
}
Now here is where the confusion comes in.
Let's _simulate.
First we will simulate against the ingest node es01, and we get the expected results.
https://es01:9200/_ingest/pipeline/discuss-geoip/_simulate
# Body
{
"docs": [
{
"_source": {
"source": {
"ip": "8.8.8.8"
}
}
}
]
}
# Results
{
"docs": [
{
"doc": {
"_index": "_index",
"_id": "_id",
"_version": "-3",
"_source": {
"source": {
"geo": {
"continent_name": "North America",
"country_iso_code": "US",
"country_name": "United States",
"location": {
"lon": -97.822,
"lat": 37.751
}
},
"ip": "8.8.8.8"
}
},
"_ingest": {
"timestamp": "2022-12-06T02:38:11.958718312Z"
}
}
}
]
}
Now for the case I think you are running into, and the case that we should probably document.
Let's now _simulate against the data-only node es02.
https://es02:9200/_ingest/pipeline/discuss-geoip/_simulate
# Body
{
"docs": [
{
"_source": {
"source": {
"ip": "8.8.8.8"
}
}
}
]
}
# Results .... Should look familiar
{
"docs": [
{
"doc": {
"_index": "_index",
"_id": "_id",
"_version": "-3",
"_source": {
"source": {
"ip": "8.8.8.8"
},
"tags": [
"_geoip_database_unavailable_GeoLite2-City.mmdb"
]
},
"_ingest": {
"timestamp": "2022-12-06T02:42:39.943966396Z"
}
}
}
]
}
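If you are scripting against _simulate, it may be worth checking for that tag explicitly rather than eyeballing the output. A minimal Python sketch, using the response above pasted in as a dict; the helper name geoip_unavailable_tags is made up for this example, and the prefix match is based only on the tag seen in this one response.

```python
# _simulate response from the data-only node es02, copied from above.
RESPONSE = {
    "docs": [
        {
            "doc": {
                "_index": "_index",
                "_id": "_id",
                "_version": "-3",
                "_source": {
                    "source": {"ip": "8.8.8.8"},
                    "tags": ["_geoip_database_unavailable_GeoLite2-City.mmdb"],
                },
                "_ingest": {"timestamp": "2022-12-06T02:42:39.943966396Z"},
            }
        }
    ]
}

# Prefix of the tag observed when the mmdb file is missing on the node.
UNAVAILABLE_PREFIX = "_geoip_database_unavailable_"


def geoip_unavailable_tags(simulate_response):
    """Collect every 'database unavailable' tag across the simulated docs."""
    found = []
    for entry in simulate_response.get("docs", []):
        source = entry.get("doc", {}).get("_source", {})
        for tag in source.get("tags", []):
            if tag.startswith(UNAVAILABLE_PREFIX):
                found.append(tag)
    return found


missing = geoip_unavailable_tags(RESPONSE)
if missing:
    print("geoip lookup did not run on this node:", missing)
```

An empty list means none of the simulated docs carried the tag, as in the es01 response earlier.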
So it appears that you cannot run a _simulate that relies on the actual mmdb files against a data node that does not have them. Or, more correctly, you can run the _simulate, but if it relies on the geoip database it will fail.
Actually indexing a document gets properly routed, but it appears that _simulate is not automatically routed to an ingest node where the mmdb files are available. It is executed on the node it is sent to, regardless of whether that node is an ingest node.
I did test and validate that you can _simulate a pipeline on a data-only node if the pipeline does not have that dependency, and it works.
I am thinking that this hits all the points... perhaps not.
So, in summary:
- The geoip mmdb files are only loaded on ingest nodes (even though other nodes create empty directories).
- You can index a document with the geoip processor / pipeline via either a data node or an ingest node, and it will be properly routed to the ingest node.
- _simulate with the geoip processor only works properly on an ingest node and will fail on a data-only node.
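Given the last point, the practical workaround when scripting is to send _simulate to a node that has the ingest role. Here is a minimal Python sketch, assuming the node names and roles from my cluster above; the helper name simulate_request and the hard-coded role map are just for illustration, and the request is built but not actually sent.

```python
import json
import urllib.request

# Roles as reported by _cat/nodes on the example cluster (assumed static here).
NODE_ROLES = {"es01": "im", "es02": "dm", "es03": "dm"}


def simulate_request(pipeline, doc_source, node_roles=NODE_ROLES):
    """Build a _simulate request aimed at a node that has the ingest role."""
    ingest_nodes = sorted(n for n, roles in node_roles.items() if "i" in roles)
    if not ingest_nodes:
        raise RuntimeError("no ingest node available for _simulate")
    url = f"https://{ingest_nodes[0]}:9200/_ingest/pipeline/{pipeline}/_simulate"
    body = json.dumps({"docs": [{"_source": doc_source}]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = simulate_request("discuss-geoip", {"source": {"ip": "8.8.8.8"}})
print(req.get_method(), req.full_url)
```

Actually sending it (authentication, TLS settings and so on) depends on your setup, so I left that part out.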
I am posting these findings internally... I will keep an eye out for the response.