Transforms - how to add common fields (e.g. geoip fields) to the transform output?

Basically, I have thousands of embedded devices that periodically upload log files to a backend we've created. After some preprocessing, the log files are ingested into Elasticsearch using the geoip ingest pipeline. At ingest time, the IP address the device connected from has been injected into every log line, so we get the geoip fields added to every document, together with a bunch of other information related to the log file, such as device serial number, session id, etc. These fields are always present in every document, and within a log file (= same session id) they are identical between documents (the serial number will not change during a session, the session GUID is the same for all documents belonging to a particular log file, etc.).

Now, I want to do a transform to compute some more complicated metrics from each log session, and I've run into trouble...

I figured out a (crude) way to add the fields that are identical and always present within a log file (all documents in a log file have the same session GUID): I start by grouping by session id, and then I also group by serial number etc., because those will not change within a session id. So in reality I am just grouping by the session id, and the additional groupings on serial number and so on simply add those fields to the transform output.

Now, with geoip it gets trickier... Not all geoip fields are always present after ingesting a document through the geoip ingest pipeline (for example, region, city, etc. may not be added if the device connected from a remote area), so trying to group_by every geoip field means only a fraction of the log files get transformed. If a group_by field is missing, the transform simply disregards all documents in that group, no aggregations are performed, and I end up with only the documents that have a complete set of geoip fields being transformed.

So, how can I add geoip fields and values that are present and identical in every document of a group to the result of my transformation? I may be missing something obvious here (I'm a newbie on Elastic), but I really need some help on how this is best done: how can I add specific fields and values from a document to the resulting document of the transform?

Hope this makes sense - otherwise, ask and I'll try to clarify...

There is currently no way to do this with the out-of-the-box aggregations, but you can use a scripted_metric, which gives you scripting abilities to solve your problem.

Our docs contain some Painless examples; in your case you could take the first value that is not null.

Another option is using a terms aggregation; this will return a frequency list, e.g.

"a": 34,
"b": 555,
"c": 2

However, that format isn't suitable for building e.g. a dashboard on top of it.

You can set missing_bucket to true in order to group on sparse data, the syntax is the same as for composite aggregations.
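For illustration, a minimal group_by entry with the flag might look like this (the field name here is just an example; the flag sits next to `field` inside the `terms` object, mirroring composite aggregation sources):

```json
{
  "group_by": {
    "session.geoip.city_name.keyword": {
      "terms": {
        "field": "session.geoip.city_name.keyword",
        "missing_bucket": true
      }
    }
  }
}
```

With `missing_bucket: true`, documents without the field fall into a null bucket instead of being dropped.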


Thanks for the information. I've done some simple scripted metrics, but I am not really sure how to extract and then inject the information I want using painless.

Anyway, I tried setting missing_bucket to true, but it still seems to ignore documents with missing fields. This is the grouping I have defined in the "pivot configuration object" in the Kibana "Create Transform" UI:

{
  "group_by": {
    "session.id": {
      "terms": {
        "field": "session.id"
      }
    },
    "session.device.serial.keyword": {
      "terms": {
        "field": "session.device.serial.keyword"
      }
    },
    "session.ip.keyword": {
      "terms": {
        "field": "session.ip.keyword"
      }
    },
    "session.geoip.city_name.keyword": {
      "terms": {
        "field": "session.geoip.city_name.keyword",
        "missing_bucket": true
      }
    },
    "session.geoip.continent_name.keyword": {
      "terms": {
        "field": "session.geoip.continent_name.keyword",
        "missing_bucket": true
      }
    },
    "session.geoip.country_iso_code": {
      "terms": {
        "field": "session.geoip.country_iso_code",
        "missing_bucket": true
      }
    },
    "session.geoip.region_iso_code": {
      "terms": {
        "field": "session.geoip.region_iso_code",
        "missing_bucket": true
      }
    },
    "session.geoip.region_name.keyword": {
      "terms": {
        "field": "session.geoip.region_name.keyword",
        "missing_bucket": true
      }
    }
  },
  "aggregations": {
    "session.timestamp.start": {
      "min": {
        "field": "@timestamp"
      }
    },
    "session.timestamp.end": {
      "max": {
        "field": "@timestamp"
      }
    },
    "session.timestamp.duration": {
      "bucket_script": {
        "buckets_path": {
          "min": "session.timestamp.start.value",
          "max": "session.timestamp.end.value"
        },
        "script": "params.max - params.min"
      }
    }
  }
}

Thing is, when I create the transform, the "missing_bucket": true is removed and the transform only creates documents where all geoip fields are present.

I would probably be better off writing a script (seems to be the proper way of doing this, anyway).

So, to do it properly, I would only want to group on "session.id", because all fields under "session" will be identical between documents for a given "session.id", and I would then use a scripted metric to grab the "session" field and subfields and inject them into the transform output document.

At the risk of coming across as a bit of a needy newbie (:slight_smile:), the first hurdle I've tried to overcome this evening is what a map_script that copies the session field and subfields into the state would look like. I know what I want to do in pseudo-code, but going from there to Painless is a bit too big a step for me to pull off right now. Do you have any better examples than the ones provided in the transforms documentation? Or can you point me in the right direction with a code snippet that does something similar? I can't muster up the courage to expose my utter lack of Painless scripting skills by posting my ridiculously naive attempts right now, but if I can get a starting point I can probably figure this out, I hope...

Are you using the kibana advanced editor? It seems you are hitting this bug.

Sorry, I forgot about this problem: for missing_bucket you have to use the Dev Console.
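A minimal Dev Console request for trying this out could look like the sketch below (assuming a recent stack version; `my-source-index` is a placeholder, and you would paste your full group_by/aggregations into the pivot):

```json
POST _transform/_preview
{
  "source": {
    "index": "my-source-index"
  },
  "pivot": {
    "group_by": {
      "session.geoip.city_name.keyword": {
        "terms": {
          "field": "session.geoip.city_name.keyword",
          "missing_bucket": true
        }
      }
    },
    "aggregations": {
      "session.timestamp.start": {
        "min": { "field": "@timestamp" }
      }
    }
  }
}
```

Because this bypasses the UI editor, the missing_bucket flag is preserved as written.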

Example script to get one non-null value:

    "aggregations": {
      "s": {
        "scripted_metric": {
          "init_script": "state.value = null",
          "map_script": "if (doc['my_field'].size() != 0) { state.value = doc['my_field'].getValue() }",
          "combine_script": "return state.value",
          "reduce_script": "def v; for (s in states) { if (s != null) { v = s } } return v"
        }
      }
    }
Note: there is no guarantee which non-null value is returned, as it depends on shards, distribution of data, index order, etc. However, since you expect the values to be either null or, if not null, identical, this should work.
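That single-field sketch can be extended to collect several geoip fields in one aggregation. The snippet below is an untested assumption of how you might combine them (field names taken from the earlier posts; any fields absent in a document are simply skipped rather than dropping the group):

```json
"session.geoip": {
  "scripted_metric": {
    "init_script": "state.geoip = [:]",
    "map_script": """
      for (f in ['session.geoip.city_name.keyword', 'session.geoip.country_iso_code']) {
        if (doc.containsKey(f) && doc[f].size() != 0) {
          state.geoip[f] = doc[f].value;
        }
      }
    """,
    "combine_script": "return state.geoip",
    "reduce_script": """
      Map merged = [:];
      for (s in states) {
        if (s != null) { merged.putAll(s); }
      }
      return merged;
    """
  }
}
```

The result is a single object containing whichever geoip fields were present, so a session missing e.g. city_name just lacks that key instead of being excluded from the output.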

Correct, I'm using the editor. That explains it, thanks!

Thanks! I actually had done almost exactly as your example but could not figure out why it failed. Now I understand where I went wrong. Again, thanks!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.