How to add a field that is only present in some documents?

Hi,

I have the following situation: logs with the same ID (grouping ID) but different fields.


grouping ID: 1    organization: xyz    message: abc
grouping ID: 1    organization: xyz    message: abc
grouping ID: 1                                       person: name

I would like to use Transforms (or anything else you suggest, but unfortunately not Logstash) to group all the logs with the same ID and end up with one "enriched" log that contains all the fields with information:

grouping ID: 1    organization: xyz    message: abc   person: name                                                           

So far, I have only had situations where I could use 'group_by' to add fields. However, that is not possible here.

I would appreciate your help!

What is your current transform configuration and its result?

  POST _transform/_preview
  {
    "source": {
      "index": ["[my_index]"],
      "query": {
        "bool": {
          "must": [
            { "exists": { "field": "grouping_ID" } }
          ]
        }
      }
    },
    "pivot": {
      "group_by": {
        "grouping_ID": {
          "terms": { "field": "grouping_ID" }
        }
      },
      "aggregations": {
        "organization": {
          "terms": { "field": "organization.keyword" }
        },
        "message": {
          "terms": { "field": "message.keyword" }
        },
        "person": {
          "terms": { "field": "person.keyword" }
        }
      }
    },
    "description": "test",
    "dest": {
      "index": "[my_new_index]"
    },
    "frequency": "1m",
    "sync": {
      "time": {
        "field": "time.iso8601",
        "delay": "60s"
      }
    }
  }

This is the closest I got: the destination log includes all the needed fields, but I am aware that the "terms" aggregation is not right for the task.
In my previous work with transforms, I could simply use "group_by" to add the needed fields because all my logs included them. If I did that here (for example, if I additionally grouped by organization), all the logs missing that field would be lost. I hope you understand what I'm trying to explain.

Probably I will need a scripted metric aggregation, but I didn't get far with it.

Well, it may be ok if it works for your use-case.

Yes, I see the problem. When you group by a fixed set of fields, the transform requires them to be present. This is working as intended.

If terms doesn't work for you for any reason, then I think a scripted metric aggregation is the way to go. It's a bit more involved, especially if you haven't used the scripting language before, but it gives the greatest flexibility.

It is not working for my use-case. With the terms aggregation I was only able to bring the fields into the destination log, but the field type is unknown, and the values are written in {}, for example { "organizationNameA": 3 }, with the number 3 representing the count of logs containing that value. So this is a bit unexpected.

If your problem is caused by some documents missing the field to group_by, and grouping such documents together is acceptable, one workaround could be to set up an ingest pipeline that fills the missing field with a 'NULL' value.

PUT _ingest/pipeline/set_NULL
{
  "description": "set 'NULL' for missing fields",
  "processors": [
    {
      "set": {
        "field": "organization",
        "value": "NULL",
        "if": "!ctx.containsKey('organization')"
      }
    }
  ]
}

PUT /your_index/_settings
{
  "index": {
    "default_pipeline": "set_NULL"
  }
}

# apply the ingest pipeline to existing documents
POST your_index/_update_by_query
{
  "query":{
    "match_all": {}
  }
}

I think that sharing not only sample data that worked well, but also sample data that didn't, and presenting what the desired output would be, will move the discussion forward.

I made a sample scripted metric aggregation to pick up unique values as an array, something like a "unique values aggregation".
(This script is inspired by this post.)

"unique_organization": {
  "scripted_metric": {
    "init_script": "state.set = new HashSet()",
    "map_script": "if (params['_source'].containsKey(params.field)) { state.set.add(params['_source'][params.field]) }",
    "combine_script": "return state.set",
    "reduce_script": "def ret = new HashSet(); for (s in states) { for (k in s) { ret.add(k); } } return ret",
    "params": {
      "field": "organization"
    }
  }
}

Thank you for your reply. I will try to apply your suggestions and will report the update.

Regarding the following:

I'm sorry if I wasn't clear enough. What I meant was that I cannot use 'group_by' on groupingID, organization, message, and person, because not all logs contain all fields. All I can do is group by groupingID and then "bring" the other fields I need into the final log in some other way. My desired output would be as I wrote in the post: a log that contains all the fields with all the information.

IMO there are 2 more features that you should take into account when designing your solution:

  1. specify person in the group_by clause and set missing_bucket to true so that the buckets without a person field are also returned

You can read more about missing_bucket in the docs: Composite aggregation | Elasticsearch Guide [7.16] | Elastic
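As a sketch of the first suggestion (assuming the field names from this thread, and that person is mapped with a .keyword sub-field), the group_by clause might look something like this:

  "group_by": {
    "grouping_ID": {
      "terms": { "field": "grouping_ID" }
    },
    "person": {
      "terms": {
        "field": "person.keyword",
        "missing_bucket": true
      }
    }
  }

With "missing_bucket": true, documents without a person field fall into a bucket with a null key instead of being dropped from the results.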

  2. specify person in the aggregations clause but use top_metrics aggregation instead of terms.

You can read more about top_metrics in this blog:
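A sketch of the second suggestion, reusing the time.iso8601 field from the transform config above as the sort key (hypothetical choice; any sortable field works): top_metrics returns the field value from the top-sorted document as a plain value, avoiding the { "value": count } map that terms produces.

  "aggregations": {
    "person": {
      "top_metrics": {
        "metrics": { "field": "person.keyword" },
        "sort": { "time.iso8601": "desc" }
      }
    }
  }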


Thanks for the reply. I'm still playing around with it, but your reply contains valuable tips.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.