How to add a field that is only present in some documents?

Hi,

I have the following situation: logs with the same ID (grouping ID) but different fields.


grouping ID: 1    organization: xyz    message: abc
grouping ID: 1    organization: xyz    message: abc
grouping ID: 1                                       person: name

I would like to use Transforms (or anything else you suggest, but unfortunately not Logstash) to group all the logs with the same ID and end up with one "enriched" log that contains all the fields with information:

grouping ID: 1    organization: xyz    message: abc   person: name                                                           

So far, I have only had situations where I could use 'group_by' to add fields. However, that is not possible here.

I would appreciate your help!

What is your current transform configuration and its result?

  POST _transform/_preview
  {
    "source": {
      "index": ["[my_index]"],
      "query": {
        "bool": {
          "must": [
            { "exists": { "field": "grouping_ID" } }
          ]
        }
      }
    },
    "pivot": {
      "group_by": {
        "grouping_ID": {
          "terms": { "field": "grouping_ID" }
        }
      },
      "aggregations": {
        "organization": {
          "terms": { "field": "organization.keyword" }
        },
        "message": {
          "terms": { "field": "message.keyword" }
        },
        "person": {
          "terms": { "field": "person.keyword" }
        }
      }
    },
    "description": "test",
    "dest": {
      "index": "[my_new_index]"
    },
    "frequency": "1m",
    "sync": {
      "time": {
        "field": "time.iso8601",
        "delay": "60s"
      }
    }
  }

This is the closest I got: the destination log includes all the needed fields, but I am aware that the "terms" aggregation is not right for the task.
In my previous work with transforms, I could simply use "group_by" to add the needed fields because all my logs included them. If I did that here (for example, if I additionally grouped by organization), all the logs missing that field would be lost. I hope you understand what I'm trying to explain.

Probably I will need a scripted metric aggregation, but I didn't get far with it.

Well, it may be ok if it works for your use-case.

Yes, I see the problem. When you group by a fixed set of fields, the transform requires them to be present. This is working as intended.

If terms doesn't work for you for any reason, then I think a scripted metric aggregation is the way to go. It's a bit more involved, especially if you haven't used the scripting language before, but it gives the greatest flexibility.

It is not working for my use-case. With the terms aggregation I was only able to bring the fields into the destination log, but the field type is unknown, and the values are written in {}, for example { "organizationNameA": 3 }, with the number 3 representing the count of logs containing that value. So this is a bit unexpected.

If your problem is caused by some documents missing the field to group_by, and grouping such documents together is acceptable, one workaround could be to set up an ingest pipeline that fills the missing field with a 'NULL' value.

PUT _ingest/pipeline/set_NULL
{
  "description": "set 'NULL' for missing fields",
  "processors": [
    {
      "set": {
        "field": "organization",
        "value": "NULL",
        "if": "!ctx.containsKey('organization')"
      }
    }
  ]
}

PUT /your_index/_settings
{
  "index": {
    "default_pipeline": "set_NULL"
  }
}

# apply the ingest pipeline to existing documents
POST your_index/_update_by_query
{
  "query":{
    "match_all": {}
  }
}

I think that sharing not only sample data that worked well, but also sample data that didn't, and presenting what the desired output would be, will move the discussion forward.

I made a sample scripted metric aggregation to pick up unique values as an array, something like a "unique values aggregation".
(This script is inspired by this post.)

"unique_organization": {
  "scripted_metric": {
    "init_script": "state.set = new HashSet()",
    "map_script": "if (params['_source'].containsKey(params.field)) { state.set.add(params['_source'][params.field]) }",
    "combine_script": "return state.set",
    "reduce_script": "def ret = new HashSet(); for (s in states) { for (k in s) { ret.add(k); } } return ret",
    "params": {
      "field": "organization"
    }
  }
}

Thank you for your reply. I will try to apply your suggestions and will report the update.

Regarding the following:

I'm sorry if I wasn't clear enough. What I meant was that I cannot use 'group_by' on groupingID, organization, message, and person, because not all logs contain all fields. All I can do is group by groupingID and then "bring" the other fields I need into the final log in some other way. My desired output would be as I wrote in the post: a log that contains all the fields with all the information.

IMO there are 2 more features that you should take into account when designing your solution:

  1. specify person in the group_by clause and set missing_bucket to true so that the buckets without a person field are also returned

You can read more about missing_bucket in the docs: Composite aggregation | Elasticsearch Guide [7.16] | Elastic
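As a sketch of the first suggestion (assuming the field names from this thread, and that person is mapped with a .keyword sub-field), the group_by clause might look something like this:

  "group_by": {
    "grouping_ID": {
      "terms": { "field": "grouping_ID" }
    },
    "person": {
      "terms": {
        "field": "person.keyword",
        "missing_bucket": true
      }
    }
  }

With "missing_bucket": true, documents without a person field fall into a bucket with a null key instead of being dropped from the results.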

  2. specify person in the aggregations clause but use top_metrics aggregation instead of terms.

You can read more about top_metrics in this blog:
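A sketch of the second suggestion, reusing the time.iso8601 field from the transform config above as the sort key (hypothetical choice; any sortable field works): top_metrics returns the field value from the top-sorted document as a plain value, avoiding the { "value": count } map that terms produces.

  "aggregations": {
    "person": {
      "top_metrics": {
        "metrics": { "field": "person.keyword" },
        "sort": { "time.iso8601": "desc" }
      }
    }
  }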


Thanks for the reply. I'm still playing around with it, but your reply contains valuable tips.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.