Aggregate over top hits in Elasticsearch

My documents are structured in the following way:

{
   "chefInfo": {
      "id": int,
      "employed": String,
      ... Some more chef information ...
   },
   "recipe": {
      ... Some recipe information ...
   }
}

If a chef has multiple recipes, the nested chefInfo block will be identical in each document. My problem is that I want to aggregate on a field in the chefInfo part of the document, but the aggregation doesn't account for the fact that the chefInfo block is duplicated across multiple documents.

So, if the chef with id 1 is on 5 recipes and I aggregate on the employed field, this particular chef accounts for 5 of the counts in the aggregation, whereas I want each chef to be counted only once.

I thought about doing a top_hits aggregation on the chef_id and then running a sub-aggregation over all of the buckets, but I can't work out how to do the counts over the results of all the buckets.

Is what I want to do possible?

If you have a single shard you can use the diversified_sampler aggregation.

Diversify on the chefInfo.id field and set max_docs_per_value to 1. Any nested aggregations positioned underneath that will consider at most one doc per chef (per shard).
You may need a large size setting to consider all chefs.
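As a rough sketch of that suggestion, the request body might look something like the one below, written as a Python dict so it can be dumped to JSON. The aggregation names `one_doc_per_chef` and `by_employed` are made up for illustration, `shard_size` is assumed to be the "size" setting meant above, and the value 10000 is arbitrary.

```python
import json


def build_diversified_request(shard_size=10000):
    """Build a search body that samples at most one doc per chef per shard."""
    return {
        "size": 0,  # only the aggregation results are needed, not the hits
        "aggs": {
            "one_doc_per_chef": {  # hypothetical aggregation name
                "diversified_sampler": {
                    "field": "chefInfo.id",
                    "max_docs_per_value": 1,
                    # assumption: large enough to cover every chef on a shard
                    "shard_size": shard_size,
                },
                "aggs": {
                    "by_employed": {  # hypothetical aggregation name
                        "terms": {"field": "chefInfo.employed"}
                    }
                },
            }
        },
    }


print(json.dumps(build_diversified_request(), indent=2))
```

Because the deduplication happens per shard, a chef whose recipes are spread over several shards can still be counted more than once, which is why this only works cleanly with a single shard.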

We use multiple shards.

I have had some feedback that a cardinality aggregation might do the job. What do you think about using one?

What are you trying to aggregate on the chefs?
If it's "total years in the biz", then summing values with the 'sum' agg will be wrong because of duplicates.
If it's counting the schools they trained in, then the 'cardinality' agg will be unaffected by duplicates.
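To make the difference concrete, here is a small local simulation in plain Python (no Elasticsearch involved). The `years` and `school` fields and their values are invented for the example; they are not part of the documents described in this thread.

```python
# Hypothetical data: chef 1 appears on two recipe documents, chef 2 on one.
docs = [
    {"chef": {"id": 1, "years": 10, "school": "A"}},
    {"chef": {"id": 1, "years": 10, "school": "A"}},
    {"chef": {"id": 2, "years": 3, "school": "B"}},
]


def sum_years(documents):
    """Behaves like a 'sum' agg: every document contributes, duplicates included."""
    return sum(d["chef"]["years"] for d in documents)


def distinct_schools(documents):
    """Behaves like a 'cardinality' agg: duplicate values collapse into one."""
    return len({d["chef"]["school"] for d in documents})


print(sum_years(docs))         # 23, although the two chefs only have 13 years between them
print(distinct_schools(docs))  # 2
```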

So if there were 3 recipes:

[
  {
    "chef": {
      "id": 1,
      "status": "Employed"
    },
    "recipe": {
      "name": "toast"
    }
  },
  {
    "chef": {
      "id": 1,
      "status": "Employed"
    },
    "recipe": {
      "name": "eggs"
    }
  },
  {
    "chef": {
      "id": 2,
      "status": "Unemployed"
    },
    "recipe": {
      "name": "sausages"
    }
  }
]

I want to be able to aggregate on status. I should get the following result:

{
   "employed": 1,
   "unemployed": 1
}

This is because the first two recipes belong to the same chef, so employed should only be counted once.

OK. So that would be a terms agg on the chef.status field with a child cardinality agg on the chef.id field (assuming chefs don't change status after a particularly bad recipe)
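A minimal sketch of what that could look like, using the three recipe documents from the example above. The request body is written as a Python dict, the aggregation names `by_status` and `unique_chefs` are made up, and the helper function is only a local stand-in that mimics the aggregation so the expected numbers can be checked without a cluster.

```python
from collections import defaultdict

# Request body for a terms agg with a child cardinality agg.
# "by_status" and "unique_chefs" are hypothetical names.
request_body = {
    "size": 0,  # only aggregation results are needed
    "aggs": {
        "by_status": {
            "terms": {"field": "chef.status"},
            "aggs": {
                "unique_chefs": {"cardinality": {"field": "chef.id"}}
            },
        }
    },
}

# The three recipe documents from the example.
docs = [
    {"chef": {"id": 1, "status": "Employed"}, "recipe": {"name": "toast"}},
    {"chef": {"id": 1, "status": "Employed"}, "recipe": {"name": "eggs"}},
    {"chef": {"id": 2, "status": "Unemployed"}, "recipe": {"name": "sausages"}},
]


def distinct_chefs_per_status(documents):
    """Local stand-in: group by chef.status, then count distinct chef.id values."""
    chef_ids = defaultdict(set)
    for doc in documents:
        chef_ids[doc["chef"]["status"]].add(doc["chef"]["id"])
    return {status: len(ids) for status, ids in chef_ids.items()}


print(distinct_chefs_per_status(docs))  # {'Employed': 1, 'Unemployed': 1}
```

In the real response, the number to read from each bucket would be the `value` of the `unique_chefs` sub-aggregation rather than the bucket's `doc_count`, which still counts recipe documents. Two caveats: the cardinality agg is approximate for large numbers of distinct values, and if `chef.status` is mapped as a text field the terms agg will probably need a keyword sub-field such as `chef.status.keyword` (an assumption about the mapping, which isn't shown here).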

That is great. Thank you. :grinning:
