Aggregate over top hits ElasticSearch

haych1702 · May 23, 2019, 7:51am

My documents are structured in the following way:

{
   "chefInfo": {
      "id": int,
      "employed": String,
      ... Some more recipe information ...
   },
   "recipe": {
      ... Some recipe information ...
   }
}

If a chef has multiple recipes, the nested chefInfo block will be identical in each document. My problem is that I want to aggregate on a field in the chefInfo part of the document, but the aggregation doesn't account for the fact that the chefInfo block is duplicated across multiple documents.

So, if the chef with an id of 1 appears on 5 recipes and I aggregate on the employed field, this chef accounts for 5 of the counts in the aggregation, whereas I want them to count only once.

I thought about doing a top_hits aggregation on the chef_id and then a sub-aggregation over all of the buckets, but I can't work out how to count over the results of all the buckets.

Is what I want to do possible?

Mark_Harwood · May 23, 2019, 11:18am

If you have a single shard you can use the diversified_sampler aggregation.

Diversify on the chefInfo.id field and set max_docs_per_value to 1. Any nested aggregations positioned underneath that will consider at most one doc per chef (per shard).
You may need a large size setting to consider all chefs.
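
As a rough sketch of that suggestion, the search request body could look something like the following, written here as a Python dict. The aggregation names (one_doc_per_chef, by_employed) and the shard_size value are made up for illustration, and it assumes chefInfo.employed is mapped as a keyword field. As far as I know, the "size" setting mentioned above is called shard_size on this aggregation.

```python
import json

# Hypothetical body for a _search request; field names follow the question's
# document layout, aggregation names and shard_size are illustrative only.
request_body = {
    "size": 0,  # only the aggregation results are wanted, not the hits
    "aggs": {
        "one_doc_per_chef": {
            "diversified_sampler": {
                "field": "chefInfo.id",   # diversify on the chef id
                "max_docs_per_value": 1,  # at most one doc per chef, per shard
                "shard_size": 10000,      # assumed to exceed the chefs on a shard
            },
            "aggs": {
                # Sees at most one sampled document per chef (per shard).
                "by_employed": {"terms": {"field": "chefInfo.employed"}}
            },
        }
    },
}

print(json.dumps(request_body, indent=2))
```

With more than one shard, the same chef can still be sampled once on each shard, which is why this only deduplicates reliably on a single-shard index.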

haych1702 · May 23, 2019, 11:58am

We use multiple shards.

I have had some feedback that a cardinality aggregation might do the job. What do you think about using one?

Mark_Harwood · May 23, 2019, 12:03pm

What are you trying to aggregate on the chefs?
If it’s “total years in the biz” then summing values with the ‘sum’ agg will be wrong because of duplicates.
If it’s counting the schools they trained in, then the ‘cardinality’ agg will be unaffected by duplicates.
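
To make the difference concrete, here is a small plain-Python sketch (not Elasticsearch itself). The years_in_biz and school fields are hypothetical stand-ins for the two examples above, with the chef data repeated on every recipe document.

```python
# Hypothetical per-recipe documents; chef 1 appears twice, so the chef fields repeat.
docs = [
    {"chef_id": 1, "years_in_biz": 10, "school": "A"},
    {"chef_id": 1, "years_in_biz": 10, "school": "A"},
    {"chef_id": 2, "years_in_biz": 4, "school": "B"},
]

# Summing over documents counts chef 1 twice.
naive_sum = sum(d["years_in_biz"] for d in docs)

# Summing over one entry per chef gives the intended total.
years_per_chef = {d["chef_id"]: d["years_in_biz"] for d in docs}
deduplicated_sum = sum(years_per_chef.values())

# A distinct count is the same no matter how often a value repeats.
distinct_schools = len({d["school"] for d in docs})

print(naive_sum, deduplicated_sum, distinct_schools)  # 24 14 2
```

Note that the Python set gives an exact distinct count, whereas the Elasticsearch cardinality agg is approximate for large numbers of distinct values.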

haych1702 · May 23, 2019, 2:24pm

So if there were 3 recipes:

[
  {
    "chef": {
      "id": 1,
      "status": "Employed"
    },
    "recipe": {
      "name": "toast"
    }
  },
  {
    "chef": {
      "id": 1,
      "status": "Employed"
    },
    "recipe": {
      "name": "eggs"
    }
  },
  {
    "chef": {
      "id": 2,
      "status": "Unemployed"
    },
    "recipe": {
      "name": "sausages"
    }
  }
]

I want to be able to aggregate on status. I should get the following result:

{
   "employed": 1,
   "unemployed": 1
}

Because the first two recipes belong to the same chef and employed should only count once.

Mark_Harwood · May 23, 2019, 2:27pm

OK. So that would be a terms agg on the chef.status field with a child cardinality agg on the chef.id field (assuming chefs don't change status after a particularly bad recipe).
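
A minimal sketch of that request body, again as a Python dict, followed by a plain-Python imitation of what it computes on the three sample recipes. The aggregation names (by_status, unique_chefs) and the helper function are invented for illustration, and it assumes chef.status is a keyword field.

```python
from collections import defaultdict

# Hypothetical _search body: one bucket per status, each with a distinct chef count.
request_body = {
    "size": 0,
    "aggs": {
        "by_status": {
            "terms": {"field": "chef.status"},
            "aggs": {
                "unique_chefs": {"cardinality": {"field": "chef.id"}}
            },
        }
    },
}

# The three sample recipes from the post above.
docs = [
    {"chef": {"id": 1, "status": "Employed"}, "recipe": {"name": "toast"}},
    {"chef": {"id": 1, "status": "Employed"}, "recipe": {"name": "eggs"}},
    {"chef": {"id": 2, "status": "Unemployed"}, "recipe": {"name": "sausages"}},
]


def unique_chefs_per_status(documents):
    """Imitate terms(chef.status) > cardinality(chef.id) with exact sets."""
    chef_ids = defaultdict(set)
    for doc in documents:
        chef_ids[doc["chef"]["status"]].add(doc["chef"]["id"])
    return {status: len(ids) for status, ids in chef_ids.items()}


result = unique_chefs_per_status(docs)
print(result)  # {'Employed': 1, 'Unemployed': 1}
```

In the real response, each status bucket would still report a doc_count that includes the duplicates (2 for Employed here); the deduplicated figure is the value of the nested cardinality agg.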

haych1702 · May 24, 2019, 7:12am

That is great. Thank you.

