{
"chefInfo": {
"id": int,
"employed": String
... Some more chef information ...
},
"recipe": {
... Some recipe information ...
}
}
If a chef has multiple recipes, the nested chefInfo block will be identical in each document. My problem is that I want to aggregate on a field in the chefInfo part of the document, but the aggregation doesn't account for the fact that the chefInfo block is duplicated across multiple documents.
So, if the chef with an id of 1 is on 5 recipes and I am aggregating on the employed field, this particular chef will account for 5 of the counts in the aggregation, whereas I want them to count only once.
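To make the double counting concrete, here is a small in-memory sketch in plain Python (no Elasticsearch involved). The sample documents and the "yes"/"no" values for employed are made up for illustration; only the document shape comes from the mapping above.

```python
from collections import Counter

# Hypothetical sample data: one document per recipe, with the same
# chefInfo block repeated on every recipe that chef appears on.
docs = [
    {"chefInfo": {"id": 1, "employed": "yes"}, "recipe": {"name": f"recipe-{i}"}}
    for i in range(5)
] + [
    {"chefInfo": {"id": 2, "employed": "no"}, "recipe": {"name": "recipe-5"}},
]

# Naive count: every recipe document contributes, so chef 1 is counted 5 times.
per_document = Counter(d["chefInfo"]["employed"] for d in docs)

# Desired count: collapse to one entry per chef id first, then count.
one_per_chef = {d["chefInfo"]["id"]: d["chefInfo"]["employed"] for d in docs}
per_chef = Counter(one_per_chef.values())

print("per document:", dict(per_document))
print("per chef:", dict(per_chef))
```

The second count is the one I am after: each chef contributes once, however many recipes they have.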
I thought about doing a top_hits aggregation on the chef_id and then running a sub-aggregation over all of the buckets, but I can't work out how to do the counts over the results of all the buckets.
Look at the diversified_sampler aggregation: diversify on the chefInfo.id field and set max_docs_per_value to 1. Any aggregations nested underneath it will consider at most one doc per chef (per shard).
You may need a large size setting (shard_size) to consider all chefs.
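The suggestion above matches Elasticsearch's diversified_sampler aggregation. A rough sketch of the request body, built as a Python dict, might look like the following. The aggregation names, the helper function, and the shard_size value of 10000 are placeholders, and the terms agg assumes chefInfo.employed is mapped as a keyword field (otherwise a .keyword sub-field may be needed).

```python
import json


def diversified_employed_query(shard_size=10000):
    """Build a search body that samples at most one doc per chefInfo.id
    on each shard, then counts the sampled docs by chefInfo.employed."""
    return {
        "size": 0,  # only the aggregation results are needed, not the hits
        "aggs": {
            "one_doc_per_chef": {
                "diversified_sampler": {
                    "field": "chefInfo.id",
                    "max_docs_per_value": 1,
                    # Assumed value: must be large enough to cover every chef
                    # on a shard, or some chefs are dropped from the sample.
                    "shard_size": shard_size,
                },
                "aggs": {
                    "by_employed": {"terms": {"field": "chefInfo.employed"}}
                },
            }
        },
    }


body = diversified_employed_query()
print(json.dumps(body, indent=2))
```

Because the de-duplication happens per shard, a chef whose recipes are spread over several shards can still be counted more than once unless the documents are routed by chef id.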
What are you trying to aggregate on the chefs?
If it’s “total years in the biz” then summing values with the ‘sum’ agg will be wrong because of duplicates.
If it’s counting the schools they trained in, then the ‘cardinality’ agg will be unaffected by duplicates.
OK. So that would be a terms agg on the chef.status field with a child cardinality agg on the chef.id field (assuming chefs don't change status after a particularly bad recipe).
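As a sketch of that last suggestion, the request body below uses the field names from the original question (chefInfo.employed and chefInfo.id) in place of chef.status and chef.id. The aggregation names and both helper functions are made up for illustration. Note that the cardinality agg returns an approximate count, which is normally exact enough at low cardinalities.

```python
def employed_distinct_chefs_query():
    """Terms agg on the employed field, with a child cardinality agg that
    counts distinct chef ids in each bucket instead of recipe documents."""
    return {
        "size": 0,
        "aggs": {
            "by_employed": {
                "terms": {"field": "chefInfo.employed"},
                "aggs": {
                    "distinct_chefs": {"cardinality": {"field": "chefInfo.id"}}
                },
            }
        },
    }


def chefs_per_status(response):
    """Map each employed value to its distinct-chef count, ignoring doc_count."""
    buckets = response["aggregations"]["by_employed"]["buckets"]
    return {b["key"]: b["distinct_chefs"]["value"] for b in buckets}


# Hand-written response in the shape Elasticsearch returns for this request;
# the numbers are illustrative, not real output.
sample_response = {
    "aggregations": {
        "by_employed": {
            "buckets": [
                {"key": "yes", "doc_count": 5, "distinct_chefs": {"value": 1}},
                {"key": "no", "doc_count": 1, "distinct_chefs": {"value": 1}},
            ]
        }
    }
}

print(chefs_per_status(sample_response))
```

The doc_count on each bucket still reflects the number of recipe documents; the per-chef figure is the distinct_chefs value.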