I have an index like this (simplified):
{
  "entity_id": {"type": "keyword"},
  "attr1": {
    "properties": {
      "attr1_id": {"type": "keyword"},
      "attr1_type": {"type": "keyword"},
      "attr1_value": {"type": "double"}
    }
  },
  "attr2": {
    "properties": {
      "attr2_id": {"type": "keyword"},
      "attr2_type": {"type": "keyword"},
      "attr2_value": {"type": "double"}
    }
  }
}
It is heavily denormalized. In the source data, the relationship between entity_id and attr1/attr2 is one-to-many. In this Elasticsearch index, if an entity_id has N attr1 rows and M attr2 rows, there are N * M documents for that entity_id.
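To make the denormalization concrete, here is a hypothetical entity with N = 2 attr1 rows and M = 2 attr2 rows (all IDs and values are made up, and the attr1_type/attr2_type fields are omitted):

{"entity_id": "e1", "attr1": {"attr1_id": "a1", "attr1_value": 10.0}, "attr2": {"attr2_id": "b1", "attr2_value": 5.0}}
{"entity_id": "e1", "attr1": {"attr1_id": "a1", "attr1_value": 10.0}, "attr2": {"attr2_id": "b2", "attr2_value": 7.0}}
{"entity_id": "e1", "attr1": {"attr1_id": "a2", "attr1_value": 20.0}, "attr2": {"attr2_id": "b1", "attr2_value": 5.0}}
{"entity_id": "e1", "attr1": {"attr1_id": "a2", "attr1_value": 20.0}, "attr2": {"attr2_id": "b2", "attr2_value": 7.0}}

Each attr1 row is repeated M = 2 times, so a naive sum over attr1_value would return 60 instead of the intended 10 + 20 = 30.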
I am trying to write a query that sums all of the attr1_value fields but ignores the duplicates. I first tried a query like this:
{
  "query": {...},
  "aggs": {
    "total_value_buckets": {
      "terms": {
        "field": "attr1.attr1_id"
      },
      "aggs": {
        "attr1_value_top": {
          "max": {"field": "attr1.attr1_value"}
        }
      }
    },
    "total_value": {
      "sum_bucket": {
        "buckets_path": "total_value_buckets>attr1_value_top"
      }
    }
  }
}
The attr1_value_top aggregation deduplicates the denormalized copies of attr1: all of the attr1_value fields for a given attr1.attr1_id are identical, so I just grab one of them with max(). This all appears to be working correctly.
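For illustration, against the hypothetical documents above the trimmed aggregation response would look roughly like this (response fields other than the bucket keys and values omitted):

{
  "aggregations": {
    "total_value_buckets": {
      "buckets": [
        {"key": "a1", "doc_count": 2, "attr1_value_top": {"value": 10.0}},
        {"key": "a2", "doc_count": 2, "attr1_value_top": {"value": 20.0}}
      ]
    },
    "total_value": {"value": 30.0}
  }
}

With only two distinct attr1_id values everything fits in the default bucket limit, which is why the total looks right here.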
The problem is that total_value does not sum up all of the buckets; it only sums the top 10 buckets that the terms aggregation happens to return (its default size). I'd like it to sum across all the buckets, including the ones that don't get returned.
If it makes it clearer, this would be the equivalent SQL:
SELECT SUM(top_attr1_value) AS total_attr1_value
FROM (
SELECT attr1_id, MAX(attr1_value) AS top_attr1_value
FROM the_index
GROUP BY attr1_id
) intermediate
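The only workaround I can see is to raise size on the terms aggregation so that (hopefully) every attr1_id gets its own bucket, along the lines of the sketch below (the 1000 is an arbitrary placeholder, not a real bound in my data):

{
  "query": {...},
  "aggs": {
    "total_value_buckets": {
      "terms": {
        "field": "attr1.attr1_id",
        "size": 1000
      },
      "aggs": {
        "attr1_value_top": {
          "max": {"field": "attr1.attr1_value"}
        }
      }
    },
    "total_value": {
      "sum_bucket": {
        "buckets_path": "total_value_buckets>attr1_value_top"
      }
    }
  }
}

But that requires guessing an upper bound on the number of distinct attr1_id values, so I'm hoping there is a way to make the pipeline aggregation sum across all buckets without it.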