How to sum across all intermediate aggregate buckets?

I have an index like this (simplified):

{
    "entity_id": {"type": "keyword"},
    "attr1": {
        "properties": {
            "attr1_id": {"type": "keyword"},
            "attr1_type": {"type": "keyword"},
            "attr1_value": {"type": "double"}
        }
    },
    "attr2": {
        "properties": {
            "attr2_id": {"type": "keyword"},
            "attr2_type": {"type": "keyword"},
            "attr2_value": {"type": "double"},
        }
    }
}

It is heavily denormalized. In the source data, the relationship between entity_id and attr1/attr2 is one-to-many. In this Elasticsearch index, if an entity_id has N attr1 and M attr2 there is a total of N * M documents for that entity_id.

I am trying to do a query that sums all of the attr1_value but ignores duplicates. I first tried a query like this:

{
    "query": {...},
    "aggs": {
        "total_value_buckets": {
            "terms": {
                "field": "attr1.attr1_id"
            },
            "aggs": {
                "attr1_value_top": {
                    "max": {"field": "attr1.attr1_value"}
                }
            }
        },
        "total_value": {
            "sum_bucket": {
                "buckets_path": "total_value_buckets>attr1_value_top"
            }
        }
    }
}

The attr1_value_top deduplicates the denormalized copies of attr1. All of the attr1_value fields for a given attr1.attr1_id are identical so I just grab one with max(). And this all appears to be working correctly.

The problem is that the total_value does correctly sum up all of the buckets, it only sums the top 10 buckets that the query happens to return at first. I'd like it to sum up across all the buckets, including the ones that don't get returned.

If it makes it clearer, this would be the equivalent SQL:

SELECT SUM(top_attr1_value) AS total_attr1_value
FROM (
    SELECT attr1_id, MAX(attr1_value) AS top_attr1_value
    FROM the_index
    GROUP BY attr1_id
) intermediate

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.