Nested aggregations and dealing with doublecounting


(Alex Roytman) #1

I created an issue https://github.com/elastic/elasticsearch/issues/16380 if you are interested. Here is the problem and few possible solutions. I would appreciate your thoughts .

Alex-

Let me start with an example to explain the issue which is a major pain point for us when using nested aggregations.

Imagine having Teams that handle Requests of certain Priority. Each request may have multiple teams assigned to it and each team request assignment has number of hours allocated for this team on this request

request:{
  priority:1,
  teams:[
    {name:'team1', hours:10},
    {name:'team2', hours:20},
  ]
}

I need a report that gives me number of hours by team by priority. Doing nested aggregation on teams.name and then reverse nested to group on priority and then nested to sum teams.hours double-counts hours because second nesting on teams knows nothing about upstream nesting as it is executed in context of request and as result it will lump hours for each team on request under the top level team aggregation.

Please do not suggest :slight_smile: that I can aggregate on Priority first and then on Teams and then flip/collate my nested result. Yes I can but this example is a small portion of a report full of similar cases not to mention that it would not work in our generic solution where users can pick and choose any group by and metrics at any level of nesting

A very practical, generic, concise and easy to use solution would be to allow aggregating on properties of owner document(s) from a nested aggregation without having to change context back to parent document buy using reverse nested aggregation. In our example as simple as doing terms sub aggregation on "../priority" from teams nested document context.

Another solution would be to support filtering expressions that are based on upstream aggregation values so that i can filter lowest teams nested aggregation calculating hours based on top level team aggregation. it could be useful for many scenarious but for this one it is a bit artificial. I think it would be a great capability to be able to use parent agg bukets as input for child bucket filters etc

Another, and in my opinion pretty powerful solution would be ability to flatten/de-normalize nested aggregation (like Cartesian join) prior to applying filters/aggregations. This will be much more useful in very many cases than aggregation on nested data as it is available now.

Sorry for a rant below please do read it:-)

As a person who has to deal with lots of nested data and produce complicated analytic on it I can tell you that this is the single most difficult part. After promoting elastic heavily instead of traditional star-schema based data-mart we suffer a lot trying to produce reports on nested data. One of selling point was that we can store complex data in its natural form and filter/search it, report/export it and also produce analytic from it without creating dozens of fragmented single purpose de-normalized data-marts. As far as searching it works fairly well except for a similar issue of not being able to filter nested data when one of the criteria is on that nested data element and users do not wish to see nested records that do not match the criteria (this would have happened naturally and for free with relational database). With aggregation it is even worse - we cant escape this kind of double counting without all sorts of shenanigans.
If only handling of nested fields were richer we could have gotten much further with it. You may say de-normalize all your nested relationships into separate indices. yes it could be done but it defeats the purpose and given lack of joins it will lead to enormous data duplication and won't be a generic solution anyways. I would love if elastic could do de-normalization on the fly on specific nested fields and apply filtering/aggregation logic to the denormalized dataset

It (the nested de-normalization) would also be extremely useful for plain search scenario particularly when combined with ability to return selected snippets of both owner and nested objects


(system) #2