Nested aggregation count seems incorect

ES v2.1.1

I have a data structure containing embedded history, where a docs history is an array of history objects, each of which contains a number of fields.

Unfortunately the avail space for an issue here is limited to 5000 chars - so I had to resort to external links! This is a real shame as these will disappear in time ...

Map extract

I need to create aggregate queries with conditions on the parent doc as well as on multiple history fields.

Query

Query Result

This query comes close to the desired result. My data contains the following:

11 docs which contain only a history entry with action_name implement
21 docs which contain only a history entry with action_name submit
23 docs which contain both an implement and a submit history entry.

Hence a search for just submited items should (and does) yield 21 + 23 = 44 hits
A search for just implemented items should (and does) yield 11 + 23 = 34 hits

However, when querying on both aggregating on action_name as per the given query,
the bucket doc_counts contain 55/56 for submit (should be 44) , 34 for implement (correct).

I do not understand the 55/56 discrepancy, and also noted that 55 is actually
the sum of all docs (11 + 21 + 23).

So it seems the count for submit results is actually the sum for implement and submit ??

Hopefully this is a problem in my query - can anyone point me to it?

And as anticipated the issue was in the query. The entire nested sub-query must be repeated in the aggregation as a filter - as shown below, now results are as expected!

BTW - providing nicely formatted (indented) json here consumes all my available post-characters in white-space!

{"query":{"filtered":{"filter":{"bool":{"must":[{"term":{"domain.acronym":"dom"}},{"bool":{"must":[{"term":{"company.name":"abc"}},{"bool":{"must":[{"term":{"severity":"important"}},{"nested":{"path":"history","inner_hits":{"size":50,"name":"history"},"query":{"bool":{"must":[{"terms":{"history.action_name":["implement","submit"]}},{"bool":{"must":[{"range":{"history.created_at":{"gte":"2014-04"}}},{"range":{"history.created_at":{"lt":"2016-07"}}}]}}]}}}}]}}]}}]}}}},"size":0,"aggs":{"company.name":{"terms":{"size":0,"field":"company.name.raw"},"aggs":{"severity":{"terms":{"size":100,"field":"severity.raw"},"aggs":{"history":{"nested":{"path":"history"},"aggs":{"by_action":{"filter":{"bool":{"must":[{"terms":{"history.action_name":["implement","submit"]}},{"bool":{"must":[{"range":{"history.created_at":{"gte":"2014-04"}}},{"range":{"history.created_at":{"lt":"2016-07"}}}]}}]}},"aggs":{"my_name":{"terms":{"field":"history.action_name","size":100},"aggs":{"container":{"reverse_nested":{},"aggs":{"stats_number":{"stats":{"field":"number"}}}}}}}}}}}}}}}}

Maybe you want to use the reverse_nested aggregation in order to count top-level documents rather than inner documents?

Thanks for your comment jpountz. Any advice is welcome! The thinking of the current query is:

Queries are built by a GUI. Queries may be aggregated or not. The first part of the query (excl the aggs) is always the same regardless whether aggregated or not. In the case where they are not - the inner_hits on the nested history items is used to obtain the array offsets of the history entries (needed to retrieve & display only the matched hits).

In the aggregation case there already is a reverse_nested clause to obtain parent stats.

Did I miss something perhaps?