I am trying to understand how sorting on the sub aggregation works and how accurate it is when I have millions of documents in multiple shards.
For terms aggregation, when the "size" parameter is specified, it selects the top N terms in each shard and then these are combined to return the final top N terms and their count.
When we sort by metrics sub aggregation, I am trying to understand how does it select terms in each shard. will it collect top N terms ordered by sub aggregation count from each shard? Or it just collects the top N terms in each shard irrespective of the sub aggregation count and then sort them by sub aggregation count?
See shard_size settings for how accuracy can be tuned. It gives you the control to determine how many terms to consider from each shard in interim results rather than the size of the final result. It helps reduce the potential for error at the expense of larger memory usage. You can ask for details on the error margin of a search.
Note that for the sorts of aggregations it looks like you are trying to do it might make sense to consider partitioning your analysis of the data using this new terms agg feature in 5.2,
In each shard, will it select the top 100 "ItemId" bucket (irrespective of their sum sub-aggregate on "Quantity") and then sort it by sum sub-aggregate on "Quantity" ? Or in each shard it will actually select the "ItemId" buckets that have top 100 sum sub-aggregation value (even if the "ItemId" count is not in top 100 i.e. the "ItemId" count is less but they have big "Quantity" value) ?
The latter - but obviously shard-local selections are only using shard-local values for quantities. The merged results may understate true global quantities if a shard has not included results for a particular item (e.g. could happen if an item ranks highly on most shards but very poorly on one)