Hi Prashant.
Yes, this can get complicated in a distributed system. You're on the right track, but watch out for non-zero values of doc_count_error_upper_bound in the results. If that happens, consider increasing the shard_size setting to trade RAM for accuracy.
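As a rough sketch of that check in Python: the first helper builds a terms aggregation with an explicit shard_size, and the second lists the aggregations in a response whose doc_count_error_upper_bound is non-zero. The field name "user_id", the aggregation name "by_user", and the sample response are made up for illustration; only size, shard_size and doc_count_error_upper_bound come from Elasticsearch itself.

```python
def terms_agg(field, size=10, shard_size=None):
    """Build a terms aggregation body, optionally with an explicit shard_size."""
    terms = {"field": field, "size": size}
    if shard_size is not None:
        # A larger shard_size makes each shard return more candidate terms:
        # more memory used, but more accurate counts after the merge.
        terms["shard_size"] = shard_size
    return {"terms": terms}


def inexact_aggs(aggregations):
    """Return names of top-level aggregations reporting a non-zero error bound."""
    return [
        name
        for name, result in aggregations.items()
        if result.get("doc_count_error_upper_bound", 0) > 0
    ]


# Hypothetical request body and hand-written response fragment, for illustration only.
request_body = {"size": 0, "aggs": {"by_user": terms_agg("user_id", size=10, shard_size=100)}}
sample_aggregations = {
    "by_user": {
        "doc_count_error_upper_bound": 7,
        "sum_other_doc_count": 1234,
        "buckets": [{"key": "u1", "doc_count": 50}],
    }
}
flagged = inexact_aggs(sample_aggregations)
```

If flagged comes back non-empty, that is the cue to rerun the query with a higher shard_size and see whether the error bound drops to zero.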
I've got a wizard to help with picking the right strategy for various grouping questions.
https://plnkr.co/edit/iJSFP8eRrhC7l7Hx2XOL?p=preview
The path I took for your example was as follows:
