Elasticsearch terms aggregation with partition does not honor the “size” value

N is the number of partitions you want to break a problem into.
We use the same algorithm to route a document with an ID to a particular choice of server.
It's a common technique and this doc discusses it.

The challenge with running aggregations on a distributed data store is you might have many unique values and want to limit the chatter required between data nodes. Take, for example, a store of tweets with many data nodes - each holding a month's worth of tweets.
Let's say you wanted to query across them all and look at all the Twitter user handles to see who had tweeted about "covid19" and how many times they had done that. That is likely to be a lot of unique user accounts so to page through them all you'd need to break the task into multiple requests. If you didn't care about the sort order you could simply page through them alphabetically using the composite aggregation with the after parameter (rather than the terms aggregation). Each of the shards can independently agree that a comes before b in the alphabet so they can agree without coordination which are the next set of accounts to return for consideration.
However, if you wanted to look at a list of the first people to tweet about covid19 the task is much harder because you'd want to sort the account IDs by ascending min date. In a distributed store this is very hard without streaming a lot of data between nodes - unlike an alphabetic sort order the data nodes have no idea if @DrFauci or @JoeShmoe should be the next candidate to be returned - they don't know for sure who tweeted first about covid 19. What we can do to make the computation simpler is make each request focus on a small subset (aka "partition") of all the millions of unique terms and sort just those by first-tweet-date. If the partitions and size of results returned are sufficiently small the sorted results from each shard for a request can be fused in a way which should be accurate.
This kind of distributed analytics is made complex in the same way the old fox, chicken and grain transportation problem is made hard by the constraint of a small boat.

I appreciate there's a lot of complex choices here so I developed a wizard you can run to walk through the options and make a choice.

2 Likes