Thanks, that's very useful background.
That is challenging. When data is distributed, the terms aggregation can normally be pretty accurate on fields where the cardinality is low, e.g. finding the most popular browser used in weblogs. This is because each shard can identify the top contenders reasonably accurately and independently - all shards would easily identify Chrome, Safari etc. as the top items and return counts that are then combined. If you wanted the top 5 browsers, each shard would return a larger number of results (`shard_size`, which defaults to the requested size of 5 multiplied by a small fudge-factor). For the sake of this example each shard may return stats for just their top 10 results, which are then fused and sorted to give the final top 5.
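To make that concrete, here's a minimal sketch of the "top browsers" case. It assumes a weblogs index with a keyword field called `browser` and the 8.x-style Python client - the index and field names are just placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="weblogs",
    size=0,  # we only want the aggregation, not the hits
    aggs={
        "top_browsers": {
            "terms": {
                "field": "browser",
                "size": 5,         # final top 5 after merging shard results
                "shard_size": 10,  # each shard contributes its own top 10 to the merge
            }
        }
    },
)
for bucket in resp["aggregations"]["top_browsers"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```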
This works for things that are very popular, like browser types. Where this analytics approach falls down is when the values are very rare (e.g. your example of finding things that only occur twice). In this scenario the chances are that the two duplicates of an ID are to be found on different shards. This means that when each shard is asked to consider "likely top candidates" it would have to consider returning every ID, even if it occurs only once. Unlike the top-browser search, there is no way for each shard to identify promising candidates. This means you'd have to increase `shard_size` to return enormous lists of all IDs from each shard (which will be constrained by memory limits on responses).
There are 2 categories of solution here:
- Your client must do multiple requests to stream sorted lists of IDs from all shards and de-dupe them on the client (see the sketch after this list), or
- You should organise the data so that same IDs end up on the same shards.
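One way to implement option 1 is to page through every ID with a composite aggregation and keep only the keys that occur more than once (the coordinating node merges each page across shards, so the client's job is just the filtering). This is only a sketch - it assumes a keyword field called `transaction_id` and the 8.x-style Python client, and it will still be slow if you have many millions of IDs:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

page = None
dupes = {}
while True:
    comp = {
        "size": 1000,  # IDs per page
        "sources": [{"id": {"terms": {"field": "transaction_id"}}}],
    }
    if page:
        comp["after"] = page  # resume where the previous page ended
    resp = es.search(index="my-index", size=0,
                     aggs={"ids": {"composite": comp}})
    agg = resp["aggregations"]["ids"]
    for bucket in agg["buckets"]:
        if bucket["doc_count"] > 1:  # an ID seen more than once
            dupes[bucket["key"]["id"]] = bucket["doc_count"]
    page = agg.get("after_key")
    if not agg["buckets"] or not page:  # no more pages
        break

print(dupes)
```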
Option 2 is probably the saner option and there are 3 ways you can do it:
- use the transforms api to do this re-organisation into a side index, or
- rethink design and make your IDs the actual Elasticsearch IDs of documents (which will ensure there's only one doc per id), or
- rethink design again and use document routing to send docs with the same ID to the same shard, then use the terms agg to find duplicates as you previously outlined (see the sketch below this list)
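Here's a rough sketch of that last routing idea, again assuming the 8.x-style Python client and a `transaction_id` field (the names are placeholders). Because all copies of an ID are routed to the same shard, a terms agg with `min_doc_count: 2` no longer misses duplicates that would otherwise be split across shards:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index using the business ID as the routing value so every copy of the
# same ID lands on the same shard. (For the second option you'd pass id=
# instead, so a second copy overwrites or conflicts rather than duplicating.)
doc = {"transaction_id": "abc-123", "amount": 42}
es.index(index="my-index", document=doc, routing=doc["transaction_id"])

# Each shard now holds all copies of any given ID, so asking shards for
# IDs they have seen at least twice is reliable.
resp = es.search(
    index="my-index",
    size=0,
    aggs={
        "dupes": {
            "terms": {
                "field": "transaction_id",
                "min_doc_count": 2,  # only return IDs that occur 2+ times
                "size": 100,
            }
        }
    },
)
for bucket in resp["aggregations"]["dupes"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```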