Thanks, that's very useful background.
That is challenging. When data is distributed, the terms aggregation can normally be pretty accurate on fields where the cardinality is low, e.g. finding the most popular browser used in weblogs. This is because each shard can identify the top contenders reasonably accurately and independently - all shards would easily identify Chrome, Safari etc. as the top items and return counts that are then combined. If you wanted the top 5 browsers, each shard would return a larger number of results (`shard_size`, which defaults to the requested size of 5 multiplied by a small fudge-factor). For the sake of this example each shard may return stats for just their top 10 results, which are then fused and sorted to give the final top 5.
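To make that concrete, here's a minimal sketch of the "top browsers" case. It assumes a weblogs index with a keyword field called `browser` and the 8.x-style Python client - the index and field names are just placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="weblogs",
    size=0,  # we only want the aggregation, not the hits
    aggs={
        "top_browsers": {
            "terms": {
                "field": "browser",
                "size": 5,         # final top 5 after merging shard results
                "shard_size": 10,  # each shard contributes its own top 10 to the merge
            }
        }
    },
)
for bucket in resp["aggregations"]["top_browsers"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```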
This works for things that are very popular, like browser types. Where this analytics approach falls down is when the values are very rare (e.g. your example of finding things that only occur twice). In this scenario the chances are that the two duplicates of an ID are to be found on different shards. This means that when each shard is asked to consider "likely top candidates" it would have to consider returning every ID, even if it occurs only once. Unlike the top-browser search, there is no way for each shard to identify promising candidates. This means you'd have to increase `shard_size` to return enormous lists of all IDs from each shard (which will be constrained by memory limits on responses).
There are 2 categories of solution here:
- Your client must do multiple requests to stream sorted lists of IDs from all shards and de-dupe them on the client (see the sketch after this list), or
- You should organise the data so that same IDs end up on the same shards.
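One way to implement option 1 is to page through every ID with a composite aggregation and keep only the keys that occur more than once (the coordinating node merges each page across shards, so the client's job is just the filtering). This is only a sketch - it assumes a keyword field called `transaction_id` and the 8.x-style Python client, and it will still be slow if you have many millions of IDs:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

page = None
dupes = {}
while True:
    comp = {
        "size": 1000,  # IDs per page
        "sources": [{"id": {"terms": {"field": "transaction_id"}}}],
    }
    if page:
        comp["after"] = page  # resume where the previous page ended
    resp = es.search(index="my-index", size=0,
                     aggs={"ids": {"composite": comp}})
    agg = resp["aggregations"]["ids"]
    for bucket in agg["buckets"]:
        if bucket["doc_count"] > 1:  # an ID seen more than once
            dupes[bucket["key"]["id"]] = bucket["doc_count"]
    page = agg.get("after_key")
    if not agg["buckets"] or not page:  # no more pages
        break

print(dupes)
```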
Option 2 is probably the saner option and there are 3 ways you can do it:
- use the transforms api to do this re-organisation into a side index, or
- rethink design and make your IDs the actual Elasticsearch IDs of documents (which will ensure there's only one doc per id), or
- rethink design again and use document routing to send docs with the same ID to the same shard, then use the terms agg to find duplicates as you previously outlined (see the sketch below this list)
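Here's a rough sketch of that last routing idea, again assuming the 8.x-style Python client and a `transaction_id` field (the names are placeholders). Because all copies of an ID are routed to the same shard, a terms agg with `min_doc_count: 2` no longer misses duplicates that would otherwise be split across shards:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index using the business ID as the routing value so every copy of the
# same ID lands on the same shard. (For the second option you'd pass id=
# instead, so a second copy overwrites or conflicts rather than duplicating.)
doc = {"transaction_id": "abc-123", "amount": 42}
es.index(index="my-index", document=doc, routing=doc["transaction_id"])

# Each shard now holds all copies of any given ID, so asking shards for
# IDs they have seen at least twice is reliable.
resp = es.search(
    index="my-index",
    size=0,
    aggs={
        "dupes": {
            "terms": {
                "field": "transaction_id",
                "min_doc_count": 2,  # only return IDs that occur 2+ times
                "size": 100,
            }
        }
    },
)
for bucket in resp["aggregations"]["dupes"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```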