With the aggregation result, I do a bulk update with a script that sets "ctx._source.isDupe = true" on every document whose identificationHash matches my aggregation result.
I repeat steps 1 and 2 until the aggregation query returns no more results.
My question is: Is there a better solution to this problem? Can I do the same thing with one script query, without looping over batches of 1000 identification hashes?
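For reference, the loop described above can be sketched as two request bodies: a terms aggregation that finds hashes occurring at least twice, and an `_update_by_query` body that flags the matching docs. The index name, the batch size of 1000, and the `must_not` guard (which excludes already-flagged docs so each pass shrinks) are illustrative assumptions, not part of the original post:

```python
# Hypothetical sketch of the current approach: the field names
# identificationHash and isDupe come from the question; everything
# else (batch size, the must_not guard) is an assumption.

def duplicate_hashes_agg(batch_size=1000):
    """Terms aggregation body returning hashes that occur 2+ times,
    skipping docs already flagged as duplicates."""
    return {
        "size": 0,
        "query": {"bool": {"must_not": [{"term": {"isDupe": True}}]}},
        "aggs": {
            "dupes": {
                "terms": {
                    "field": "identificationHash",
                    "min_doc_count": 2,
                    "size": batch_size,
                }
            }
        },
    }

def mark_dupes_body(hashes):
    """_update_by_query body flagging every doc whose hash is in `hashes`."""
    return {
        "script": {"source": "ctx._source.isDupe = true", "lang": "painless"},
        "query": {"terms": {"identificationHash": hashes}},
    }
```

The client would POST the first body to `_search`, collect the bucket keys, POST the second to `_update_by_query`, and repeat until the aggregation returns no buckets.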
Aggs will struggle to produce accurate results if the content is distributed across many indices/shards and there are many unique IDs.
Can you share some numbers on these dimensions before we advise on an approach?
Thanks, that's very useful background.
That is challenging. When data is distributed, the terms aggregation can normally be pretty accurate on fields where the cardinality is low, e.g. finding the most popular browser used in weblogs. This is because each shard can identify the top contenders reasonably accurately and independently - all shards would easily identify chrome, safari etc. as the top items and return counts for combining. If you wanted the top 5 browsers, each shard would return a larger number of results (shard_size, which defaults to the requested size - 5 here - multiplied by a small fudge factor). For the sake of this example each shard may return stats for just its top 10 results, which are then fused and sorted to give the final top 5.
This works for things that are very popular, like browser types. Where this analytics approach falls down is when the values are very rare (e.g. your example of finding things that occur only twice). In this scenario, the chances are that the two duplicates of an ID are to be found on different shards. This means that when each shard is asked to consider "likely top candidates" it would have to consider returning every ID, even ones that occur only once. Unlike the top-browser search, there is no way for each shard to identify promising candidates. This means you'd have to increase shard_size to return enormous lists of all IDs from each shard (which will be constrained by memory limits on responses).
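A toy simulation (not Elasticsearch code - shard contents and sizes are invented for illustration) shows why a rare duplicate split across shards evades the fused top-N, while popular terms survive:

```python
from collections import Counter

def shard_top_terms(shard_docs, shard_size):
    """Each shard independently returns its shard_size most frequent terms."""
    return Counter(shard_docs).most_common(shard_size)

def fused_duplicates(shards, shard_size):
    """The coordinating node merges per-shard partial counts,
    then keeps terms with a combined count of 2 or more."""
    merged = Counter()
    for shard in shards:
        for term, count in shard_top_terms(shard, shard_size):
            merged[term] += count
    return {t for t, c in merged.items() if c >= 2}

# "dup" appears once on each of two shards: with a small shard_size
# neither shard nominates it, so the coordinator never sees it.
shard1 = ["chrome"] * 50 + ["safari"] * 30 + ["dup"]
shard2 = ["chrome"] * 40 + ["firefox"] * 20 + ["dup"]

print("dup" in fused_duplicates([shard1, shard2], shard_size=2))    # missed
print("dup" in fused_duplicates([shard1, shard2], shard_size=100))  # found
```

Only by inflating shard_size until every shard returns essentially all its terms does the duplicate surface, which is exactly the memory problem described above.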
There are 2 categories of solution here:

1. your client does multiple requests to stream sorted lists of IDs from all shards and de-dupes them on the client, or
2. you organise the data so that the same IDs end up on the same shard.
Option 2 is probably the saner one, and there are 3 ways you can do it:

1. use the transforms API to do this reorganisation into a side index,
2. rethink the design and make your IDs the actual Elasticsearch IDs of the documents (which will ensure there's only one doc per ID), or
3. rethink the design again and use document routing to send docs with the same ID to the same shard, then use the terms agg to find duplicates as you previously outlined.
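As an illustration of the routing option, a bulk request can carry a per-document `routing` value; routing on identificationHash co-locates all copies of a hash on one shard, so each shard can spot its duplicates locally and the terms aggregation becomes accurate. The index name "docs" is an assumption here:

```python
# Hypothetical sketch: build bulk-API action/source pairs that route
# each document by its identificationHash, so identical hashes always
# land on the same shard.

def bulk_index_actions(docs, index="docs"):
    """Yield alternating action and source lines for the _bulk API."""
    for doc in docs:
        yield {"index": {"_index": index, "routing": doc["identificationHash"]}}
        yield doc
```

With the data routed this way, the original terms aggregation with min_doc_count: 2 no longer suffers from the split-across-shards problem, because no shard ever holds only one copy of a duplicated hash.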