Elasticsearch partial update based on Aggregation result

I want to update partially all objects that are based on aggregation result.

Here is my object:

{
	"name": "name",
	"identificationHash": "aslkdakldjka",
	"isDupe": false,
	...
}

My goal is to set isDupe to true for all documents whose "identificationHash" appears at least twice.

Currently what I'm doing is:

  1. I fetch all documents where "isDupe" = false, with a terms aggregation on "identificationHash" using a min_doc_count of 2.
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "isDupe": {
              "value": false,
              "boost": 1
            }
          }
        }
      ]
    }
  },
  "aggregations": {
    "identificationHashCount": {
      "terms": {
        "field": "identificationHash",
        "size": 10000,
        "min_doc_count": 2
      }
    }
  }
}
  2. With the aggregation result, I do a bulk update with a script setting "ctx._source.isDupe = true" for every identificationHash returned by the aggregation.

I repeat steps 1 and 2 until the aggregation query returns no more results.
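The two-step loop above can be sketched as plain request bodies. This is a minimal illustration only: the helper names are hypothetical, and the second step assumes the bulk update is replaced with a single `_update_by_query` call per batch, filtering with a `terms` query on the aggregated hashes.

```python
# Sketch of the two request bodies used in the loop (no cluster needed).
# Helper names and the batch size are illustrative, not from the original post.

def dupe_agg_query(size=10000):
    """Step 1: find identificationHash values occurring at least twice
    among documents where isDupe is still false."""
    return {
        "size": 0,  # we only need the aggregation buckets, not hits
        "query": {"bool": {"must": [{"term": {"isDupe": {"value": False}}}]}},
        "aggregations": {
            "identificationHashCount": {
                "terms": {
                    "field": "identificationHash",
                    "size": size,
                    "min_doc_count": 2,
                }
            }
        },
    }

def mark_dupes_body(hashes):
    """Step 2: _update_by_query body that flags every document whose
    identificationHash is in the aggregated bucket list."""
    return {
        "query": {"terms": {"identificationHash": hashes}},
        "script": {"source": "ctx._source.isDupe = true", "lang": "painless"},
    }
```

Driving this with a client is then a matter of POSTing `dupe_agg_query()` to `_search`, collecting the bucket keys, and POSTing `mark_dupes_body(keys)` to `_update_by_query` until no buckets come back.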

My question is: Is there a better solution to this problem? Can I do the same thing with one script query, without looping over batches of 1000 identification hashes?

Aggregations will struggle to produce accurate results if the content is distributed across many indices/shards and there are many unique IDs.
Can you share some numbers on these dimensions before we advise an approach?

Usually, this algorithm is run on one index at a time.

The application is for file management and analytics. We have one index per day.

We index all the metadata of files in an organization infrastructure. We do that once a day in a new index.

Once the data is indexed, we run the previous algorithm to find all the duplicate files.

The shard count will change depending on where the application is installed, but here are some examples of how much data we can manage in one index:

Index size: 450 GB
Number of documents: 530 million
Unique IDs: 430 million
Primary Shards: 8
Replica: 0

Index size: 50 GB
Number of documents: 40 million
Unique IDs: 30 million
Primary Shards: 3
Replica: 0

Thanks, that's very useful background.
That is challenging. When data is distributed, the terms aggregation can normally be pretty accurate on fields where the cardinality is low, e.g. finding the most popular browser used in weblogs. This is because each shard can identify the top contenders reasonably accurately and independently - all shards would easily identify Chrome, Safari etc. as the top items and return counts for combining. If you wanted the top 5 browsers, each shard would return a larger number of results (shard_size, which defaults to the requested size multiplied by a small fudge factor). For the sake of this example, each shard may return stats for just its top 10 results, which are then fused and sorted to give the final top 5.
This works for things that are very popular, like browser types. Where this analytics approach falls down is when the values are very rare (e.g. your example of finding things that occur only twice). In this scenario, the chances are that the two duplicates of an ID are to be found on different shards. This means that when each shard is asked to consider "likely top candidates" it would have to consider returning every ID, even if it occurs only once. Unlike the top-browser search, there is no way for each shard to identify promising candidates. This means you'd have to increase shard_size to return enormous lists of all IDs from each shard (which will be constrained by memory limits on responses).
There are 2 categories of solution here:

  1. Your client must do multiple requests to stream sorted lists of IDs from all shards and de-dupe them on the client or
  2. You should organise the data so that same IDs end up on the same shards.

2 is probably the saner option, and there are 3 ways you can do it:

  • use the transforms API to do this re-organisation into a side index
  • rethink the design and make your IDs the actual Elasticsearch document IDs (which will ensure there's only one doc per ID), or
  • rethink the design again and use document routing to send docs with the same ID to the same shard, then use the terms agg to find duplicates as you previously outlined
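For the first option, a transform that pivots on the hash could look roughly like the config below. This is a sketch under assumptions: the index names are hypothetical, and it uses the transform API's pivot with a terms group_by and a value_count aggregation so the side index holds one row per hash with its occurrence count.

```python
# Illustrative transform config (body for PUT _transform/<id>).
# Index names are hypothetical placeholders.

def dupe_transform_config(src_index, dest_index):
    """Group source docs by identificationHash and count occurrences,
    writing one summary doc per hash into a side index."""
    return {
        "source": {"index": src_index},
        "dest": {"index": dest_index},
        "pivot": {
            "group_by": {
                "hash": {"terms": {"field": "identificationHash"}}
            },
            "aggregations": {
                # one counted value per source doc -> occurrence count per hash
                "doc_count": {"value_count": {"field": "identificationHash"}}
            },
        },
    }
```

Once the transform has run, finding duplicates becomes a simple range query on the side index (doc_count >= 2), with no cross-shard accuracy concerns, since each hash is a single document there.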

We will probably do something with the transforms API to have the data in a side index.

Thank you