Update_by_query fails to process all documents in Elasticsearch, `completed` turns true partway through

I have over 25 million documents in Elasticsearch (version 7.6.2), and there are duplicates among them. So I wrote a Painless script to delete one instance of each duplicated document. I tried the script on a test index (approx. 50k records) and it works fine, but when tested on the main index, which has the 25 million documents, it usually processes only about 300k documents and then stops.

The script/query runs in the background using `wait_for_completion=false`, and the timeout is specified in days. There is also a cron job that cleans/frees up RAM every 5 minutes, and the server has approximately 120 GB of free space (which I thought would be sufficient, since the script only deletes records and does not write anything).

The query looks like this:

POST testing_index/_update_by_query?wait_for_completion=false&timeout=4d
{
   "script": {
    "source": """
      // Strip the known URL prefix from the path so duplicates
      // can be compared on the remainder
      String replace(String word) {

        def prefixUrl = 'file://abc.directory.intra/homes/test';

        if(word.contains('xyz')) {
          prefixUrl = 'file://xyz.directory.intra/homes/test';
        }

        String[] pieces = word.splitOnToken(prefixUrl);
        int lastElIndex = pieces.length - 2;
        pieces[lastElIndex] = '';
        def list = Arrays.asList(pieces);
        return String.join('', list);
      }

      def path = replace(ctx._source.d_path);
      def matchResult = 'no';

      // If this path was already recorded, mark the document as a duplicate
      for(int i = 0; i < params.pathArray.length; i++) {

        if(params.pathArray[i] == path && path != "") {

          matchResult = 'yes';
          params.pathArray.remove(i);
        }
      }

      if(matchResult == 'yes') {
        ctx.op = 'delete';
      } else {
        // First time this path is seen: keep the document and remember the path
        ctx.op = 'noop';
        params.pathArray.add(path);
      }
    """,
    "lang": "painless",
    "params": {
      "pathArray": []
    }
  }
}
  
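The status of the background task is then polled through the tasks API, using the node and task id that the request returns (the values here are the ones shown in the response below):

```
GET _tasks/Jsecb8kBSdKLC47Q28O6Pg:5968304
```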
{
  "completed": true, // <-- why is this true when not all documents were processed?
  "task": {
    "node": "Jsecb8kBSdKLC47Q28O6Pg",
    "id": 5968304,
    "type": "transport",
    "action": "indices:data/write/update/byquery",
    "status": {
      "total": 24002005,
      "updated": 0,
      "created": 0,
      "deleted": 333567,
      "batches": 137,
      "version_conflicts": 0,
      "noops": 0,
      "retries": {
        "bulk": 0,
        "search": 0
      }
    }
  }
}
After processing almost 300k documents, `completed` turns `true` (the same behavior has been seen 2-3 times while testing), and I cannot work out why it does not process all the documents, when it worked fine with a small number of records.

The reason I suspect: the maximum number of statements that can be executed in a loop has been reached.

How can I increase this limit so that the script can process all 25 million documents in one go?

Hi, it seems that the state carried in `params` is indeed the issue; however, using `params` as a way to pass state between invocations of the script is not supported.

Elasticsearch does not guarantee when a new instance of `params` is materialized, but currently that happens on every new segment.

Scripts should be stateless unless the context has been specifically designed to pass state, like the scripted metric aggregation.
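As an aside, duplicates can often be located without any script state at all, for example with a plain terms aggregation on the deduplication field. This is only a sketch: it assumes `d_path` (or a keyword sub-field of it) identifies a duplicate, which may not hold if the prefix-stripping in your script is what makes two paths equal:

```
GET testing_index/_search
{
  "size": 0,
  "aggs": {
    "duplicate_paths": {
      "terms": {
        "field": "d_path.keyword",
        "min_doc_count": 2,
        "size": 1000
      }
    }
  }
}
```

Each bucket returned with a `doc_count` of 2 or more corresponds to a value that occurs in more than one document.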

A reindex request could change the `_id` to a unique value, which is one way to get rid of duplicates; perhaps that would work in this case?
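A minimal sketch of that reindex approach, assuming `d_path` is a suitable deduplication key and using a hypothetical destination index name: documents that compute the same `_id` overwrite each other in the destination, so only one copy survives.

```
POST _reindex
{
  "source": {
    "index": "testing_index"
  },
  "dest": {
    "index": "testing_index_dedup"
  },
  "script": {
    "lang": "painless",
    "source": "ctx._id = ctx._source.d_path"
  }
}
```

If the real dedup key is the path after prefix stripping, the same transformation from your update-by-query script would need to be applied inside this script before assigning `ctx._id`.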

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.