I have over 25 million documents in Elasticsearch (version 7.6.2), and there are duplicates among them. I have written a Painless script to delete one instance of each duplicated document. Running the script against a test index (approx. 50k records) works fine, but on the main index of 25 million documents it typically processes only around 300k documents and then stops.
The query runs in the background with wait_for_completion=false, and the timeout is set in days. There is also a cron job that frees up RAM every 5 minutes, and the server has approx. 120 GB of free disk space, which I assumed would be sufficient since the script only deletes records and does not write anything.
The query looks like this:
POST testing_index/_update_by_query?wait_for_completion=false&timeout=4d
{
  "script": {
    "source": """
      // Strip the share-specific prefix URL out of the document path
      // so paths from both shares can be compared.
      String replace(String word) {
        def prefixUrl = 'file://abc.directory.intra/homes/test';
        if (word.contains('xyz')) {
          prefixUrl = 'file://xyz.directory.intra/homes/test';
        }
        String[] pieces = word.splitOnToken(prefixUrl);
        int lastElIndex = pieces.length - 2;
        pieces[lastElIndex] = '';
        def list = Arrays.asList(pieces);
        return String.join('', list);
      }

      def path = replace(ctx._source.d_path);
      def matchResult = 'no';
      // If this path has been seen before, flag the document as a
      // duplicate and drop the remembered entry.
      for (int i = 0; i < params.pathArray.length; i++) {
        if (params.pathArray[i] == path && path != "") {
          matchResult = 'yes';
          params.pathArray.remove(i);
          break; // stop scanning once the matching entry is removed
        }
      }
      if (matchResult == 'yes') {
        ctx.op = 'delete'; // duplicate: delete this instance
      } else {
        ctx.op = 'noop';   // first occurrence: keep it and remember the path
        params.pathArray.add(path);
      }
    """,
    "lang": "painless",
    "params": {
      "pathArray": []
    }
  }
}
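Since the job runs in the background, I check its progress through the Tasks API, using the node and task ids returned by the request above:

GET _tasks/Jsecb8kBSdKLC47Q28O6Pg:5968304

The response looks like this: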
{
  "completed": true,
  "task": {
    "node": "Jsecb8kBSdKLC47Q28O6Pg",
    "id": 5968304,
    "type": "transport",
    "action": "indices:data/write/update/byquery",
    "status": {
      "total": 24002005,
      "updated": 0,
      "created": 0,
      "deleted": 333567,
      "batches": 137,
      "version_conflicts": 0,
      "noops": 0,
      "retries": {
        "bulk": 0,
        "search": 0
      }
    }
  }
}
After processing almost 300k documents the task comes back with "completed": true (I have seen the same behavior in 2 or 3 test runs), and I cannot figure out why it fails to process all the documents when it worked well with a small number of records.
The reason I suspect: the maximum number of statements that can be executed in a loop has been reached, since params.pathArray keeps growing as the script runs and the loop over it gets longer with every document.
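If I understand that limit correctly (reportedly 1,000,000 loop iterations per script execution by default), even a trivial script that loops past it should fail with the same error. A minimal sketch to illustrate, separate from my actual job (the loop_test field name and the 2,000,000 bound are made up for the example):

POST testing_index/_search
{
  "script_fields": {
    "loop_test": {
      "script": {
        "lang": "painless",
        "source": "long x = 0; for (int i = 0; i < 2000000; i++) { x += i; } return x;"
      }
    }
  }
}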
How can I increase this limit so that the script can process all 25 million documents in one go?