I have over 25 million documents in Elasticsearch (version 7.6.2), and there are duplicates among them. I have written a Painless script to delete one instance of each duplicated document. Running the script against a test index (approx. 50k records) works fine, but on the main index of 25 million documents it typically processes only around 300k documents and then stops.
The query runs in the background with wait_for_completion=false, and the timeout is set in days. There is also a cron job that frees up RAM every 5 minutes, and the server has approx. 120 GB of free disk space, which I assumed would be sufficient since the script only deletes records and does not write anything.
The query looks like this:
POST testing_index/_update_by_query?wait_for_completion=false&timeout=4d
{
  "script": {
    "source": """
      // Strip the share-specific prefix URL out of the document path
      // so paths from both shares can be compared.
      String replace(String word) {
        def prefixUrl = 'file://abc.directory.intra/homes/test';
        if (word.contains('xyz')) {
          prefixUrl = 'file://xyz.directory.intra/homes/test';
        }
        String[] pieces = word.splitOnToken(prefixUrl);
        int lastElIndex = pieces.length - 2;
        pieces[lastElIndex] = '';
        def list = Arrays.asList(pieces);
        return String.join('', list);
      }

      def path = replace(ctx._source.d_path);
      def matchResult = 'no';
      // If this path has been seen before, flag the document as a
      // duplicate and drop the remembered entry.
      for (int i = 0; i < params.pathArray.length; i++) {
        if (params.pathArray[i] == path && path != "") {
          matchResult = 'yes';
          params.pathArray.remove(i);
          break; // stop scanning once the matching entry is removed
        }
      }
      if (matchResult == 'yes') {
        ctx.op = 'delete'; // duplicate: delete this instance
      } else {
        ctx.op = 'noop';   // first occurrence: keep it and remember the path
        params.pathArray.add(path);
      }
    """,
    "lang": "painless",
    "params": {
      "pathArray": []
    }
  }
}
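Since the job runs in the background, I check its progress through the Tasks API, using the node and task ids returned by the request above:

GET _tasks/Jsecb8kBSdKLC47Q28O6Pg:5968304

The response looks like this: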
{
  "completed": true,
  "task": {
    "node": "Jsecb8kBSdKLC47Q28O6Pg",
    "id": 5968304,
    "type": "transport",
    "action": "indices:data/write/update/byquery",
    "status": {
      "total": 24002005,
      "updated": 0,
      "created": 0,
      "deleted": 333567,
      "batches": 137,
      "version_conflicts": 0,
      "noops": 0,
      "retries": {
        "bulk": 0,
        "search": 0
      }
    }
  }
}
After processing almost 300k documents the task comes back with "completed": true (I have seen the same behavior in 2 or 3 test runs), and I cannot figure out why it fails to process all the documents when it worked well with a small number of records.
The reason I suspect: the maximum number of statements that can be executed in a loop has been reached, since params.pathArray keeps growing as the script runs and the loop over it gets longer with every document.
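If I understand that limit correctly (reportedly 1,000,000 loop iterations per script execution by default), even a trivial script that loops past it should fail with the same error. A minimal sketch to illustrate, separate from my actual job (the loop_test field name and the 2,000,000 bound are made up for the example):

POST testing_index/_search
{
  "script_fields": {
    "loop_test": {
      "script": {
        "lang": "painless",
        "source": "long x = 0; for (int i = 0; i < 2000000; i++) { x += i; } return x;"
      }
    }
  }
}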
How can I increase this limit so that the script can process all 25 million documents in one go?