Elasticsearch updateByQuery is not working for large data

I have written a script that adds a key "source" to the documents in four Elasticsearch indices, based on some conditions. Each of the four indices contains at least 2 to 3 million documents. When I ran the script, it added the key for only three of the entities, so I ran it again for the last entity (deals) on its own. Afterwards I noticed that a significant number of documents still did not have the new key. I can confirm that the conditions in the script are correct. Here is the script:

const path = require('path');
require('dotenv').config({
    path: path.resolve(__dirname, '../.env')
});
const client = require('../server/utils/es-utils').initializeESConnection();

(async function addField(){

    const sq = {
        script: {
            source: `
                if (ctx._source.sourceCreated != null) {
                    if (ctx._source.sourceCreated.recordSource != null) {
                        ctx._source.sourceCreated.source = ctx._source.sourceCreated.recordSource;
                    } else if (ctx._source.sourceCreated.clientType != null) {
                        ctx._source.sourceCreated.source = ctx._source.sourceCreated.clientType.toLowerCase();
                    }
                }
            `,
            lang: "painless"
        }
    };

    const indices = ['contacts', 'accounts', 'deals', 'crm-activities'];
    for (const index of indices) {
        const response = await client.updateByQuery({
            index: index,
            body: sq,
            wait_for_completion: false,
            refresh: true
        });
        console.log(`Updating documents in index ${index}`);
    }
})();

I need to make sure that, after running the script, the key "source" is present on all documents in these four indices.

@Karan_Rawat

You are running update by query asynchronously, so one tip for your code is to use the returned task ID to monitor the operation. That way you will know whether any documents hit version conflicts and how many documents were updated.
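To make the task ID suggestion concrete, here is a minimal sketch of polling the Tasks API until the update finishes. The helper names `waitForTask` and `unwrap` and the 5 second poll interval are made up for illustration, and the code assumes `client` is an @elastic/elasticsearch client whose responses are either wrapped in `body` (7.x style) or returned directly (8.x style).

```javascript
// Hypothetical helper: resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Assumption: 7.x clients wrap the payload in `body`, 8.x clients return it directly.
const unwrap = (res) => (res && res.body !== undefined ? res.body : res);

// Polls GET _tasks/<taskId> until the task reports `completed`,
// then returns the counters stored with the finished update-by-query.
async function waitForTask(client, taskId, pollMs = 5000) {
    for (;;) {
        const info = unwrap(await client.tasks.get({ task_id: taskId }));
        if (info.completed) {
            if (info.error) {
                throw new Error(`Task ${taskId} failed: ${JSON.stringify(info.error)}`);
            }
            const result = info.response || {};
            return {
                total: result.total,                       // documents matched
                updated: result.updated,                   // documents rewritten
                versionConflicts: result.version_conflicts,
                failures: result.failures || []
            };
        }
        await sleep(pollMs);
    }
}
```

In the loop from the question, the task ID would come from the updateByQuery response, for example `const taskId = (response.body || response).task;`, followed by `console.log(index, await waitForTask(client, taskId));`. If `updated` is lower than `total`, or `versionConflicts` or `failures` is non-zero, some documents were skipped.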

I was constantly monitoring the tasks from Kibana, and no such conflicts came up. One thing I do remember is that when I ran the query without the flag 'wait_for_completion: false', I received a query timeout. Could that have led to some inconsistency?

I can't confirm it, but possibly yes. In any case, before running the update you could check how many documents should be updated, and after applying the update compare that number with the count of documents actually updated. That would give you confidence in the result.
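The before/after comparison could look roughly like the sketch below. It counts documents that the Painless script from the question would change but that do not yet have the new key. The function names `pendingQuery` and `countPending` are hypothetical, and the query assumes that `sourceCreated.recordSource`, `sourceCreated.clientType` and `sourceCreated.source` are indexed fields, so that `exists` queries can see them.

```javascript
// Documents the script should touch: recordSource or clientType is present,
// but sourceCreated.source has not been written yet.
function pendingQuery() {
    return {
        bool: {
            should: [
                { exists: { field: 'sourceCreated.recordSource' } },
                { exists: { field: 'sourceCreated.clientType' } }
            ],
            minimum_should_match: 1,
            must_not: [{ exists: { field: 'sourceCreated.source' } }]
        }
    };
}

// Returns how many documents in `index` still lack the new key.
// Assumption: `client` is an @elastic/elasticsearch client (7.x or 8.x response shape).
async function countPending(client, index) {
    const res = await client.count({ index, body: { query: pendingQuery() } });
    const body = res && res.body !== undefined ? res.body : res;
    return body.count;
}
```

Running `countPending` for each index before the update gives the expected number of changes, and running it again afterwards (once the index has been refreshed) should return 0. The same query could also be added as `query` next to `script` in the updateByQuery body, so that a rerun only touches the documents that are still missing the key.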
