Update_by_query?routing=orgid performance very slow

Hello All,

/_update_by_query
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "my_key": "myapikey"
          }
        }
      ]
    }
  },
  "script": {
    "source": """
            if (ctx._source[params.field] instanceof String) {
              def fieldValue = ctx._source[params.field];
              ctx._source[params.field] = new ArrayList();
              ctx._source[params.field].add(fieldValue);
              ctx._source[params.field].add(params.value);
            } else if (ctx._source[params.field] instanceof List
                  && !ctx._source[params.field].contains(params.value)) {
              ctx._source[params.field].add(params.value);
            }
          """,
    "params": {
      "field": "_branches",
      "value": "dev_01_02"
    },
    "lang": "painless"
  }
}
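In plain terms, the script converts a scalar field into a list and then appends the new value if it is not already present. A rough Python equivalent of that logic (function and variable names here are illustrative, not part of any API):

```python
# Rough Python equivalent of the Painless script above: if the field holds a
# single string, wrap it in a list and append the new value; if it is already
# a list, append the value only when it is not present yet.
def add_branch(source, field, value):
    current = source.get(field)
    if isinstance(current, str):
        source[field] = [current, value]
    elif isinstance(current, list) and value not in current:
        current.append(value)
    return source

doc = {"_branches": "dev_01_01"}
add_branch(doc, "_branches", "dev_01_02")
# doc["_branches"] is now ["dev_01_01", "dev_01_02"]
```

Note that, like the Painless version, this does nothing when the field is absent from the document.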

For 50,000 records it takes 20 minutes. Why is it so slow, and how do I improve the performance?
By the way, I tried using slices and various batch/scroll sizes, with little impact.
The actual requirement is to update the field on 4 million records, which looks like it will take hours.
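For reference, manual slicing splits one update-by-query into independent requests that can run in parallel; each body carries a `"slice": {"id": i, "max": n}` clause alongside the same query and script. A sketch that just builds the per-slice request bodies:

```python
# Sketch: build request bodies for a manually sliced update-by-query.
# Each body gets a "slice" clause so the n requests can be issued
# concurrently against /_update_by_query.
def sliced_bodies(base_body, max_slices):
    bodies = []
    for slice_id in range(max_slices):
        body = dict(base_body)  # shallow copy; query/script are shared
        body["slice"] = {"id": slice_id, "max": max_slices}
        bodies.append(body)
    return bodies

base = {"query": {"term": {"my_key": "myapikey"}}}
for b in sliced_bodies(base, 3):
    print(b["slice"])
```

Since custom routing places all matching documents on a single shard, slicing here only parallelizes work within that shard, which may explain why it had little impact.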

Which version of Elasticsearch are you using?

What is the size and specification of your cluster? What type of storage are you using?

How large is the index you are updating in terms of shard count, size in GB, and document count? Are you using nested documents in your mappings?

Version 7.17.
16 GB RAM and 300 GB storage. The index is defined with 3 shards and holds 100 GB of data; since custom routing is used, the query is processed on only one shard.
There is one level of nested fields, but the query used is not a nested query, and the field being updated is not of nested type.

That does not really matter. Nested documents are stored as a collection of documents behind the scenes and all of these are reindexed on any update, which adds overhead. The reason for this is that Lucene uses immutable segments, and all related nested documents need to reside in the same segment.
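As a back-of-the-envelope illustration of that overhead (an estimate, not a measurement): a parent with n nested objects is stored as n + 1 Lucene documents, and all of them are rewritten whenever the parent is updated.

```python
# Illustration: a parent document with `children` nested objects is stored
# as children + 1 Lucene documents, all rewritten on any update.
def lucene_docs_rewritten(parents_updated, children_per_parent):
    return parents_updated * (children_per_parent + 1)

# Updating 50,000 parents that each carry 10 nested objects rewrites
# 550,000 Lucene documents, not 50,000.
print(lucene_docs_rewritten(50_000, 10))  # 550000
```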

Updating a lot of documents using update-by-query can therefore result in a lot of disk I/O. If you have slow storage this can become a bottleneck. What type of storage are you using? Local SSD? Have you monitored await and disk utilisation while you run the update-by-query?

Given that the query filters only parent documents, not nested documents, and updates just 50,000 records, will it rewrite the complete documents, including the nested documents? If yes, is there a way to skip rewriting the complete documents and update just the one field?

We use Elastic Cloud, with CPU-optimized instances.

Yes.

No. All related nested documents need to be in the same segment, and as segments are immutable, all of them must be written to the new segment together with the updated document.

No. As Elasticsearch and Lucene rely on immutable segments, in-place updates are not possible.

Updating 4 million records may take hours then. Is there any alternative solution available to update documents faster?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.