Update_by_query?routing=orgid performance very slow

Hello All,

/_update_by_query
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "my_key": "myapikey"
          }
        }
      ]
    }
  },
  "script": {
    "source": """
            if (ctx._source[params.field] instanceof String) {
              def fieldValue = ctx._source[params.field];
              ctx._source[params.field] = new ArrayList();
              ctx._source[params.field].add(fieldValue);
              ctx._source[params.field].add(params.value);
            } else if (ctx._source[params.field] instanceof List
                  && !ctx._source[params.field].contains(params.value)) {
              ctx._source[params.field].add(params.value);
            }
          """,
    "params": {
      "field": "_branches",
      "value": "dev_01_02"
    },
    "lang": "painless"
  }
}
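In plain terms, the script converts a scalar field into a list and then appends the new value if it is not already present. A rough Python equivalent of that logic (function and variable names here are illustrative, not part of any API):

```python
# Rough Python equivalent of the Painless script above: if the field holds a
# single string, wrap it in a list and append the new value; if it is already
# a list, append the value only when it is not present yet.
def add_branch(source, field, value):
    current = source.get(field)
    if isinstance(current, str):
        source[field] = [current, value]
    elif isinstance(current, list) and value not in current:
        current.append(value)
    return source

doc = {"_branches": "dev_01_01"}
add_branch(doc, "_branches", "dev_01_02")
# doc["_branches"] is now ["dev_01_01", "dev_01_02"]
```

Note that, like the Painless version, this does nothing when the field is absent from the document.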

For 50,000 records it takes 20 minutes. Why is it so slow, and how do I improve the performance?
By the way, I tried using slices and various batch/scroll sizes, with little impact.
The actual requirement is to update the field on 4 million records, which looks like it will take hours.
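For reference, manual slicing splits one update-by-query into independent requests that can run in parallel; each body carries a `"slice": {"id": i, "max": n}` clause alongside the same query and script. A sketch that just builds the per-slice request bodies:

```python
# Sketch: build request bodies for a manually sliced update-by-query.
# Each body gets a "slice" clause so the n requests can be issued
# concurrently against /_update_by_query.
def sliced_bodies(base_body, max_slices):
    bodies = []
    for slice_id in range(max_slices):
        body = dict(base_body)  # shallow copy; query/script are shared
        body["slice"] = {"id": slice_id, "max": max_slices}
        bodies.append(body)
    return bodies

base = {"query": {"term": {"my_key": "myapikey"}}}
for b in sliced_bodies(base, 3):
    print(b["slice"])
```

Since custom routing places all matching documents on a single shard, slicing here only parallelizes work within that shard, which may explain why it had little impact.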

Which version of Elasticsearch are you using?

What is the size and specification of your cluster? What type of storage are you using?

How large is the index you are updating in terms of shard count, size in GB, and document count? Are you using nested documents in your mappings?

Version 7.17.
16 GB RAM and 300 GB storage. The index is defined with 3 shards and holds 100 GB of data; since custom routing is used, the query is processed on only one shard.
There is one level of nested fields, but the query used is not a nested query, and the field being updated is not of nested type.

That does not really matter. Nested documents are stored as a collection of documents behind the scenes and all of these are reindexed on any update, which adds overhead. The reason for this is that Lucene uses immutable segments, and all related nested documents need to reside in the same segment.
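As a back-of-the-envelope illustration of that overhead (an estimate, not a measurement): a parent with n nested objects is stored as n + 1 Lucene documents, and all of them are rewritten whenever the parent is updated.

```python
# Illustration: a parent document with `children` nested objects is stored
# as children + 1 Lucene documents, all rewritten on any update.
def lucene_docs_rewritten(parents_updated, children_per_parent):
    return parents_updated * (children_per_parent + 1)

# Updating 50,000 parents that each carry 10 nested objects rewrites
# 550,000 Lucene documents, not 50,000.
print(lucene_docs_rewritten(50_000, 10))  # 550000
```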

Updating a lot of documents using update-by-query can therefore result in a lot of disk I/O. If you have slow storage this can become a bottleneck. What type of storage are you using? Local SSD? Have you monitored await and disk utilisation while you run the update-by-query?

Given that the query filters only parent documents, not nested documents, and updates just 50,000 records, will it rewrite the complete documents, including the nested documents? If yes, is there a way to skip rewriting the complete documents and update just the one field?

We use Elastic Cloud, with CPU-optimized instances.

Yes.

No. All related nested documents need to be in the same segment, and as segments are immutable, all of them must be written to the new segment together with the updated document.

No. As Elasticsearch and Lucene rely on immutable segments, in-place updates are not possible.

Updating 4 million records may take hours then. Is there any alternative solution available to update documents faster?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.