_update_by_query is consuming a lot of disk space

Hi, I created a new field for my indices and I'm running an _update_by_query to populate it by aggregating values from other fields:

{
  "script": {
    "source": """
      ctx._source.all_names = [];
      ctx._source.all_names.add(ctx._source.name);
      if (ctx._source.previous_names_nested != null) {
        for (int i = 0; i < ctx._source.previous_names_nested.length; i++) {
          ctx._source.all_names.add(ctx._source.previous_names_nested[i].company_name);
        }
      }
      if (ctx._source.other_names != null) {
        ctx._source.all_names.addAll(ctx._source.other_names);
      }
    """
  }
}

While this seems to work, for some reason it's using far more disk space than expected. Most of the time "previous_names_nested" and "other_names" are null, so only "name" gets copied. Yet, as I run the query and monitor disk usage, I can see that I will run out of disk space before it finishes. Am I doing something wrong here?

An update is basically a DELETE plus an INDEX operation.
A delete does not actually remove the document right away; it only marks it as deleted, so each updated document temporarily takes extra space because the old copy stays on disk alongside the new one.
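You can watch this happen with the _cat/indices API, which reports how many deleted documents an index is still carrying (my-index below is just a placeholder for your index name):

# my-index is a placeholder for your index name
GET _cat/indices/my-index?v&h=index,docs.count,docs.deleted,store.size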

This old, no-longer-needed space is eventually reclaimed when segments are merged.
Could you try running the force merge API to see if that helps?
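Something along these lines (again, my-index is a placeholder; only_expunge_deletes asks the merge to focus on segments that contain deleted documents):

# my-index is a placeholder for your index name
POST my-index/_forcemerge?only_expunge_deletes=true

Keep in mind that a force merge is itself I/O-heavy, so it is best run once you have finished writing to the index.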

If you're updating the whole index, it might be a good idea to actually reindex everything into a new index.
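As a rough sketch (my-index and my-index-v2 are placeholder names; you would normally create my-index-v2 with the mapping you want first, then switch an alias over once the copy is done). The script is the same logic as above, condensed onto one line:

# my-index and my-index-v2 are placeholders for your index names
POST _reindex
{
  "source": { "index": "my-index" },
  "dest": { "index": "my-index-v2" },
  "script": {
    "source": "ctx._source.all_names = [ctx._source.name]; if (ctx._source.previous_names_nested != null) { for (def n : ctx._source.previous_names_nested) { ctx._source.all_names.add(n.company_name) } } if (ctx._source.other_names != null) { ctx._source.all_names.addAll(ctx._source.other_names) }"
  }
}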

Thanks for the response. I ran force merge; nothing happened straight away, but the space was reclaimed overnight. I'm not sure whether the force merge played a part in that or whether it happened on its own.
