Best approach for updating a large number of documents

Hello,

I would like to ask what the best approach would be for updating large numbers of documents.

For our use case we have documents that need to be enriched with additional data a few hours after they have been indexed into the system.

The solution I came up with is to use update_by_query. The tricky part is that in order to calculate some of the new fields we need data from the documents themselves, so we are using a Painless script to do the calculations.

{
  "script": {
    "source": """
      ctx._source.avg_revenue = params.revenue;
      ctx._source.avg_gross_revenue = params.grossRevenue;
      ctx._source.avg_publisher_revenue = (ctx._source.net / ctx._source.gross) * params.revenue;
      ctx._source.avg_count = params.count;
    """,
    "lang": "painless"
  }
}
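
For completeness, we keep this as a stored script under the id "update-events" (the id referenced in the request below); creating it looks roughly like this:

PUT _scripts/update-events
{
  "script": {
    "lang": "painless",
    "source": """
      ctx._source.avg_revenue = params.revenue;
      ctx._source.avg_gross_revenue = params.grossRevenue;
      ctx._source.avg_publisher_revenue = (ctx._source.net / ctx._source.gross) * params.revenue;
      ctx._source.avg_count = params.count;
    """
  }
}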

And then we use update_by_query to update the relevant documents.

POST /events/_update_by_query?wait_for_completion=false
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "country": "uk"
          }
        },
        {
          "term": {
            "category_id": "200161"
          }
        },
        {
          "range": {
            "@timestamp": {
              "format": "yyyy-MM-dd HH:mm:ss Z",
              "gte": "2019-12-17 00:00:00 +0000",
              "lte": "2019-12-17 01:00:00 +0000"
            }
          }
        }
      ]
    }
  },
  "script": {
    "id": "update-events",
    "params": {
      "revenue": 20,
      "grossRevenue": 200,
      "count": 2000
    }
  }
}
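
With wait_for_completion=false the call returns a task id instead of blocking until the update finishes; the task can then be checked via the task API (the task id below is just a placeholder):

GET _tasks/oTUltX4IQMOUUVeiohTt8A:12345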

This works just fine. The problem is that we have to do a lot of such updates: there are many combinations of country and category_id, so to update all our data we need to fire around 70,000 requests like the one above (sometimes only one or two documents are updated, but they need to be updated nevertheless).

The solution works, but it takes a lot of time, as we run the requests synchronously in order not to overload the cluster with this task.

I was wondering whether this approach seems reasonable, whether I can speed up the updating process, and whether there are other ways to do this, as it looks like a very common use case.

Thanks in advance.

Hey,

this approach sounds reasonable to me, despite its performance impact. I am wondering if there is any chance to reduce the number of combinations by providing more parameters in a single update request.

Also, if you have the possibility to recreate the JSON on your side, that might be faster than a scripted update, but this is purely speculation on my side.

How many documents does a single batch update? Also, you could try running several update requests in parallel to speed up execution; it seems as if those process different documents anyway, so there is little chance of two update requests trying to update the same document.
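
To illustrate the parallel idea, here is a rough sketch using the official Python client and a small thread pool to cap the concurrency; the index name, connection URL, worker count, and the list of combinations are placeholders, and the date range filter is left out for brevity:

from concurrent.futures import ThreadPoolExecutor
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One (country, category_id, params) combination per update request.
combinations = [
    {"country": "uk", "category_id": "200161", "revenue": 20, "grossRevenue": 200, "count": 2000},
    # ... the remaining combinations
]

def run_update(combo):
    # Same body as the update_by_query request above, built per combination.
    body = {
        "query": {"bool": {"filter": [
            {"term": {"country": combo["country"]}},
            {"term": {"category_id": combo["category_id"]}},
        ]}},
        "script": {
            "id": "update-events",
            "params": {
                "revenue": combo["revenue"],
                "grossRevenue": combo["grossRevenue"],
                "count": combo["count"],
            },
        },
    }
    return es.update_by_query(index="events", body=body, conflicts="proceed")

# A small pool runs several updates at once without flooding the cluster.
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(run_update, combinations):
        print(result["updated"])

The max_workers value is the knob to turn if the cluster starts to struggle.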

Sadly we cannot do this; I tried to find a way to reduce the combinations.

If I get this right, the idea is to fetch all documents, do the calculations outside Elasticsearch, and then bulk update the documents?
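
If so, I guess the bulk request would look something like this, with the values computed on our side first (document ids and numbers are made up):

POST /events/_bulk
{ "update": { "_id": "doc-id-1" } }
{ "doc": { "avg_revenue": 20, "avg_gross_revenue": 200, "avg_publisher_revenue": 1.5, "avg_count": 2000 } }
{ "update": { "_id": "doc-id-2" } }
{ "doc": { "avg_revenue": 20, "avg_gross_revenue": 200, "avg_publisher_revenue": 2.5, "avg_count": 2000 } }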

This is how long it takes for one update to complete and how many documents we update with one request.

2020-01-23 17:52:59 INFO: Revenues for gb and 200122 for date 2020-01-22 updated.
    {"took":4163,"timed_out":false,"total":8144,"updated":8144,"deleted":0,"batches":9,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
2020-01-23 17:53:05 INFO: Revenues for us and 200122 for date 2020-01-22 updated.
    {"took":5026,"timed_out":false,"total":8318,"updated":8318,"deleted":0,"batches":9,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
2020-01-23 17:53:08 INFO: Revenues for de and 200122 for date 2020-01-22 updated.
    {"took":3109,"timed_out":false,"total":6646,"updated":6646,"deleted":0,"batches":7,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}

There is no chance that one document is updated by multiple requests, so sending the requests in parallel will speed things up. I am more concerned about the impact such updates will have on the cluster if we run a lot of simultaneous updates, but this can only be checked by benchmarking.

Also, currently we send the requests to one of our data nodes. Would it be a good idea to set up an ingest node or coordinating node and send the updates to that node? Would this impact performance at all? I know that ingest nodes intercept bulk requests, and as far as I am aware update_by_query does bulk updates under the hood; am I missing something?
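
For reference, by coordinating node I mean a node with all other roles switched off in elasticsearch.yml (7.x-style settings):

node.master: false
node.data: false
node.ingest: false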

the node that receives the request will also have to execute the get request for that document and apply the specified updates, so spreading this out should have some impact in this concrete use case (slightly different from bulk requests, due to changing the received JSON).

@spinscale Thanks for the pointers. I managed to reduce our execution time by sending multiple requests asynchronously. This puts some pressure on the node we send the requests to, but it is something we can live with.

So just to be sure: if we add a new coordinating node and send the update requests to that node, will the coordinating node run the update script?

I was under the impression that the scripts are run by the nodes that contain the actual documents.

Rereading the source, I think I was wrong. This is done on the node with the primary shard. See https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/action/update/TransportUpdateAction.java#L160

Got it.
