Hello,
I would like to ask what the best approach would be for updating large numbers of documents.
For our use case we have documents that need to be enriched with additional data a few hours after they have been indexed into the system.
The solution I came up with is to use update_by_query, but the tricky part is that in order to calculate some of the new fields we need data from the documents themselves, so we are using a Painless script to do the calculations.
{
  "script": {
    "source": """
      ctx._source.avg_revenue = params.revenue;
      ctx._source.avg_gross_revenue = params.grossRevenue;
      ctx._source.avg_publisher_revenue = (ctx._source.net / ctx._source.gross) * params.revenue;
      ctx._source.avg_count = params.count;
    """,
    "lang": "painless"
  }
}
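The script is stored so it can be referenced by id in the requests below; the registration looks roughly like this (assuming the standard stored scripts API, with the id matching the one used in the request):

PUT _scripts/update-events
{
  "script": {
    "lang": "painless",
    "source": """
      ctx._source.avg_revenue = params.revenue;
      ctx._source.avg_gross_revenue = params.grossRevenue;
      ctx._source.avg_publisher_revenue = (ctx._source.net / ctx._source.gross) * params.revenue;
      ctx._source.avg_count = params.count;
    """
  }
}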
We then use update_by_query to update the relevant documents.
POST /events/_update_by_query?wait_for_completion=false
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "country": "uk"
          }
        },
        {
          "term": {
            "category_id": "200161"
          }
        },
        {
          "range": {
            "@timestamp": {
              "format": "yyyy-MM-dd HH:mm:ss Z",
              "gte": "2019-12-17 00:00:00 +0000",
              "lte": "2019-12-17 01:00:00 +0000"
            }
          }
        }
      ]
    }
  },
  "script": {
    "id": "update-events",
    "params": {
      "revenue": 20,
      "grossRevenue": 200,
      "count": 2000
    }
  }
}
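Since the request runs with wait_for_completion=false, Elasticsearch returns a task ID instead of waiting for the update to finish, and progress can be followed through the Tasks API, along these lines (the specific task ID below is just a placeholder):

GET _tasks?detailed=true&actions=*byquery

GET _tasks/r1A2WoRbTwKZ516z6NEs5A:36619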
This works just fine. The problem is that we have to do a lot of such updates: there are many combinations of country and category_id, so to update all our data we need to fire off around 70,000 requests like the one above (sometimes the updated documents are just one or two, but they need to be updated nevertheless).
The solution works, but it takes a lot of time, as we run the requests synchronously in order not to overload the cluster with this task.
I was wondering whether this approach seems reasonable, whether I can speed up the updating process, and whether there are any other ways to do this, as it looks like a very common use case.
Thanks in advance.