Background Update of Millions of Documents Using NEST API

What is the most efficient way of updating millions of documents in the background so that it has little effect on foreground operations such as autocomplete and user searches?

Using the bulk API may not be ideal if you send them all in one batch, so I guess it depends on how often they are updated.

But there's no special way to go about this, if that is what you are asking. The cost of an update has to be paid whether it's one document or many.

Thanks for your response.
I appreciate that the cost of the update has to be paid - I was just trying to establish if there is a background process that can handle this without affecting the normal operation of querying indices.

No, there is not. Any indexing or updating will add load to the cluster and could affect searches. If the index is not otherwise constantly receiving new data or updates, and the updated documents do not need to be immediately available for search, you might be able to clone the index and move the clone to a new temporary node (that only holds this data) or to another cluster. You could then update this index without affecting the load on the other production nodes. Once completed, you could move the index to the production node(s), delete the original index and create an alias that points to the new index (or change the application to use the new index).
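
As a rough illustration of the final swap step, the sketch below builds a request body for the Elasticsearch `_aliases` endpoint that adds an alias named after the original index to the updated index and removes the original index in the same call. The index names and the helper `alias_swap_body` are hypothetical, and the code only constructs the JSON; sending it (for example through NEST or any HTTP client) is left out.

```python
import json


def alias_swap_body(original_index: str, updated_index: str) -> str:
    """Build a body for POST /_aliases (hypothetical helper, names assumed).

    The alias takes the name of the original index, so the application
    can keep querying the same name after the swap.
    """
    actions = [
        # Point an alias with the old index name at the updated index.
        {"add": {"index": updated_index, "alias": original_index}},
        # Drop the original index as part of the same request.
        {"remove_index": {"index": original_index}},
    ]
    return json.dumps({"actions": actions})


if __name__ == "__main__":
    print(alias_swap_body("products", "products-updated"))
```

Doing both actions in one request means searches against the old name should not hit a window where neither the index nor the alias exists.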


Hi. Thanks for your reply. This would then make the two indices out of sync if there are other minor/small changes on the original index.
Thanks anyway. We will have to see how to solve this problem in some other way.

You can do something like this:

  1. Stop any data coming in for a few seconds.
  2. Make a copy of index-A (a snapshot), then delete index-A.
  3. Start sending new data to the same index name, index-A. This index will now hold only new data; the old data is gone from it because you made a copy.
  4. Restore that snapshot under a different name (index-B), in the same cluster or another one, and make the required changes there.
  5. Read each record from index-B and copy it to index-A.

I have used this method in the past.
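
One possible way to do the final copy step is the Elasticsearch `_reindex` API rather than reading records by hand. The sketch below only builds the request path and body; the index names, the rate of 500 documents per second, and the helper `build_reindex` are assumptions for illustration. Using `op_type: create` with `conflicts: proceed` is also an assumption about the desired behaviour: it skips documents that already exist in index-a, so newer data there is not overwritten by the restored copy.

```python
import json
from urllib.parse import urlencode


def build_reindex(source_index: str, dest_index: str, requests_per_second: int = 500):
    """Return (path, body) for a throttled background reindex (hypothetical helper)."""
    params = urlencode({
        "wait_for_completion": "false",              # run as a background task
        "requests_per_second": requests_per_second,  # throttle the copy rate
    })
    body = {
        "conflicts": "proceed",  # keep going when a document already exists
        "source": {"index": source_index},
        # "create" only adds documents that are missing from the destination.
        "dest": {"index": dest_index, "op_type": "create"},
    }
    return f"/_reindex?{params}", json.dumps(body)


if __name__ == "__main__":
    path, body = build_reindex("index-b", "index-a")
    print("POST", path)
    print(body)
```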

Great suggestion. Thanks.
However, we wanted to do this in the background in order not to affect normal ops such as searching/autocomplete etc.

If we have to copy each record, this will slow down other ops.
But thanks for your response.

If you are continuously making changes to the indices, my approach will not work. I suspect you will need to update the index directly and throttle the update rate so that search latency is not affected in any major way.
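
A minimal sketch of what client-side throttling could look like: documents are grouped into small bulk requests, with a pause between batches so the average rate stays under a chosen ceiling. The helper names, the batch size and the rate are all hypothetical, the `popularity` field is borrowed from later in this thread, and the code only builds the NDJSON body for the `_bulk` endpoint without sending it.

```python
import json
import time
from typing import Callable, Iterable, Iterator, List

# Painless script that tolerates documents without the field yet.
INCREMENT_SCRIPT = (
    "if (ctx._source.popularity == null) { ctx._source.popularity = params.inc } "
    "else { ctx._source.popularity += params.inc }"
)


def bulk_update_body(index: str, doc_ids: List[str], increment: int = 1) -> str:
    """Build an NDJSON body for POST /_bulk with one scripted update per id."""
    lines = []
    for doc_id in doc_ids:
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({
            "script": {
                "lang": "painless",
                "source": INCREMENT_SCRIPT,
                "params": {"inc": increment},
            }
        }))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


def throttled_batches(
    doc_ids: Iterable[str],
    batch_size: int,
    max_docs_per_second: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[List[str]]:
    """Yield batches of ids, pausing after each full batch to cap the rate."""
    batch: List[str] = []
    for doc_id in doc_ids:
        batch.append(doc_id)
        if len(batch) == batch_size:
            yield batch
            # Pause long enough that batch_size docs take at least this long.
            sleep(len(batch) / max_docs_per_second)
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    ids = [str(i) for i in range(5)]
    for chunk in throttled_batches(ids, batch_size=2, max_docs_per_second=4):
        print(bulk_update_body("index-a", chunk), end="")
```

The right batch size and rate depend on the cluster, so in practice they would be tuned while watching search latency.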

Could you please provide some more context/info around the use-case that prompted this question? The initial question is somewhat vague and there might be an alternative solution to the actual use-case here, rather than the question you're asking.

Hi. Thanks for your response.
Basically, every time a user does a search for, say, ProductA using IndexA, say we get 1,000,000 documents that satisfy the search using a filter.

What we would like to do is have a field "popularity" and increment it every time this product satisfies a filter condition, something like a popularity index, i.e. how many times this product has satisfied the search conditions across ALL the searches that have been performed since the creation of the index.

Thanks for the additional detail. I think there are a few approaches to this, but they might not all be optimal:

  1. Elastic Enterprise Search has a way to track these types of metrics, you might be able to look at using this, or at least look into how they're implementing it to get a similar outcome.
  2. At the "client" level, you could have a separate "metrics" index/cluster. Whenever someone performs a search, you can then call a separate function that updates this index/cluster with the data. Therefore you don't affect the main index/cluster, while still being able to track the data.
  3. You can use the update by query API to update the docs in the background by setting wait_for_completion to false. You can in theory then tune the update query params to your needs to not affect search performance.
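
To make option 3 more concrete, here is a sketch that builds an update by query request for the "popularity" counter. It uses the documented `wait_for_completion`, `requests_per_second` and `conflicts` URL parameters; the index name, the example filter, the rate of 200 and the helper `build_update_by_query` are assumptions. The code only builds the path and JSON body and does not call the cluster.

```python
import json
from urllib.parse import urlencode


def build_update_by_query(index: str, query: dict, increment: int = 1,
                          requests_per_second: int = 500):
    """Return (path, body) for a throttled background update by query."""
    params = urlencode({
        "wait_for_completion": "false",              # respond immediately, run as a task
        "requests_per_second": requests_per_second,  # throttle to protect searches
        "conflicts": "proceed",                      # skip version conflicts instead of aborting
    })
    body = {
        # Assumed to be the same filter the user's search ran with.
        "query": query,
        "script": {
            "lang": "painless",
            "source": (
                "if (ctx._source.popularity == null) "
                "{ ctx._source.popularity = params.inc } "
                "else { ctx._source.popularity += params.inc }"
            ),
            "params": {"inc": increment},
        },
    }
    return f"/{index}/_update_by_query?{params}", json.dumps(body)


if __name__ == "__main__":
    path, body = build_update_by_query(
        "index-a", {"term": {"category": "shoes"}}, requests_per_second=200
    )
    print("POST", path)
    print(body)
```

With `wait_for_completion=false` the response contains a task id, which can be checked through the tasks API, and the rate of a running task can be changed later with the rethrottle endpoint.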

A general thing to think about for this use case, though: how will this information be useful to your overall goals? Tracking search "hit" counts for millions of documents seems like a lot of information that might not be entirely relevant. Maybe it would be more beneficial to track the top X results for a search, and also track the results that the user clicks on or views. That would allow you to determine the relevance of docs that way.

Thanks very much for taking the time to provide these options.

The use case we have in our application for data analysis requires this metric in order to compute and make sense of other relevant information.

But we will look at the Enterprise option as suggested.

Thanks again.
