We have a scenario where a domain entity undergoes very high-frequency updates, say a maximum of 100 updates per domain entity per minute. Currently we plan to index a summary of the domain information (the few attributes used in search) into Elasticsearch and use Elasticsearch for all search scenarios in the application.
With the domain getting updated this frequently, we see two approaches:
Approach #1: for every update, construct an update-by-query request and update the existing document in Elasticsearch in place.
Approach #2: for every update, get the latest state from Elasticsearch and index a new document. During search, fetch the latest document matching the search query.
In approach #1, the total number of documents in the index grows linearly with the number of domain entities created in the system. This keeps maintaining and archiving the index simple.
In approach #2, the total number of documents grows with every update to every domain entity in the system. This will cause the index to roll over once it reaches its maximum size, and many documents will be stale old versions that are never included in search. Those old-version documents have to be purged from the index, which means running a background job to do the purging.
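For approach #2, the "fetch the latest" search we have in mind is roughly this (only a sketch against the NEST 7.x client; `ProjectSummary`, the index name and the `EntityId`/`Version` fields are placeholders of ours):

```csharp
// Approach #2: every update indexes a NEW document, so search must return only
// the newest version of each entity. Field collapsing on the entity id, sorted
// by a version/timestamp we stamp on every write, gives one hit per entity.
var response = await client.SearchAsync<ProjectSummary>(s => s
    .Index("project-summaries")                              // placeholder index name
    .Query(q => q.Match(m => m.Field(f => f.Name).Query(searchText)))
    .Collapse(c => c.Field(f => f.EntityId))                 // EntityId must be a keyword field
    .Sort(so => so.Descending(f => f.Version))               // newest version wins the collapse
);
```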
While implementing approach #1, we are already encountering too many 409 conflicts under high traffic, although most of them succeed on retry. Because of this we are leaning towards approach #2.
Any recommendation on choosing the better approach? Is there any other approach that could be followed here?
I would suggest you use the bulk API for your updates, and set the retry_on_conflict parameter high enough to achieve success. It's also best to resolve some of these conflicts outside of ES first - e.g. rather than sending a bulk request that updates the same document several times, combine all those updates into a single write.
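For example, something along these lines with the NEST client (a sketch only; the index and field names are placeholders, `pendingChanges` stands for one already-coalesced change object per document, and exact descriptor names can differ slightly between client versions):

```csharp
// One bulk request, one update action per document, with retry_on_conflict set
// high enough that version conflicts are retried inside Elasticsearch instead
// of coming back to the application as 409s.
var bulkResponse = await client.BulkAsync(b => b
    .Index("project-summaries")                      // placeholder index name
    .UpdateMany(pendingChanges, (u, change) => u
        .Id(change.EntityId)
        .Doc(change)                                 // partial document merged into the existing one
        .RetriesOnConflict(5)));                     // maps to the retry_on_conflict action parameter

if (bulkResponse.Errors)
    foreach (var item in bulkResponse.ItemsWithErrors)
        Console.WriteLine($"{item.Id} failed: {item.Error?.Reason}");
```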
We are also thinking of combining updates over a certain window of time and performing a single update. This means the user will not always see the latest data, and we are working to convince the business of that trade-off.
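What we have in mind for that buffering is roughly this (again only a sketch; the class, window and member names below are ours, not an existing API):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Nest;

public class SummaryUpdateBuffer
{
    // Latest pending state per entity id; the last write within the window wins.
    private readonly ConcurrentDictionary<string, ProjectSummary> _pending =
        new ConcurrentDictionary<string, ProjectSummary>();

    public void Queue(ProjectSummary latest) => _pending[latest.EntityId] = latest;

    // Called from a timer, e.g. every few seconds: flush everything buffered so
    // far as one bulk request, like the bulk/retry_on_conflict sketch above.
    public async Task FlushAsync(IElasticClient client)
    {
        var batch = new List<ProjectSummary>();
        foreach (var key in _pending.Keys.ToArray())
            if (_pending.TryRemove(key, out var pendingDoc))
                batch.Add(pendingDoc);

        if (batch.Count == 0) return;

        await client.BulkAsync(b => b
            .Index("project-summaries")               // placeholder index name
            .UpdateMany(batch, (u, summary) => u
                .Id(summary.EntityId)
                .Doc(summary)
                .DocAsUpsert(true)
                .RetriesOnConflict(5)));
    }
}
```

With a two-to-five second window, the user sees data that is at most one window old, but the number of conflicting writes per document drops sharply.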
Any idea which design approach would be better suited in this case, considering the update frequency will only increase as the system grows older and more users are added?
We also have one scenario where the update frequency is very low, say a maximum of 50 per day, but every update results in around 40k documents being updated in Elasticsearch. A good example is a project location being changed, where every project associated with that location needs to be updated; in some cases we have seen 40k to 50k records being updated.
Most of the time Elasticsearch times out, and increasing the timeout didn't help. Would setting the wait_for_completion flag to false and tracking the task id asynchronously help here?
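For the location-change case, what we are planning looks roughly like this (a sketch against the .NET client 7.17; the index, field and variable names are our placeholders, and the exact descriptor/response member names are assumptions on our side):

```csharp
// Assumes: using Nest; using Elasticsearch.Net; using System; using System.Threading.Tasks;
// Fire the update-by-query without waiting for it to finish, then track the task.
// conflicts=proceed keeps the job going past occasional version conflicts
// instead of aborting the whole run.
var ubq = await client.UpdateByQueryAsync<ProjectSummary>(u => u
    .Index("project-summaries")                                          // placeholder index name
    .Query(q => q.Term(t => t.Field(f => f.LocationId).Value(locationId)))
    .Script(s => s
        .Source("ctx._source.locationName = params.name")
        .Params(p => p.Add("name", newLocationName)))
    .Conflicts(Conflicts.Proceed)          // don't abort on version conflicts
    .WaitForCompletion(false));            // returns immediately with a task id

var taskId = ubq.Task;

// Poll here, or store the task id and check it later from a background job.
GetTaskResponse task;
do
{
    await Task.Delay(TimeSpan.FromSeconds(5));
    task = await client.Tasks.GetTaskAsync(taskId);
} while (!task.Completed);
```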
Also, does sending 40k updates in a single update-by-query call impact Elasticsearch in any way?
My assumption so far has been that Elasticsearch is great for immutable data, and that updating too many documents, or updating them too frequently, might hurt Elasticsearch.
If we do it the right way, can Elasticsearch also handle large updates and high-frequency updates?
Elasticsearch can handle updates of a large number of documents quite well, but frequent updates of individual documents add a lot of overhead and can negatively affect performance.
Also note that a single update hitting 50k records is not really "large", and 100 updates per minute is not really "high frequency". Elasticsearch scales many orders of magnitude past these numbers.
David, on thinking further: the bulk API is for updating many documents at the same time, whereas here we are updating the same document repeatedly, with a few milliseconds between each update. Is it OK to use the bulk API for single-document updates? And is retry_on_conflict available only for the bulk API and not for the other APIs? We are using the .NET client 7.17 with UpdateByQueryAsync, on which I don't find any retry_on_conflict parameter.
It'd be better to avoid having multiple updates in flight for each document, especially if they're only a few milliseconds apart. I mean Elasticsearch will handle this ok but it's just an obvious recipe for conflicts. Instead, send one update and wait for it to complete, then send the next update containing all the changes that arrived in the meantime.
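On the parameter question: retry_on_conflict belongs to the single-document update API and to update actions in the bulk API; the update-by-query API doesn't have it and offers conflicts=proceed instead. In .NET terms the per-document path could look roughly like this (a sketch; the index name and the coalescing around it are placeholders, and descriptor names may differ slightly in your client version):

```csharp
// Keep at most one update in flight per document: coalesce whatever changes
// arrived since the last write into `mergedChanges`, send ONE update, await it,
// then repeat with the next batch of changes. retry_on_conflict covers the
// occasional remaining race.
var response = await client.UpdateAsync<ProjectSummary, object>(entityId, u => u
    .Index("project-summaries")        // placeholder index name
    .Doc(mergedChanges)                // partial document with the coalesced changes
    .RetryOnConflict(5));              // same retry_on_conflict as the REST _update API
```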