How is an update handled when an integer field is incremented?

Jason_Baumgartner · May 5, 2018, 5:51am

I was curious how Lucene handles updates to documents where an integer field is incremented. I am designing a system that will make a lot of script updates to documents by incrementing the value of an integer field and I'm concerned about write amplification on the SSD device.

My hope is that Lucene would be able to do a "hot update" by changing the bytes value within the document without having to delete the document, create a new one and then eventually merge that new document.

How is this handled by Lucene?

Christian_Dahlqvist · May 5, 2018, 6:49am

An update will result in the full document being reindexed as all segments that make up a shard in Elasticsearch are immutable. If you plan to update documents frequently, you are likely to encounter performance problems as Elasticsearch is not optimized for that. Have a look at this thread for more details.

Jason_Baumgartner · May 6, 2018, 10:29am

Christian,

Thanks for responding and providing the link to that thread. Much appreciated!

In this situation, I've decided to use Redis to hold values temporarily (for X minutes) for the field that gets updated in these documents. The trade-off is this: either keep real-time, up-to-the-second values for this specific field, at the cost of hammering the storage system and causing high I/O with constant updates, or hold the increments in Redis and periodically flush them back into Elasticsearch.

Basically, my use case is that I'm storing Reddit submission data, and there is a field called "num_comments" that records how many comments were made to a submission. Previously, every time a comment came in, I would update the corresponding submission document by incrementing the num_comments field by 1. To reduce I/O and the frequency of updates, I now use Redis to hold the submission id and increment a counter for num_comments in Redis, then flush every 5 minutes. So if a submission receives 40 new comments in that span of five minutes, the old method would require 40 separate updates to the submission document in Elasticsearch, each incrementing by 1. Now I flush every 5 minutes and make a single update to the submission document that increments the field by 40.
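For anyone who wants to see the shape of this, here is a minimal Python sketch of the buffering idea, not my production code. To keep it self-contained, a plain in-memory Counter stands in for Redis (in a real setup this would be something like a Redis hash incremented per submission id), and flush() only builds the action/body pairs for an Elasticsearch bulk request rather than sending them. The class name IncrementBuffer, the index name "submissions" and the id "abc123" are made up for illustration, and the exact bulk metadata fields depend on your Elasticsearch version.

```python
from collections import Counter


class IncrementBuffer:
    """Accumulates per-document increments between flushes (hypothetical helper)."""

    def __init__(self, index):
        self.index = index
        # Assumption: an in-memory Counter stands in for Redis here.
        self.pending = Counter()

    def record(self, doc_id, amount=1):
        """Called once per incoming comment instead of updating Elasticsearch."""
        self.pending[doc_id] += amount

    def flush(self):
        """Build one scripted update per document and reset the buffer.

        Returns a flat list of bulk lines: an action line followed by its body.
        """
        actions = []
        for doc_id, count in self.pending.items():
            actions.append({"update": {"_index": self.index, "_id": doc_id}})
            actions.append({
                "script": {
                    "lang": "painless",
                    "source": "ctx._source.num_comments += params.count",
                    "params": {"count": count},
                }
            })
        self.pending.clear()
        return actions


buf = IncrementBuffer("submissions")
for _ in range(40):          # 40 comments arrive within one flush interval
    buf.record("abc123")
bulk_lines = buf.flush()     # one update carrying count=40 instead of 40 updates
```

The document is still reindexed on each flush, so this does not avoid the rewrite itself; it only reduces how often it happens.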

So for other developers facing similar concerns related to this issue, you can reduce I/O contention and the number of updates to documents by using a service like Redis to cache the increments until you want to flush out those changes.

So, like everything else in life, there is a compromise, and in this case I've chosen to give up some of the "real-time"ness of the field in order to save I/O and reduce write amplification. This strategy works well if you can afford to have "near real-time" data.

In fact, you could layer Redis on top of the Elasticsearch data by supplementing the results returned by Elasticsearch with the increments still held in the cache. However, you lose the ability to search accurately on those fields until you flush.
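A rough sketch of that layering, again in Python and under the same assumptions: `pending` is a plain dict standing in for the counts you would read from Redis, and `hits` has the usual shape of search hits with "_id" and "_source" keys. The function name overlay_pending is hypothetical.

```python
def overlay_pending(hits, pending, field="num_comments"):
    """Return copies of the hits with unflushed increments added to `field`.

    Only the values shown to the reader change; the indexed values (and
    therefore any query or sort on the field) stay stale until the next flush.
    """
    merged = []
    for hit in hits:
        source = dict(hit["_source"])  # copy so the original hit is untouched
        source[field] = source.get(field, 0) + pending.get(hit["_id"], 0)
        merged.append({**hit, "_source": source})
    return merged


hits = [
    {"_id": "abc123", "_source": {"num_comments": 10}},
    {"_id": "def456", "_source": {"num_comments": 2}},
]
pending = {"abc123": 40}  # increments not yet flushed to Elasticsearch
display_hits = overlay_pending(hits, pending)
```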

