How is an update handled when an integer field is incremented?


(Jason Baumgartner) #1

I was curious how Lucene handles updates to documents where an integer field is incremented. I am designing a system that will make a lot of scripted updates to documents by incrementing the value of an integer field, and I'm concerned about write amplification on the SSD device.

My hope is that Lucene would be able to do a "hot update" by changing the bytes value within the document without having to delete the document, create a new one and then eventually merge that new document.

How is this handled by Lucene?


(Christian Dahlqvist) #2

An update will result in the full document being reindexed, as all segments that make up a shard in Elasticsearch are immutable. If you plan to update documents frequently, you are likely to encounter performance problems, as Elasticsearch is not optimized for that. Have a look at this thread for more details.


(Jason Baumgartner) #3

Christian,

Thanks for responding and providing the link to that thread. Much appreciated!

In this situation, I've decided to use Redis to hold the values temporarily (for X minutes), keyed by the documents that get updated. The trade-off is between keeping this specific field accurate to the second, at the cost of hammering the storage system with constant updates and high I/O, and holding the increments in Redis and flushing them back into Elasticsearch periodically.

Basically, my use case is that I'm storing Reddit submission data, and there is a field called "num_comments" that records how many comments have been made on that submission. Previously, every time a comment came in, I updated the corresponding submission document by incrementing the num_comments field by 1. To reduce I/O and the frequency of updates, I now keep a counter in Redis keyed by submission id, increment it there, and flush every 5 minutes. So if a submission received 40 new comments in that five-minute span, the old method would require 40 separate updates to the submission document in Elasticsearch, each incrementing the field by 1. Now I flush every 5 minutes and make a single update to the submission document that increments the field by 40.
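
A minimal sketch of that batching, for illustration only. To keep it runnable without any services, an in-memory Counter stands in for Redis, and the class name, function names, and the `apply_update` callback are all hypothetical. With a real Redis instance, `record` would map to an atomic increment such as HINCRBY, and `apply_update` would send the body to the Elasticsearch update API. The body shown uses the `source` key for the script; the exact shape depends on your Elasticsearch version.

```python
from collections import Counter
from typing import Callable, Dict


class CommentCountBuffer:
    """Accumulates num_comments increments between flushes.

    Assumption: a Counter stands in for Redis so the sketch runs standalone.
    """

    def __init__(self) -> None:
        self._pending: Counter = Counter()

    def record(self, submission_id: str, amount: int = 1) -> None:
        # Called once per incoming comment instead of updating Elasticsearch.
        self._pending[submission_id] += amount

    def flush(self, apply_update: Callable[[str, int], None]) -> int:
        # Swap the pending counts out first, so anything recorded while the
        # flush is running lands in the next batch rather than being lost.
        batch, self._pending = self._pending, Counter()
        for submission_id, delta in batch.items():
            apply_update(submission_id, delta)  # one update per document
        return len(batch)


def build_update_body(delta: int) -> Dict:
    """Body for one scripted update that adds `delta` to num_comments."""
    return {
        "script": {
            "lang": "painless",
            "source": "ctx._source.num_comments += params.delta",
            "params": {"delta": delta},
        }
    }
```

With this shape, 40 calls to `record` for the same submission followed by one `flush` produce a single call to `apply_update` with a delta of 40, which is the one update per document per interval described above. In a real deployment the flush would run on a timer (every 5 minutes in my case).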

So for other developers facing similar concerns related to this issue, you can reduce I/O contention and the number of updates to documents by using a service like Redis to cache the increments until you want to flush out those changes.

Like everything else in life, there is a compromise. In this case, I've chosen to give up some of the "real-time"-ness of the field in exchange for less I/O and less write amplification. This strategy works well if you can afford to have "near real-time" data.

In fact, you could layer Redis on top of the Elasticsearch data by adding the increments still held in the cache to the values returned by Elasticsearch. However, you lose the ability to search accurately on those fields until you flush.
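
A rough sketch of that layering, assuming the search hits have the usual `_id` and `_source` keys and that the unflushed increments have already been read from the cache into a plain mapping of document id to count. The function name and the `pending` argument are hypothetical.

```python
from typing import Dict, Iterable, List, Mapping


def overlay_pending_counts(
    hits: Iterable[Mapping],
    pending: Mapping[str, int],
    field: str = "num_comments",
) -> List[Dict]:
    """Return copies of the hits with unflushed increments added to `field`."""
    merged = []
    for hit in hits:
        source = dict(hit["_source"])  # copy so the original hit is untouched
        source[field] = source.get(field, 0) + pending.get(hit["_id"], 0)
        merged.append({**hit, "_source": source})
    return merged
```

This only corrects the values shown to the reader. Queries, sorts, and aggregations on num_comments still run against the stale indexed value until the next flush.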

