Quick question: is there any performance difference between updating existing records versus deleting the index to clean out all records and just reinserting everything? Right now, when we want to do a refresh, we blow away the index and push all the records back in, because we let Elasticsearch generate the IDs automatically. If we assigned meaningful IDs to our records and used those as part of the insert, would it be much faster to push all the records back in as updates? I know we are not using the best methodology for getting our records in, and we will fix that, but I am more curious about any performance difference between an update and a reinsert, given that an update in Elasticsearch is actually a complete reinsertion of the record anyway. Any thoughts would be appreciated. Thanks.
It'd only be faster if you knew you could get away with updating only a portion of the documents. If you have to update all of the documents, what you are doing now is faster. As in lots of data storage technologies, updating a document is an atomic delete and insert (we say "index"). But unlike storage technologies that use tricks like HOT and redo logs to make those operations cheaper, Elasticsearch doesn't have anything like that. And deleted documents hang around in the index until the automatic "merge" process, which merges together the write-once segments the index is made of, gets around to rewriting the segments that contain them.
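To see that delete-and-reindex behavior, here's a rough sketch using the Python elasticsearch client (8.x-style keyword arguments; the index name and field values are made up for illustration): after re-indexing a document under the same ID, the old copy shows up in the index stats as a deleted doc until a merge cleans it up.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder URL

# Index a document under an explicit ID, then "update" it by indexing it again.
es.index(index="my-index", id="1", document={"status": "new"})
es.indices.refresh(index="my-index")  # flush the first version into a segment
es.index(index="my-index", id="1", document={"status": "changed"})
es.indices.refresh(index="my-index")

# The old version isn't rewritten in place; it sits in its write-once segment
# flagged as deleted until a merge reclaims the space.
docs = es.indices.stats(index="my-index")["indices"]["my-index"]["primaries"]["docs"]
print(docs["count"], docs["deleted"])  # typically 1 live doc and 1 deleted doc
```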
There are also optimizations that make indexing with auto-generated IDs faster than indexing with IDs you supply yourself.
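Roughly, the two ingest styles you're comparing look like this (a sketch with the Python client's bulk helper; "my-index" and the record shape are placeholders):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder URL
records = [{"record_id": i, "body": f"record {i}"} for i in range(1000)]

# Auto-generated IDs: Elasticsearch can skip the "does this ID already exist?"
# check on each document, which is part of why this path is faster.
helpers.bulk(es, ({"_index": "my-index", "_source": r} for r in records))

# Explicit IDs: every operation has to look for an existing document with that
# ID, and re-pushing everything this way is a delete + reindex per document.
helpers.bulk(
    es,
    ({"_index": "my-index", "_id": str(r["record_id"]), "_source": r} for r in records),
)
```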
Finally, the _update API is essentially a way of saving round trips in the GET, modify, PUT sequence that some document updates amount to, so it probably isn't what you want here either.
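In other words, something like this (again a sketch with the Python client; index and field names are placeholders): the _update call replaces the GET/modify/index round trips, but underneath it is still a full reindex of the document.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder URL

# The long way: GET the document, modify it client-side, index it back.
doc = es.get(index="my-index", id="1")["_source"]
doc["status"] = "processed"
es.index(index="my-index", id="1", document=doc)

# The _update API does the same merge server-side in one round trip,
# but under the hood it is still a delete + full reindex of the document.
es.update(index="my-index", id="1", doc={"status": "processed"})
```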