When an ES index stores millions (many millions) of documents (ES version 1.4.4; 5 primary shards with 1 replica shard per index) and you have many document/data updates in short periods of time:
Would you make an update per document, or would you use a kind of "revision management" and insert a new document with each change? (Considering the performance aspect.)
Behind the scenes, an update means deleting the existing document and then adding a new one, so in theory a plain add should be better, especially when you are dealing with millions of documents.
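For illustration, a minimal sketch of the two write paths with the Python elasticsearch client; the `people` index, `person` type and field names are made up, and the 1.x-style `doc_type` argument is assumed:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Add-only / revision style: every change is indexed as a brand-new document
# (Elasticsearch assigns a fresh _id each time).
es.index(index="people", doc_type="person",
         body={"person_id": "jane-doe", "address": "1 Old Street", "version": 1})

# Update style: the existing document is modified in place. Internally
# Elasticsearch marks the old version as deleted and indexes a new one.
es.update(index="people", doc_type="person", id="jane-doe",
          body={"doc": {"address": "2 New Street"}})
```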
Add-only leaves you with "duplicated documents", meaning there will be two or more documents in the index with very similar contents (or minor differences). The definition of a dup varies with the data domain and business needs, so don't take it personally when someone says "your definition of a dup is wrong".
For example, say a document is about a person and his/her address: version 1 has one address, version 2 has a different address. Ask yourself what your business wants to do with this. If it only cares about the most up-to-date address, then you need to do an update, not an add. If it wants to keep a history of one's addresses, then you need to do an add.
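One practical consequence of the add-only approach: reading "the current address" means asking for the newest revision, e.g. by sorting on a version or timestamp field. A rough sketch, again with the Python client and made-up field names (`person_id` is assumed to be mapped as not_analyzed so the term query matches):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Fetch the most recent revision for one person by sorting on the version field.
latest = es.search(index="people", doc_type="person", body={
    "query": {"term": {"person_id": "jane-doe"}},
    "sort": [{"version": {"order": "desc"}}],
    "size": 1,
})
current_address = latest["hits"]["hits"][0]["_source"]["address"]
```

With updates you can simply GET the document by its id instead.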
Thanks for your answer. I want to use only update operations, partly because of the duplicate-data problem, but I need arguments for this (e.g. that performance is not significantly worse even under high system utilization or a high number of queries). Preferably with statistics or benchmarks to prove it...
Regards
Now that you know what happens when doing an UPDATE, and since you said you want an UPDATE, I suggest you gather the metrics based on your own data and share the results here.
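If it helps, a benchmark along these lines is roughly what I mean; the index names, document shape and counts are placeholders, and it assumes the Python client pointed at your 1.4.4 cluster:

```python
import time
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
N = 10_000               # change events to replay; scale this to your real volume
DISTINCT_PEOPLE = 1_000  # changes are spread across this many logical documents

# Variant A: add-only, every change becomes a new document.
start = time.time()
for i in range(N):
    es.index(index="bench_add", doc_type="person",
             body={"person_id": i % DISTINCT_PEOPLE, "address": "street %d" % i})
print("add-only: %.1fs" % (time.time() - start))

# Variant B: in-place updates of the same logical documents.
# The documents are created first so update() always has a target.
for p in range(DISTINCT_PEOPLE):
    es.index(index="bench_upd", doc_type="person", id=p, body={"address": "initial"})

start = time.time()
for i in range(N):
    es.update(index="bench_upd", doc_type="person", id=i % DISTINCT_PEOPLE,
              body={"doc": {"address": "street %d" % i}})
print("update: %.1fs" % (time.time() - start))
```

Run something like this (or a bulk variant) against data that looks like yours, under your real query load; those numbers will be far more convincing than any generic benchmark.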