Use case: why document-level metadata would be really handy:
We store big documents with lots of nested child documents, so updating a document is very expensive.
When a document is "synced" from a persistent data store, it often doesn't actually change. This is a noop in Elasticsearch and very cheap.
If we store "lastSync" on the document, a full reindex is always triggered and the load on our cluster increases drastically.
Adding document-level metadata that can be changed without triggering a reindex would solve this use case.
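To make the noop point concrete, here is a minimal local sketch in Python. It assumes the sync goes through the update API with a partial `doc`, where Elasticsearch merges the partial document into the stored source and skips the write if nothing changed. The functions `merge_partial` and `is_noop` are hypothetical helpers that only model that comparison; they are not Elasticsearch's implementation, and the sample documents are made up.

```python
def merge_partial(source, partial):
    """Recursively merge a partial document into a copy of the stored source."""
    merged = dict(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_partial(merged[key], value)
        else:
            merged[key] = value
    return merged


def is_noop(source, partial):
    """True if applying the partial document would leave the source unchanged."""
    return merge_partial(source, partial) == source


# Hypothetical stored document, including the timestamp of the previous sync.
stored = {
    "title": "Report",
    "children": [{"id": 1}, {"id": 2}],
    "lastSync": "2024-01-01T00:00:00Z",
}

# A sync that carries the same content: nothing changes, so no reindex.
unchanged = {"title": "Report", "children": [{"id": 1}, {"id": 2}]}
print(is_noop(stored, unchanged))  # True

# The same sync with a fresh lastSync: the source differs, so it is reindexed.
with_timestamp = {**unchanged, "lastSync": "2024-01-02T00:00:00Z"}
print(is_noop(stored, with_timestamp))  # False
```

The second call is the problem described above: the only field that differs is the timestamp, yet the whole document, nested children included, counts as changed.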
In a previous thread discussing this here, it is mentioned that there is already an open issue tracking this requirement. However, I wasn't able to find it on GitHub. Does anyone have the link handy?
Would love to see this feature. Thank you very much for your time and consideration!
We want to keep track of the last sync without actually updating the document. But to set the version to the new sync timestamp we would have to update the document.
Or can you update the document version independently without triggering a resync, i.e., would that count as a noop?
If you keep track of the sync timestamp and send an update with this as the external version, using version type 'external_gte', I would expect the update to get rejected, which should basically be a noop. You would then need to handle the error and identify this as a noop. I have not tested this though.
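A rough Python sketch of the external-versioning idea follows. It is untested against a real cluster. The index name, document ID, and helper names are hypothetical; the `version` and `version_type` query parameters and the HTTP 409 version-conflict response are standard for the index API. One thing worth checking: `external_gte` accepts a version equal to the stored one, so an unchanged timestamp would still be reindexed, while plain `external` requires a strictly greater version and would reject it.

```python
from urllib.parse import urlencode


def versioned_index_request(index, doc_id, version, version_type="external"):
    """Build the method and path for an index request with an external version."""
    query = urlencode({"version": version, "version_type": version_type})
    return "PUT", f"/{index}/_doc/{doc_id}?{query}"


def would_accept(stored_version, incoming_version, version_type):
    """Model the version check performed for external version types."""
    if version_type == "external":
        return incoming_version > stored_version
    if version_type == "external_gte":
        return incoming_version >= stored_version
    raise ValueError(f"unsupported version_type: {version_type}")


def is_version_noop(status_code):
    """Treat a version conflict (HTTP 409) as 'document already up to date'."""
    return status_code == 409


# Hypothetical example: the source document's update timestamp as the version.
method, path = versioned_index_request("docs", "42", 1700000000)
print(method, path)  # PUT /docs/_doc/42?version=1700000000&version_type=external

print(would_accept(1700000000, 1700000000, "external"))      # False
print(would_accept(1700000000, 1700000000, "external_gte"))  # True
```

Either way, a rejected request leaves both the stored document and its version untouched, so this only avoids redundant writes; it does not record when the last sync happened.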
Well, the whole point would be to update the version without updating the document and having it reindex. Having the update rejected or ignored as a noop doesn't update the version, as far as I understand.
We need to have meta information updated and available on the document without unnecessarily reindexing the document.
You seem to suggest keeping the meta information separate. That is an option. However, we need to have it available with the document, so this would require us to do a secondary query against another data store. Not ideal.
Unless I'm not understanding you correctly, what you are suggesting doesn't seem to be a solution to our problem? Looking forward to your clarification.
I may very well have misunderstood the problem. I assumed you had an update timestamp on the source document that would only make them index if this was newer than what existed in Elasticsearch. The sync timestamp would therefore be per document rather than for the operation so you would not need to update documents that are not otherwise updated.
Given that documents are stored in immutable segments, I think it would be tricky to implement the concept of document metadata that has a separate lifecycle from the document it relates to.
Given that the source is stored together with the indexed data in immutable segments, I do not think what you are suggesting is possible. You might however be able to use a parent-child relationship here and store the metadata as a child to the large not indexed parent (or vice versa).
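As a sketch of what the parent-child suggestion could look like, the Python below builds the request bodies for a `join` field. The index name `docs`, the field name `relation`, the relation names `document` and `sync_meta`, and the `-meta` ID suffix are all made up for illustration. The `join` field type and the requirement to route a child to its parent's shard are standard Elasticsearch behaviour.

```python
import json


def join_mapping():
    """Mapping with a join field: each 'document' may have 'sync_meta' children."""
    return {
        "mappings": {
            "properties": {
                "relation": {
                    "type": "join",
                    "relations": {"document": "sync_meta"},
                }
            }
        }
    }


def parent_body(source):
    """Body for the large parent document; written only when its content changes."""
    return {**source, "relation": {"name": "document"}}


def child_meta_request(parent_id, last_sync):
    """Path and body for the small metadata child, routed to the parent's shard."""
    path = f"/docs/_doc/{parent_id}-meta?routing={parent_id}"
    body = {
        "lastSync": last_sync,
        "relation": {"name": "sync_meta", "parent": parent_id},
    }
    return path, body


path, body = child_meta_request("42", "2024-01-02T00:00:00Z")
print(path)  # /docs/_doc/42-meta?routing=42
print(json.dumps(body, sort_keys=True))
```

Each sync would then rewrite only the small child document. The trade-off is that the metadata no longer lives in the parent's source, so reading both together takes a parent-child query or a second lookup.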