Document-level metadata (noop)

Use Case - Why document-level metadata would be really handy:

  • We store big documents with lots of nested child documents, so updating a document is very expensive
  • When a document is "synced" from a persistent data store, it often doesn't actually change. This is a noop in Elasticsearch and very cheap
  • If we store "lastSync" on the document, a full reindex is always triggered and the load on our cluster increases drastically
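To make the last two points concrete, here is a simplified model of how a partial update ends up as either a noop or a real update. This is only a sketch of the idea, not Elasticsearch's actual implementation; the function names and every sample field except "lastSync" are made up for illustration.

```python
from copy import deepcopy


def merge(existing, partial):
    """Merge `partial` into a copy of `existing`: objects merge recursively, other values replace."""
    merged = deepcopy(existing)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def partial_update(existing, partial):
    """Return the merged document and 'noop' if nothing changed, else 'updated'."""
    merged = merge(existing, partial)
    result = "noop" if merged == existing else "updated"
    return merged, result


# Hypothetical stored document with nested children and a sync timestamp.
stored = {"title": "Big document", "children": [{"id": 1}, {"id": 2}], "lastSync": 1000}

# A sync that carries no real change is detected as a noop.
_, unchanged = partial_update(stored, {"title": "Big document"})

# The same sync with a fresh timestamp always counts as a change.
_, touched = partial_update(stored, {"title": "Big document", "lastSync": 2000})

print(unchanged, touched)
```

In this model the only difference between the cheap and the expensive case is the timestamp itself, which is exactly the problem described above.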

Adding document-level metadata that can be changed without triggering a reindex would solve this use case.

In a previous thread discussing this here, it is mentioned that there is already an open issue tracking this requirement. However, I wasn't able to find it on GitHub. Does anyone have the link handy?

Would love to see this feature. Thank you very much for your time and consideration!

Can you not perhaps solve this through external versioning based on the sync timestamp?

How would that work?

We want to keep track of the last sync without actually updating the document. But to set the version to the new sync timestamp we would have to update the document.

Or can you update the document version independently without triggering a resync, i.e. would that count as a noop?

If you keep track of the sync timestamp and send an update with this as the external version, using the version type 'external_gte', I would expect the update to get rejected, which should basically be a noop. You would then need to handle the error and identify this as a noop. I have not tested this though.
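For reference, the version check behind this idea can be modelled in a few lines. This is a sketch based on the documented semantics of the 'external' and 'external_gte' version types on the index API, not a test against a real cluster; the function name is hypothetical.

```python
def index_accepted(stored_version, incoming_version, version_type):
    """Model whether an index request with an external version would be accepted.

    stored_version is None when the document does not exist yet.
    """
    if stored_version is None:
        return True
    if version_type == "external":
        # Accepted only if the incoming version is strictly newer.
        return incoming_version > stored_version
    if version_type == "external_gte":
        # Accepted if the incoming version is newer or equal.
        return incoming_version >= stored_version
    raise ValueError(f"unsupported version type: {version_type}")


# Using a timestamp as the version: an older timestamp is rejected in both modes.
print(index_accepted(2000, 1000, "external_gte"))  # False
print(index_accepted(2000, 1000, "external"))      # False

# An equal timestamp is accepted by external_gte but rejected by external.
print(index_accepted(2000, 2000, "external_gte"))  # True
print(index_accepted(2000, 2000, "external"))      # False
```

A rejected request would surface as a version conflict that the client has to catch and treat as a noop. Note that in this model an unchanged timestamp is only rejected with 'external'; with 'external_gte' it is accepted and the document is indexed again.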

Well, the whole point would be to update the version without updating the document and having it reindex. Having the update rejected or ignored as a noop doesn't update the version, as far as I understand.

Here is a test that shows noop doesn't update the version: https://github.com/loopmediagroup/es-alchemy/blob/master/test/util/rest/data/version.spec.js#L39

We need to have meta information updated and available on the document without unnecessarily reindexing the document.

You seem to suggest keeping the meta information separate. That is an option. However we need to have it available with the document. So this would require us to do a secondary query against another data store. Not ideal.

Unless I'm not understanding you correctly, what you are suggesting doesn't seem to be a solution to our problem? Looking forward to your clarification.

I may very well have misunderstood the problem. I assumed you had an update timestamp on the source documents, so that they would only be indexed if this was newer than what existed in Elasticsearch. The sync timestamp would therefore be per document rather than for the operation, so you would not need to update documents that are not otherwise updated.

Given that documents are stored in immutable segments, I think it would be tricky to implement the concept of document metadata that has a separate lifecycle from the document it relates to.

Is there potentially a way to disable reindexing if only non-indexed fields have changed? I.e. improve how a noop is detected?

I've created a ticket for that here:

Given that the source is stored together with the indexed data in immutable segments, I do not think what you are suggesting is possible. You might, however, be able to use a parent-child relationship here and store the metadata as a child of the large not indexed parent (or vice versa).
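A rough sketch of what that could look like with a join field follows. The field name "relation", the relation names "document" and "meta", and the ID scheme are assumptions chosen for illustration; only the general shape (a join mapping, and a child that names its parent and is routed to the parent's shard) follows the join field documentation.

```python
def join_mapping():
    """Index mapping with a join field relating large documents to small meta children."""
    return {
        "mappings": {
            "properties": {
                "relation": {"type": "join", "relations": {"document": "meta"}}
            }
        }
    }


def parent_request(doc_id, source):
    """Index request for the large parent document (written only when it really changes)."""
    return {"id": doc_id, "body": {**source, "relation": {"name": "document"}}}


def meta_request(parent_id, last_sync):
    """Index request for the small meta child; routing must point at the parent's shard."""
    return {
        "id": f"{parent_id}-meta",  # hypothetical ID scheme, one meta child per parent
        "routing": parent_id,
        "body": {
            "lastSync": last_sync,
            "relation": {"name": "meta", "parent": parent_id},
        },
    }


parent = parent_request("doc-1", {"title": "Big document"})
meta = meta_request("doc-1", 2000)
print(meta["routing"], meta["body"]["relation"])
```

With this layout each sync would rewrite only the small meta child, while the parent and its nested documents are left alone.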

Oh that is a very interesting idea!

So when I add a child to a parent, the parent doesn't get re-indexed? The cost of retrieving the "meta" child with a parent should be relatively low.
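One way the retrieval might look is a has_child query with inner_hits, so the meta child comes back alongside its parent in a single search. This is an untested sketch; the child relation name "meta" and the function name are assumptions, and a parent that has no meta child yet would not match this query.

```python
def parent_with_meta_query(parent_id):
    """Search body that fetches one parent by ID together with its meta child."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"ids": {"values": [parent_id]}},
                    {
                        "has_child": {
                            "type": "meta",  # assumed child relation name
                            "query": {"match_all": {}},
                            "inner_hits": {},  # return the matching child with the parent hit
                        }
                    },
                ]
            }
        }
    }


body = parent_with_meta_query("doc-1")
print(body["query"]["bool"]["filter"][1]["has_child"]["type"])
```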

I'll be taking a look at this for sure!