Best Practice: Update metadata on larger documents

I have a question about best practice in the following scenario:
I have documents with several metadata fields and, among others, a full-text field that can be up to 10 MB in size. We currently do not use parent/child relationships.

"_source" : {
  "title" : {
    "de" : "Lorem ipsum dolor sit amet"
  },
  "subtitle" : {
    "de" : "Lorem ipsum dolor sit amet"
  },
  "groups" : [
    "g090",
    "g328",
    "g007"
    ...
  ],
  "abstract" : {
    "de" : "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam"
  },
  "fulltext" : "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren,"
  ...
}

The documents can be assigned to different groups in our backend. A reassignment can affect several thousand documents at once, but does not happen all the time. These assignments should be reflected in Elasticsearch as soon as possible. Currently we use the _update API for this. Since - as far as I understand - the documents are internally deleted, recreated and reindexed by the update process, this is very time and memory intensive.
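For illustration, this is roughly what such a group reassignment looks like as partial updates via the bulk API (the index name and document ids are placeholders). Even though only the groups field is sent, Elasticsearch still has to reindex each complete document internally:

POST /_bulk
{ "update": { "_index": "documents", "_id": "doc-1" } }
{ "doc": { "groups": [ "g090", "g328" ] } }
{ "update": { "_index": "documents", "_id": "doc-2" } }
{ "doc": { "groups": [ "g090", "g328" ] } }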
One idea was to separate the metadata from the full text (parent/child). However, this would come at the cost of search speed, and it is not clear to me how large that penalty would be.

Is there a general approach (mapping, update, query) for dealing with larger documents whose metadata occasionally needs to be updated in large batches in a timely manner?

Unfortunately, on the internet I can only find scenarios where small documents need to be updated extremely frequently.

Thanks and best regards

There is no magic solution; you have outlined the two main options and their respective drawbacks. There are no in-place updates in Elasticsearch, as all segments are immutable, so if you update a large document the whole thing needs to be reindexed, which can be costly. I have seen metadata, e.g. access lists, broken out using parent-child, but as you correctly stated this affects query complexity, resource usage and latency.
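A rough sketch of what such a split could look like with a join field (index, field and relation names are only examples): the small metadata document becomes the parent and the large full text the child, so a group change only reindexes the small parent document. Queries then need has_parent / has_child clauses and routing to the parent, which is where the extra complexity and latency come from.

PUT /documents
{
  "mappings": {
    "properties": {
      "doc_relation": { "type": "join", "relations": { "metadata": "fulltext" } },
      "title":    { "properties": { "de": { "type": "text" } } },
      "groups":   { "type": "keyword" },
      "fulltext": { "type": "text" }
    }
  }
}

GET /documents/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "fulltext": "lorem" } },
        {
          "has_parent": {
            "parent_type": "metadata",
            "query": { "term": { "groups": "g090" } }
          }
        }
      ]
    }
  }
}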

Another question in this context: we are currently still on version 7.17.10. Are there any known performance advantages in the 8.x releases? Or will there be new solutions for this in Elasticsearch in the future? I find these update restrictions very unsatisfactory.

This is fundamental to how Lucene works, so I would not expect any major changes here. I am not aware of any improvements in this area in recent versions, but I will leave it to someone from Elastic to comment, as they will have a better idea.

Have you tried using parent-child and evaluated what the latency difference is?

No, not yet. So far the problem has been limited, but it is becoming more significant. It is also somewhat difficult to test because our searches can be complex; to make a real assessment, I would first have to convert everything to parent-child.
That's why I asked here first whether it makes sense to go down this path at all. If there were a rough rule of thumb, for example that a query would be at least twice as slow, I would skip it and continue to use the _update API - perhaps driven by a queue that I control myself, rather than update-by-query.
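Just so we are talking about the same thing: the update-by-query variant I would rather avoid looks roughly like this (index name and ids are again placeholders), while the queue-driven variant would send the same kind of partial updates via _bulk as shown above, just in batches whose size and timing I control:

POST /documents/_update_by_query
{
  "query": { "terms": { "_id": [ "doc-1", "doc-2" ] } },
  "script": {
    "source": "ctx._source.groups = params.groups",
    "lang": "painless",
    "params": { "groups": [ "g090", "g328" ] }
  }
}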

You have unusually large documents, so it is hard to say what impact parent-child would have without benchmarking.
