Best way to store document chunks for vector search as a production standard

Hi, working on a RAG setup and trying to land on a sensible production architecture for chunk storage and retrieval. Curious what others are running at scale.

Large documents get split into chunks at ingestion, and each chunk gets a vector embedding. The parent document has metadata that may change over time. The chunk text and vectors should stay the same after indexing.
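For context, a minimal sketch of that ingestion step (the chunk size, overlap, and `embed` stub are illustrative placeholders, not our actual pipeline):

```python
# Fixed-size character windows with overlap; values are illustrative.
CHUNK_SIZE = 500
OVERLAP = 50

def split_into_chunks(text: str) -> list[str]:
    # Each chunk starts CHUNK_SIZE - OVERLAP chars after the previous one,
    # so consecutive chunks share OVERLAP characters.
    step = CHUNK_SIZE - OVERLAP
    return [text[i:i + CHUNK_SIZE] for i in range(0, max(len(text) - OVERLAP, 1), step)]

def embed(chunk: str) -> list[float]:
    # Placeholder for a real embedding model call.
    raise NotImplementedError

doc = "x" * 1200
chunks = split_into_chunks(doc)
# 1200 chars with 450-char steps -> chunks starting at offsets 0, 450, 900
```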

We've looked at three approaches:

Flat chunks (each chunk is its own document with a parent_id field): the relationship between chunk and parent exists only on the application side; the engine has no awareness of it at all. So beyond basic indexing, the application has to manage the full lifecycle: grouping search results by parent, picking the best-scoring chunk, extracting the matched text, over-fetching so enough results survive deduplication, cleaning up orphan chunks when a parent is deleted, and keeping parent metadata in sync on every chunk. On top of that, any parent field used as a search filter has to be copied onto every chunk document, so changing it means updating potentially hundreds of documents at once.
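To make the application-side burden concrete, here's a rough sketch of just the grouping/dedup step under the flat model. The hit shape mirrors Elasticsearch's `{"_score": ..., "_source": {...}}` responses; the helper itself and the `parent_id` field name are our own, hypothetical:

```python
# Collapse flat chunk hits into one result per parent, keeping the
# best-scoring chunk for each. This is logic the engine would handle
# for us under the nested model.
def best_chunk_per_parent(hits: list[dict], limit: int) -> list[dict]:
    best: dict[str, dict] = {}
    for hit in hits:
        pid = hit["_source"]["parent_id"]
        if pid not in best or hit["_score"] > best[pid]["_score"]:
            best[pid] = hit
    # Re-sort the deduplicated winners and truncate to the page size.
    return sorted(best.values(), key=lambda h: h["_score"], reverse=True)[:limit]

hits = [
    {"_score": 0.9, "_source": {"parent_id": "a", "text": "chunk a1"}},
    {"_score": 0.8, "_source": {"parent_id": "b", "text": "chunk b1"}},
    {"_score": 0.7, "_source": {"parent_id": "a", "text": "chunk a2"}},
]
top = best_chunk_per_parent(hits, limit=10)
# Three hits collapse to two parents: we had to over-fetch to end up
# with enough distinct results after deduplication.
```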

Nested (chunks as nested objects on the root document): the relationship is managed by the engine, which is the main appeal. The engine handles parent deduplication natively and returns the parent document directly from a chunk-level vector search, so no grouping logic is needed on our side. Parent-level filters also work without copying fields onto every chunk. What we're less sure about is production behaviour: the docs mention a performance overhead for nested queries compared to flat, and updating any field on the parent rewrites the whole block, including all nested chunks. For frequent metadata updates on large documents, is this a real problem in practice or not noticeable?
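For reference, a nested mapping along these lines might look like the following. The index name, field names, `dims`, and `similarity` are illustrative, not from our actual setup; adjust to your embedding model:

```json
{
  "mappings": {
    "properties": {
      "title":  { "type": "keyword" },
      "status": { "type": "keyword" },
      "chunks": {
        "type": "nested",
        "properties": {
          "text": { "type": "text" },
          "embedding": {
            "type": "dense_vector",
            "dims": 384,
            "index": true,
            "similarity": "cosine"
          }
        }
      }
    }
  }
}
```

Parent-level fields like `status` live once on the root document, which is exactly what avoids the copy-onto-every-chunk problem of the flat model.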

Parent/Child join: we looked at this briefly and dropped it. The docs explicitly say has_child/has_parent queries add significant overhead, and there are threads here with 12+ second query times even on small datasets.

So the question is: for this kind of chunk storage setup, is nested the standard approach now? The documentation seems to push in that direction. Or is the nested query overhead actually noticeable in production, so teams prefer to deal with the additional logic on the application side?

Hi @grunggy:

Nested fields are the way to go for storing chunks in dense_vector fields.

Besides simplifying retrieval of the parent doc, nested fields allow the use of inner_hits to retrieve the best-scoring chunks, which is something you'd likely want for highlighting or as part of your results. Check the docs for more details.
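As a sketch, a chunk-level kNN search with inner_hits against a nested field like the hypothetical `chunks.embedding` above might look like this in recent 8.x versions (the 3-dim query_vector is a stand-in for a real embedding):

```json
{
  "knn": {
    "field": "chunks.embedding",
    "query_vector": [0.12, -0.03, 0.57],
    "k": 10,
    "num_candidates": 100,
    "inner_hits": {
      "size": 3,
      "_source": ["chunks.text"]
    }
  },
  "_source": ["title", "status"]
}
```

Each hit is the deduplicated parent document, with its best-matching chunks returned under `inner_hits` - no application-side grouping needed.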

Hope that helps!

What is the frequency of these updates? The nested option is, as Carlos pointed out, likely the best option, but if it results in very large documents with many nested objects, updates can get quite expensive. This can definitely be a very real problem in practice. It can also affect search if very large documents need to be parsed and/or returned. Elasticsearch is IMHO not optimized for working with very large documents, so it may be worthwhile to determine what the maximum document size would be and run some benchmarks - both for querying and updating.
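To illustrate why update frequency matters: even a metadata-only partial update like the one below rewrites the entire root document, nested chunks and their vectors included, because Elasticsearch updates are delete-and-reindex under the hood (the doc id and field are hypothetical):

```json
{
  "doc": { "status": "archived" }
}
```

Sent to `_update/<doc-id>`, this one-field change costs roughly as much as reindexing the whole document, so the cost scales with the number and size of nested chunks, not with the size of the change.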