Accessing Nested Documents Performance

Hello,

It is stated clearly that:

Because nested documents are indexed as separate documents, they can only be accessed within the scope of the nested query, the nested/reverse_nested aggregations, or nested inner hits.

This feels right since we do know that we MUST wrap our queries and aggregations with appropriate NESTED clauses (nested query, nested aggregation and so forth) before we can even access the nested documents.

My question is about the performance penalty of accessing nested documents. I do know that nested documents are indexed in the same segments as their root documents for a performance boost. But let's see what happens in the following scenario:

  • Let's say we have 10,000,000 root documents in our index. Only a portion of the root documents (about 33%) include nested documents. When one does, it is related to about 25 nested documents on average.

  • Let's say our query looks something like this: Query
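To put rough numbers on the scenario above, here is a back-of-the-envelope sketch. It assumes that every nested document is stored as its own document next to its root, as the quoted documentation describes; the variable names are only illustrative.

```python
# Rough sizing of the scenario described above (illustrative names only).
root_docs = 10_000_000          # root documents in the index
avg_nested_per_root = 25        # average nested documents per such root

# About 33% of the root documents carry nested documents.
roots_with_nested = root_docs * 33 // 100

# Each nested document is indexed as a separate (hidden) document.
nested_docs = roots_with_nested * avg_nested_per_root

# Total documents the index physically holds: roots plus nested.
total_docs = root_docs + nested_docs

print(f"roots with nested docs: {roots_with_nested:,}")
print(f"nested docs:            {nested_docs:,}")
print(f"total docs in index:    {total_docs:,}")
```

So under these assumptions the nested documents outnumber the root documents by roughly eight to one.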

I imagine Elasticsearch has to first access all of the ROOT documents, and only then access the NESTED documents themselves. It cannot access the NESTED documents directly, the same way it runs a query on the root documents. Should I really consider that bad practice? (Maybe it's better to also index those nested documents in another index, for example as root documents, for direct querying and filtering.) After all, we are supposed to filter the root documents based on their related nested documents - and that's not exactly what is done here.

Whether the nested aggregation needs to go over all root documents (and then all of their nested documents) depends on the query you specify. The query dictates which root documents need to be accessed. In your request you have not specified a query, so ES uses a match_all query, which emits every root document as a match, and the nested aggregation then has to access all root documents. Usually, however, a query significantly reduces the number of documents that the nested aggregation (in fact the entire aggregation framework) needs to access.
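A minimal sketch of what that means for the request body, written as Python dictionaries. The original query is not shown in this thread, so the field names (`orders`, `orders.price`, `category`) and the helper `build_search_body` are hypothetical. When no query is passed, the helper spells out the match_all that would otherwise apply implicitly.

```python
import json


def build_search_body(root_query=None):
    """Build a search body with a nested aggregation over a hypothetical `orders` path.

    `root_query` selects the root documents; only the nested documents of
    matching roots are visited by the nested aggregation.
    """
    return {
        "size": 0,  # aggregation results only, no hits
        # With no explicit query, every root document matches.
        "query": root_query if root_query is not None else {"match_all": {}},
        "aggs": {
            "orders": {
                "nested": {"path": "orders"},
                "aggs": {"avg_price": {"avg": {"field": "orders.price"}}},
            }
        },
    }


# No query: the aggregation has to walk every root document.
unfiltered = build_search_body()

# A selective root-level query: the aggregation only sees matching roots.
filtered = build_search_body({"term": {"category": "books"}})

print(json.dumps(filtered, indent=2))
```

The only difference between the two bodies is the `query` section, which is the part that decides how many root documents the aggregation has to touch.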

Hey @mvg, thanks for the response.

Actually, I omitted the query on purpose in order to demonstrate the problem. Let's say we have an index that contains Products as root documents and Orders as nested documents, meaning that we attach orders underneath the relevant products.

Now, let's say we have many more products than orders: the index contains 100,000,000 product documents but only 250 order documents. We want to query for the latest orders placed. Our query would be something like this: Query.

Please note that I have to write the range filter in the query clause, to skip irrelevant root documents (performance boost). But I also have to write the range filter in the aggregation clause, since a product might have orders from other dates as well. That's why I believe that making such queries is expensive. It's probably better to index the orders in a separate, dedicated index as well, right? That would give direct access to the orders themselves through a much more lightweight index (in both size and doc count).
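The duplicated range filter described above could look roughly like the sketch below. The linked query is not reproduced in this thread, so the nested path `orders`, the field `orders.created_at`, and the helper names are assumptions made for illustration.

```python
import json


def recent_orders_filter(since):
    """Range filter on a hypothetical nested date field."""
    return {"range": {"orders.created_at": {"gte": since}}}


def latest_orders_body(since):
    """Search body that applies the same date range in two places."""
    return {
        "size": 0,
        # 1) Query clause: skip products that have no recent order at all.
        "query": {
            "nested": {"path": "orders", "query": recent_orders_filter(since)}
        },
        "aggs": {
            "orders": {
                "nested": {"path": "orders"},
                "aggs": {
                    # 2) Aggregation clause: a matching product may still
                    #    carry older orders, so filter again per order.
                    "recent": {"filter": recent_orders_filter(since)}
                },
            }
        },
    }


body = latest_orders_body("now-7d")
print(json.dumps(body, indent=2))
```

The nested query narrows down the root documents, while the filter aggregation narrows down the nested documents inside them, which is why the same range appears twice.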

Thinking about it from a relational standpoint, wouldn't you want the order document to be your root document, with the product documents nested inside it? After all, one order can contain multiple products.
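As a rough illustration of that reversed model, the sketch below assumes orders are the root documents and each order carries a nested `products` array. The field name `products.product_id` and the aggregation names are hypothetical; the body uses a nested aggregation with a terms sub-aggregation to count how often each product appears across orders.

```python
import json

# Hypothetical reversed model: each order is a root document with nested products.
most_bought_body = {
    "size": 0,  # aggregation results only
    "aggs": {
        "products": {
            "nested": {"path": "products"},
            "aggs": {
                # Bucket the nested product entries by product id and keep
                # the ten ids that occur most often.
                "most_bought": {
                    "terms": {"field": "products.product_id", "size": 10}
                }
            },
        }
    },
}

print(json.dumps(most_bought_body, indent=2))
```

With only a few hundred orders as root documents, such a request would touch far fewer documents than one that starts from 100,000,000 products.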

Interesting point @Rory. If a user wants to purchase 3 products, then 3 orders are created with the same order id. Actually, it's much easier to handle the data that way in my opinion. We aren't interested in which products have been bought together with other products - only in which products have been bought the most.

Anyway, even with the nesting reversed, I still have the same problem when I want to get the latest created products (instead of the latest created customers). I want to be able to get both.

@panda2004 in the model where the order document is the root element, you can still write a query that tells you which documents are the most bought or most sought after. If I am understanding your original question correctly, when you say you have 10,000,000 root documents, are you saying you have 10,000,000 distinct products?

Yes, that's correct. I have a lot of distinct products, and my order documents are far fewer in number.
