Accessing Nested Documents Performance

panda2004 · June 21, 2017, 9:47pm

Hello,

It is stated clearly that:

Because nested documents are indexed as separate documents, they can only be accessed within the scope of the nested query, the nested/reverse_nested aggregations, or nested inner hits.

This feels right since we do know that we MUST wrap our queries and aggregations with appropriate NESTED clauses (nested query, nested aggregation and so forth) before we can even access the nested documents.

My question is about the performance penalty of accessing nested documents. I know that nested documents are indexed in the same segment as their root document, which helps performance. But consider the following scenario:

Let's say we have 10,000,000 root documents in our index. Only a portion of the root documents (about 33%) include nested documents. When they do, the root document is related to about 25 nested documents on average.
Let's say our query looks something like this: Query
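To put rough numbers on this scenario, here is a back-of-the-envelope sketch. It only uses the figures stated above, and it assumes (per the quoted documentation) that every nested document is indexed as its own separate document next to its root.

```python
# Rough document count for the scenario described above.
# Assumption: each nested document is stored as a separate document,
# so the index holds root documents plus all nested documents.
root_docs = 10_000_000        # root documents in the index
percent_with_nested = 33      # about 33% of roots have nested documents
avg_nested_per_root = 25      # average nested documents per such root

roots_with_nested = root_docs * percent_with_nested // 100
nested_docs = roots_with_nested * avg_nested_per_root
total_docs = root_docs + nested_docs

print(f"roots with nested docs: {roots_with_nested:,}")
print(f"nested docs:            {nested_docs:,}")
print(f"total indexed docs:     {total_docs:,}")
```

So under these assumptions the nested documents outnumber the root documents by roughly eight to one.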

I imagine Elasticsearch has to first access all of the ROOT documents, and only then access the NESTED documents themselves. It cannot access the NESTED documents directly, the same way it runs a query on the root documents. Should I consider this a bad practice? (Maybe it's better to also index those nested documents in another index, as root documents, for direct querying and filtering.) After all, we are supposed to filter the root documents based on their related nested documents, and that's not exactly what is done here.

mvg · June 22, 2017, 7:49am

Whether the nested aggregation needs to go over all root documents (and then all of their nested documents) depends on the query that you have specified. The query basically dictates which root documents need to be accessed. In your request you have not specified a query, so ES will use a match_all query, which emits all root documents as matches, and the nested aggregation then needs to access all root documents. Usually, however, a query significantly reduces the number of documents that need to be accessed by the nested aggregation (actually by the entire aggregation framework).
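As an illustration of that point, here is a minimal sketch of two request bodies built as Python dicts. The nested path `orders` and the fields `orders.price` and `category` are hypothetical names chosen for the example, not taken from the original query. The first body spells out the match_all that applies when no query is given; the second restricts the root documents before the nested aggregation runs.

```python
import json

def build_search_body(root_query=None):
    """Build a search body with a nested aggregation.

    Field names (`orders`, `orders.price`) are hypothetical.
    With no root_query we write out match_all explicitly, which is
    what a request without a query amounts to.
    """
    query = root_query if root_query is not None else {"match_all": {}}
    return {
        "size": 0,  # only aggregation results are wanted, no hits
        "query": query,
        "aggs": {
            "orders": {
                # Switch the aggregation context to the nested documents
                "nested": {"path": "orders"},
                "aggs": {
                    "avg_price": {"avg": {"field": "orders.price"}}
                },
            }
        },
    }

# Every root document matches, so the aggregation visits all of them.
unfiltered = build_search_body()

# Only matching roots (hypothetical `category` field) reach the aggregation.
filtered = build_search_body({"term": {"category": "books"}})

print(json.dumps(filtered, indent=2))
```

The aggregation part is identical in both bodies; only the set of root documents handed to it differs.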

panda2004 · June 23, 2017, 3:02pm

Hey @mvg, thanks for the response.

Actually, I omitted the query on purpose in order to demonstrate the problem. Let's say we have an index that contains Products as root documents and Orders as nested documents, meaning that we attach orders underneath the relevant products.

Now, let's say we have many more products than orders: the index contains 100,000,000 product documents but only 250 order documents. We want to query for the latest orders placed. Our query would be something like this: Query.

Please note that I have to write the range filter in the query clause to skip irrelevant root documents (performance boost). But I also have to write the range filter in the aggregation clause, since a product might have orders from other dates as well. That's why I believe that making such queries is expensive. It's probably better to also index the orders in a separate dedicated index, right? That would give direct access to the orders themselves through a much more lightweight index (in both size and doc count).
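For readers who cannot see the linked query, the duplication described above might look roughly like the following sketch. The nested path `orders`, the date field `orders.created_at`, and the `now-7d/d` cutoff are assumptions made for illustration; the original query may differ.

```python
import json

def build_latest_orders_body(since="now-7d/d"):
    """Sketch of a 'latest orders' request on a products index.

    Hypothetical names: nested path `orders`, field `orders.created_at`.
    The same range condition appears twice on purpose.
    """
    date_range = {"range": {"orders.created_at": {"gte": since}}}
    return {
        "size": 0,
        # 1) Query clause: keep only products with at least one recent
        #    order, so irrelevant root documents are skipped.
        "query": {"nested": {"path": "orders", "query": date_range}},
        "aggs": {
            "orders": {
                "nested": {"path": "orders"},
                "aggs": {
                    # 2) Aggregation clause: a matching product may also
                    #    hold older orders, so filter the nested docs again.
                    "recent": {"filter": date_range}
                },
            }
        },
    }

body = build_latest_orders_body()
print(json.dumps(body, indent=2))
```

Building the range once and reusing it keeps the two clauses in sync, although both still have to be sent in the request.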

Rory · June 23, 2017, 3:10pm

Thinking about it from a relational standpoint, wouldn't you want the Order document to be your root document and the Product document to be nested inside the Order document? After all, one order can have multiple products in it.

panda2004 · June 23, 2017, 3:16pm

Interesting point @Rory. If a user wants to purchase 3 products, then 3 orders will be created with the same order id. Actually, it's much easier to handle the data that way, in my opinion. We aren't interested in which products have been bought together with other products, but in which products have been bought the most.

Anyway, even that way, with the nesting order reversed, I still have the problem when I want to get the latest created products (instead of latest created customers). I want the ability to get both of them.

Rory · June 23, 2017, 3:22pm

@panda2004 in the model where the order document is the root element, you can still write a query that tells you which products are the most bought or most sought after. If I am understanding your original question correctly, when you say you have 10,000,000 root documents, are you saying you have 10,000,000 distinct products?
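A "most bought products" request in that reversed model could be sketched like this. It assumes an orders index whose documents carry a nested `products` field with a keyword field `products.product_id`; both names are hypothetical.

```python
import json

def build_most_bought_body(top_n=10):
    """Sketch: top products by number of order lines.

    Assumes orders are root documents with products nested under the
    hypothetical path `products`, identified by `products.product_id`.
    """
    return {
        "size": 0,
        "aggs": {
            "products": {
                "nested": {"path": "products"},
                "aggs": {
                    # Count nested product documents per product id and
                    # return the top_n most frequent ones.
                    "most_bought": {
                        "terms": {
                            "field": "products.product_id",
                            "size": top_n,
                        }
                    }
                },
            }
        },
    }

most_bought = build_most_bought_body(top_n=5)
print(json.dumps(most_bought, indent=2))
```

Note that this counts order lines per product, not quantities; summing a quantity field would need an additional sub-aggregation.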

panda2004 · June 23, 2017, 3:28pm

Yes, that's correct. I have a lot of distinct products, and my order documents are far fewer in number.

