Filtering Nested Child Objects

I am trying to fine tune the query on a nested collection. The nested collection should match all the criteria and the size of the collection. I created a sample data set to the mimic my use case.

Index

    {
  "mappings": {
    "properties": {
      "family": {
        "type": "nested",
        "properties": {
          "name": {
            "type": "text"
          },
          "child": {
            "type": "nested",
            "properties": {
              "name": {
                "type": "text"
              },
              "age": {
                "type": "short"
              }
            }
          }
        }
      }
    }
  }
}

Data

            "hits": [
            {
                "_index": "family",
                "_type": "_doc",
                "_id": "2",
                "_score": 1.0,
                "_source": {
                    "family": {
                        "name": "Smith",
                        "child": [
                            {
                                "name": "Alpha",
                                "age": 10
                            },
                            {
                                "name": "Beta",
                                "age": 15
                            }
                        ]
                    }
                }
            },
            {
                "_index": "family",
                "_type": "_doc",
                "_id": "3",
                "_score": 1.0,
                "_source": {
                    "family": {
                        "name": "Williamson",
                        "child": [
                            {
                                "name": "Epsilon",
                                "age": 10
                            }
                        ]
                    }
                }
            },
            {
                "_index": "family",
                "_type": "_doc",
                "_id": "1",
                "_score": 1.0,
                "_source": {
                    "family": {
                        "name": "Johnson",
                        "child": [
                            {
                                "name": "Gamma",
                                "age": 5
                            },
                            {
                                "name": "Delta",
                                "age": 10
                            },
                            {
                                "name": "Zeta",
                                "age": 15
                            }
                        ]
                    }
                }
            }
        ]

Query

    {
    "query": {
        "nested": {
            "path": "family",
            "query": {
                "nested": {
                    "path": "family.child",
                    "query": {
                        "bool": {
                            "should": [
                                {
                                    "match": { "family.child.age": 10 }
                                },
                                {
                                    "match": { "family.child.age": 15 }
                                }
                            ]
                        }
                    }
                }
            }
        }
    }
}

The intent is to fetch the families that have at least two children with ages 10 and 15. The above query returns the Williamson family which is undesirable. I don't know how to express the criteria so that is excluded. (New to ES as well). Any help would be greatly appreciated.

Welcom @mgncl!

Nested docs definitely add a layer of complexity. Each time the nested query type is used, it changes the context (the scope of documents) on which the rest of the query applies. It's almost as if nested takes you to another index to complete the rest of the query, where only fields belonging to the nested type are accessible. I have found it helpful to think of/visualize this as a fan-out pattern. I'll try to convey my thought process using your posted query as an example:

query : context is all "parent" or "top-level" docs in the index without any nested fields (I see nested fields as invisible in this context). Nested fields are inaccessible from this context. In you specific example, this means that there are no fields on which to search in this context.

query->nested->family->query: all the family docs are now the scope of the query. Fields on the top-level doc (again, there are none in your example) are out of scope and unavailable. Note that other nested fields (child) and their properties are also still out of scope.

query->nested->family->query->nested->family.child->query: the scope of the query is now all family.child documents. At this point I can concretely demonstrate the fan-out I initially mentioned. The scope of the query is now limited to these documents:

[
    {
        "name": "Alpha",
        "age": 10,
        "_id": "2"
    },
    {
        "name": "Beta",
        "age": 15,
        "_id": "2"
    },
    {
        "name": "Epsilon",
        "age": 10,
        "_id": "3"
    },
    {
        "name": "Gamma",
        "age": 5,
        "_id": "1"
    },
    {
        "name": "Delta",
        "age": 10,
        "_id": "1"
    },
    {
        "name": "Zeta",
        "age": 15,
        "_id": "1"
    }
]

To help illustrate my point, I also added the top-level document ID to which each family.child document belongs. They're strictly for illustration.

Now the query that we're going to run against these docs is (adapted from your original query to remove the family.child prefix):

{
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "age": 10
                    }
                },
                {
                    "match": {
                        "age": 15
                    }
                }
            ]
        }
    }
}

With it broken down like this, and if we assume that the query will return the _id (i.e. the top-level document to which the nested doc belongs), I think it becomes more clear how we get 3 hits. It's because the query is only asking questions about the family.child documents.

You stated that the question you want your query to answer is: "which families have at least two children with ages 10 and 15?" To answer that, we need the query to ask questions about the family documents (or the top-level documents, since they contain the family field). Here are 2 queries which achieve that:

Ask about family docs i.e. "considering only family, which have a child of age 10 AND (must) a child of age 15?":

{
    "query": {
        "nested": {
            "path": "family",
            "query": {
                "bool": {
                    "must": [
                        {
                            "nested": {
                                "path": "family.child",
                                "query": {
                                    "term": {
                                        "family.child.age": 10
                                    }
                                }
                            }
                        },
                        {
                            "nested": {
                                "path": "family.child",
                                "query": {
                                    "term": {
                                        "family.child.age": 15
                                    }
                                }
                            }
                        }
                    ]
                }
            }
        }
    }
}

Ask about top-level docs i.e. "considering top-level documents, which have a family with a child of age 10 AND (must) a family with a child of age 15?" Note that in this case, the children could be from any family on the top-level doc and do not need to belong to the same family. Depending on whether or not you expect family to have multiple docs, you would need to choose the query which best fits the question you're asking about your data:

{
    "query": {
        "bool": {
            "must": [
                {
                    "nested": {
                        "path": "family",
                        "query": {
                            "nested": {
                                "path": "family.child",
                                "query": {
                                    "term": {
                                        "family.child.age": 10
                                    }
                                }
                            }
                        }
                    }
                },
                {
                    "nested": {
                        "path": "family",
                        "query": {
                            "nested": {
                                "path": "family.child",
                                "query": {
                                    "term": {
                                        "family.child.age": 15
                                    }
                                }
                            }
                        }
                    }
                }
            ]
        }
    }
}

That was very long-winded, my apologies! I hope it's more helpful than it is confusing. Please ask more follow-ups if I can help clarify anything :blush:

1 Like

I should also note that if you don't care about scoring, you could use filter instead of must in the query examples I provided.

Thank You @egalpin for the detailed explanation and response. This is really close to what I am trying to accomplish. There is an edge condition that I need help with. If the age criteria is the same (a hypothetical twin scenario) I am not getting the expected results. So for the same data set above if I were to query as below with the same age for both the children:

{
    "query": {
        "nested": {
            "path": "family",
            "query": {
                "bool": {
                    "must": [
                        {
                            "nested": {
                                "path": "family.child",
                                "query": {
                                    "term": {
                                        "family.child.age": 10
                                    }
                                }
                            }
                        },
                        {
                            "nested": {
                                "path": "family.child",
                                "query": {
                                    "term": {
                                        "family.child.age": 10
                                    }
                                }
                            }
                        }
                    ]
                }
            }
        }
    }
}

I am still getting the Williamson who has only one child of age 10. Is there a way to exclude this with a count check. I would rather have this expressed in the query as opposed to client side logic.

I don’t think that’s possible using a query on the current data structure. It might be possible copying values to the root in the index and using a form of scripted query but that’s pretty complex.

The one thing that’s worth pointing out with the existing design is the top level “family” need not be a nested structure. It does not contain an array of objects whose properties can otherwise be muddled - it only has a name property (but clearly the array of children benefits from the nested declaration).
This old slide deck from when nested docs were first introduced has some images that might help visualise why nested is needed sometimes and how it works.

I agree that, if at all possible, it might be worth revisiting the data modeling phase. I think it should also be mentioned that in the hypothetical "twin" example, the query is a tautology; the 2nd clause in the must field does nothing to alter the query. It's akin to if A && A.

I have a less than ideal approach using aggregations that doesn't feel flexible enough to be considered a solution. I would expect it to perform poorly if the data set is large, and one would need to be careful of memory consumption. But here is a query which produces the result you are after (note that include and min_doc_count must be changed to reflect the rest of the query). I don't feel that this is a good approach, but the result is as specified (Johnson family changed to have twins age 10):

{
    "query": {
        "nested": {
            "path": "family",
            "query": {
                "nested": {
                    "path": "family.child",
                    "query": {
                        "term": {
                            "family.child.age": 10
                        }
                    }
                }
            }
        }
    },
    "aggs": {
        "family": {
            "nested": {
                "path": "family"
            },
            "aggs": {
                "children": {
                    "nested": {
                        "path": "family.child"
                    },
                    "aggs": {
                        "age": {
                            "terms": {
                                "field": "family.child.age",
                                "include": [
                                    10
                                ]
                            },
                            "aggs": {
                                "parent_doc_ids": {
                                    "terms": {
                                        "field": "_id",
                                        "min_doc_count": 2,
                                        "size": 5
                                    },
                                    "aggs": {
                                        "parent_doc_hits": {
                                            "reverse_nested": {},
                                            "aggs": {
                                                "parent_doc": {
                                                    "top_hits": {}
                                                }
                                            }
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    },
    "size": 0
}

result:

{
    "took": 7,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 3,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "family": {
            "doc_count": 3,
            "children": {
                "doc_count": 6,
                "age": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": [
                        {
                            "key": 10,
                            "doc_count": 4,
                            "parent_doc_ids": {
                                "doc_count_error_upper_bound": 0,
                                "sum_other_doc_count": 0,
                                "buckets": [
                                    {
                                        "key": "3",
                                        "doc_count": 2,
                                        "parent_doc_hits": {
                                            "doc_count": 1,
                                            "parent_doc": {
                                                "hits": {
                                                    "total": {
                                                        "value": 1,
                                                        "relation": "eq"
                                                    },
                                                    "max_score": 1.0,
                                                    "hits": [
                                                        {
                                                            "_index": "foobar",
                                                            "_type": "_doc",
                                                            "_id": "3",
                                                            "_score": 1.0,
                                                            "_source": {
                                                                "family": {
                                                                    "name": "Johnson",
                                                                    "child": [
                                                                        {
                                                                            "name": "Gamma",
                                                                            "age": 5
                                                                        },
                                                                        {
                                                                            "name": "Delta",
                                                                            "age": 10
                                                                        },
                                                                        {
                                                                            "name": "Zeta",
                                                                            "age": 10
                                                                        }
                                                                    ]
                                                                }
                                                            }
                                                        }
                                                    ]
                                                }
                                            }
                                        }
                                    }
                                ]
                            }
                        }
                    ]
                }
            }
        }
    }
}

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.