Suppose I have an index with a parent type and a child type, and I want to search parents, returning parent documents along with their children.
What is currently the best way to fetch the children of the top hits when the query has no has_child clause, e.g. a match_all query?
1) wrap the top-level query inside a bool query as a filter or must clause
and add a should clause containing a dummy has_child query to fetch the inner hits
or perhaps:
2) execute the query normally, without any inner hits, then grab the ids of the top hits and execute a
subsequent query to fetch the children
I am worried that 1) will not perform optimally, because the dummy should clause may retrieve all children, not just the children of the top parent hits.
Approach 2), on the other hand, needs an extra round trip but should otherwise be cheap.
Approach 1) will not retrieve all child docs, as inner hits are only returned for the top matching parent hits that are actually returned. The overhead is that a has_child query is used, and this performs a join. However, if the other query is something like a range, match or term query that filters down the number of parent matches, then the cost of the join is acceptable.
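A minimal sketch of the request body for approach 1), written as plain Python that only builds the JSON and does not talk to a cluster. The function name, the `comment` child type and the `from`/`size` values are illustrative assumptions, not taken from the thread, and the exact has_child and inner_hits options may differ between Elasticsearch versions.

```python
import json


def build_parent_search_with_inner_hits(parent_query, child_type, size=100, from_=0):
    """Wrap the caller's parent query and add a dummy has_child clause.

    The has_child clause matches any child (match_all) and exists only to
    request inner hits. It sits in `should`, so parents without children
    still match through the `must` clause.
    """
    dummy_join = {
        "has_child": {
            "type": child_type,            # assumed child type name
            "query": {"match_all": {}},
            "inner_hits": {},              # default inner hits settings
        }
    }
    return {
        "from": from_,
        "size": size,
        "query": {
            "bool": {
                "must": [parent_query],
                # A matching should clause also contributes to the score,
                # so the ranking of parents may shift.
                "should": [dummy_join],
            }
        },
    }


# Hypothetical usage: a slice of 100 parents starting at hit 1000.
body = build_parent_search_with_inner_hits(
    {"match_all": {}}, "comment", size=100, from_=1000
)
print(json.dumps(body, indent=2))
```

The resulting dict can be sent as the search body with whatever client you already use.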
If you're okay with an extra round trip, then approach 2) is a good way to avoid using a has_child query.
Regarding 1):
In our case, the other query can be any query and can match (tens of) millions of parents,
but our app only needs the top 100 or so hits, including any children.
From your answer, I do not quite get whether this will perform as quickly as 2).
Do I understand correctly that the inner hits implementation is capable of fetching
the children for just the 100 top hits and no other hits?
How about when the search request has a from parameter, for example, requesting a slice of 100 hits
starting at the 1000th hit? Will inner hits just fetch the children for the 100 returned parents, or will it perform more work?
Inner hits are only fetched for the parent hits that are actually being returned. So if size is 10, from is 100 and there are 1M total hits, then inner hits will only be included for the 10 hits being returned.
I think the second approach is the best one here, since you only need the top matching children for each returned parent document. The extra round trip is likely to take less time than performing the join with the has_child query, which is overkill because you don't need this join to begin with (you don't query or aggregate on child fields).
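A rough sketch of the second request in approach 2), again as plain Python that only builds the JSON. It assumes an Elasticsearch version that supports the parent_id query (one clause per parent id). Other versions may need a different clause to select children by parent. The helper names, the `comment` child type, the `max_children` limit and the sample response are all hypothetical.

```python
def top_parent_ids(search_response):
    """Collect the ids of the parent hits returned by the first request."""
    return [hit["_id"] for hit in search_response["hits"]["hits"]]


def build_children_query(parent_ids, child_type, max_children=1000):
    """Build a follow-up query matching children of any of the given parents.

    Assumption: the parent_id query is available. One clause is emitted
    per parent id, and a child must match at least one of them.
    """
    clauses = [
        {"parent_id": {"type": child_type, "id": parent_id}}
        for parent_id in parent_ids
    ]
    return {
        "size": max_children,  # assumed upper bound on children to fetch
        "query": {"bool": {"should": clauses, "minimum_should_match": 1}},
    }


# Hypothetical, trimmed-down response from the first (parents only) search.
first_response = {"hits": {"hits": [{"_id": "p1"}, {"_id": "p2"}]}}

parent_ids = top_parent_ids(first_response)
children_body = build_children_query(parent_ids, "comment")
```

The first request stays whatever parent query the app already runs, with its usual from and size, so only the ids of the returned page are passed to the second request.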