I’m curious if anyone is able to comment on the handling of nested documents at query time. Are nested documents skipped over in the advance or nextDoc step, or must they also be considered if, for example, nested docs share some field names with the parent documents?
And somewhat relatedly, how does the ratio of root:nested documents affect performance?
Nested docs should not share field names with a parent doc unless you set include_in_parent or include_in_root, which we have been considering deprecating for a long time.
If you do use these parameters and end up with shared fields, a query that is not a nested query will still run only on top-level docs, because we internally rewrite it to add a filter that skips nested docs.
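To illustrate the idea of that extra filter, here is a minimal toy sketch in Python. It is not how Elasticsearch implements it: the in-memory DOCS list, the root flag, and both function names are made up for illustration. It only shows that a term shared by nested and parent docs matches parents alone once a root-only filter is applied.

```python
from typing import Dict, List

# Hypothetical in-memory "index". Nested docs are stored next to their parent,
# and each doc carries a made-up "root" flag marking top-level documents.
# Parents hold copies of their children's "color" values, as if the nested
# field had been copied up to the parent.
DOCS: List[Dict] = [
    {"doc_id": 0, "root": False, "color": ["red"]},         # nested, child of 2
    {"doc_id": 1, "root": False, "color": ["blue"]},        # nested, child of 2
    {"doc_id": 2, "root": True,  "color": ["red", "blue"]}, # parent
    {"doc_id": 3, "root": False, "color": ["green"]},       # nested, child of 4
    {"doc_id": 4, "root": True,  "color": ["green"]},       # parent
]


def match_term(field: str, value: str) -> List[int]:
    """Return ids of all docs (nested or root) whose field contains value."""
    return [d["doc_id"] for d in DOCS if value in d.get(field, [])]


def match_term_root_only(field: str, value: str) -> List[int]:
    """Same term match, intersected with a filter that keeps only root docs."""
    root_ids = {d["doc_id"] for d in DOCS if d["root"]}
    return [doc_id for doc_id in match_term(field, value) if doc_id in root_ids]


unfiltered = match_term("color", "red")
filtered = match_term_root_only("color", "red")
```

Without the filter the nested doc 0 matches alongside its parent; with the filter only the parent doc 2 remains.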
Thanks for your reply and the info @mayya. In hindsight, I think my real question would have been clearer if I had left out the part about shared fields.
To follow up: when searching an index with nested fields, can nested docs be efficiently skipped if there are no nested queries (i.e. the nested fields are not part of the query)? To what degree will performance be affected if the number of nested documents hugely outnumbers the number of parent documents?
I am not entirely clear on what you mean by "efficiently skipped", but since you mentioned advance and nextDoc, I think you are talking about a postings list. If you search with a query on a field "X", the postings list for this field contains only the documents that have this field. In your second scenario these are only parent documents, so the number of nested documents is not relevant and has no impact on the speed of this query.
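As a rough illustration of that point, here is a toy inverted index in Python. It is a simplified model and not Lucene code: postings are keyed by field name only (a real index keys them by term within a field), and the field names title and comments.text and the helper names are made up. It shows that the postings list for a parent-only field stays the same length no matter how many nested docs are indexed.

```python
from collections import defaultdict
from typing import Dict, List


def make_docs(n_parents: int, nested_per_parent: int) -> List[Dict[str, str]]:
    """Build a toy doc list: each parent is preceded by its nested docs."""
    docs: List[Dict[str, str]] = []
    for _ in range(n_parents):
        # Nested docs only carry the (hypothetical) nested field.
        docs.extend({"comments.text": "..."} for _ in range(nested_per_parent))
        # The parent doc only carries the (hypothetical) top-level field.
        docs.append({"title": "..."})
    return docs


def build_postings(docs: List[Dict[str, str]]) -> Dict[str, List[int]]:
    """Map each field name to the sorted list of doc ids that have it."""
    postings: Dict[str, List[int]] = defaultdict(list)
    for doc_id, fields in enumerate(docs):
        for field in fields:
            postings[field].append(doc_id)
    return postings


small = build_postings(make_docs(n_parents=3, nested_per_parent=2))
large = build_postings(make_docs(n_parents=3, nested_per_parent=2000))

# A query on "title" iterates the same number of postings in both indices.
title_postings_small = len(small["title"])
title_postings_large = len(large["title"])
nested_postings_large = len(large["comments.text"])
```

In this model a query on title walks three postings in both cases, even though the second index holds 6000 nested docs.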