As explained in the linked documentation, nested documents are indexed as separate documents that reside in the segment next to the parent document. This lets them be identified as nested documents of that parent, so conceptually the index looks like this:
# First nested object
{
"comments.name": [ john, smith ],
"comments.comment": [ article, great ],
"comments.age": [ 28 ],
"comments.stars": [ 4 ],
"comments.date": [ 2014-09-01 ]
}
# Second nested object
{
"comments.name": [ alice, white ],
"comments.comment": [ like, more, please, this ],
"comments.age": [ 31 ],
"comments.stars": [ 5 ],
"comments.date": [ 2014-10-22 ]
}
# The root or parent document
{
"title": [ eggs, nest ],
"body": [ making, money, work, your ],
"tags": [ cash, shares ]
}
So in the index there are in fact 3 physical documents but only 1 logical document.
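To make the "3 physical, 1 logical" point concrete, here is a rough sketch in plain Java. It is a toy model, not Elasticsearch or Lucene code: the class NestedFlattenSketch, its methods and the sample values are all made up for illustration, plain maps stand in for Lucene documents, and no text analysis is done. It assumes the nested objects are written first and the root document last, as in the listing above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: plain maps stand in for Lucene documents.
class NestedFlattenSketch {

    // Splits one logical document into physical documents:
    // one per nested object, followed by the root document.
    static List<Map<String, Object>> flatten(Map<String, Object> source, String nestedField) {
        List<Map<String, Object>> physical = new ArrayList<>();
        Map<String, Object> root = new LinkedHashMap<>();

        for (Map.Entry<String, Object> e : source.entrySet()) {
            if (e.getKey().equals(nestedField)) {
                // Each nested object becomes its own document with prefixed field names.
                @SuppressWarnings("unchecked")
                List<Map<String, Object>> nested = (List<Map<String, Object>>) e.getValue();
                for (Map<String, Object> obj : nested) {
                    Map<String, Object> doc = new LinkedHashMap<>();
                    for (Map.Entry<String, Object> f : obj.entrySet()) {
                        doc.put(nestedField + "." + f.getKey(), f.getValue());
                    }
                    physical.add(doc);
                }
            } else {
                // Everything else stays on the root document.
                root.put(e.getKey(), e.getValue());
            }
        }
        physical.add(root); // the root comes after its nested documents
        return physical;
    }

    // Hypothetical sample input: one post with two comments.
    static Map<String, Object> example() {
        Map<String, Object> c1 = new LinkedHashMap<>();
        c1.put("name", "John Smith");
        c1.put("stars", 4);
        Map<String, Object> c2 = new LinkedHashMap<>();
        c2.put("name", "Alice White");
        c2.put("stars", 5);

        Map<String, Object> source = new LinkedHashMap<>();
        source.put("title", "Nest eggs");
        source.put("comments", Arrays.asList(c1, c2));
        return source;
    }

    public static void main(String[] args) {
        for (Map<String, Object> doc : flatten(example(), "comments")) {
            System.out.println(doc);
        }
    }
}
```

Running main prints three maps for the single input document: two with comments.* fields and one with the title.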
If you still have questions, maybe you could rephrase your question to be more specific about what you want to understand?
The _source of the root document contains all of the source. The nested documents do not have a _source field at all.
There is not a field that contains the doc id for the root document. The nested documents are physically located in the Lucene segment next to the root document. The locality of the nested documents alongside the root document is what makes nested operations faster than parent/child operations but also why you have to reindex the root document and all nested documents when you want to change only a single nested document.
The fact that internally we split the original document into separate Lucene documents is an implementation detail and shouldn't be exposed to the user. The nested objects are logically part of the root document, so it's simpler to understand if the nested objects are only considered in the context of their root document. Also, since the _source is not stored for the nested documents, you would not be able to retrieve fields even if you were able to search them independently.
See the Parent-Child Relationship chapter of Elasticsearch: The Definitive Guide [2.x] for more details. In parent-child, the parent and child documents are indexed completely independently, and the children contain a _parent field which holds the id of their parent document. The only locality constraint on parent-child documents is that they must reside on the same shard (whereas nested documents and their root document have to be sequentially next to each other in the segment). This means that you can update a child document without re-indexing the parent or the other children, but it comes at a cost at query time since the locality of the documents is not as tight.
Internally we produce a bitset at query time of where the root documents are located. At the moment this is done by marking the nested documents with a special _type value. You should not rely on this, however: it is a deep internal implementation detail, not part of the user-facing side of the feature, so it could change at any time.
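A rough way to picture the bitset, assuming a segment laid out as [nested, nested, ROOT, nested, ROOT]. This is a toy model using java.util.BitSet, not the actual internal code. The class ParentBitSetSketch and its methods are hypothetical names, and it assumes the nested documents sit directly before their root.

```java
import java.util.BitSet;

// Toy model of a segment: positions are document ids within the segment.
class ParentBitSetSketch {

    // Bit i is set when the document at position i is a root document.
    static BitSet rootBits(boolean[] isRoot) {
        BitSet bits = new BitSet(isRoot.length);
        for (int i = 0; i < isRoot.length; i++) {
            if (isRoot[i]) {
                bits.set(i);
            }
        }
        return bits;
    }

    // The root of a document is the first root at or after its position,
    // so no stored "root id" field is needed.
    static int rootOf(BitSet roots, int docId) {
        return roots.nextSetBit(docId);
    }

    // The nested documents of a root start right after the previous root
    // (or at position 0 when there is no previous root).
    static int firstNestedOf(BitSet roots, int rootId) {
        return roots.previousSetBit(rootId - 1) + 1;
    }

    public static void main(String[] args) {
        BitSet roots = rootBits(new boolean[] {false, false, true, false, true});
        System.out.println("root of doc 1: " + rootOf(roots, 1));
        System.out.println("root of doc 3: " + rootOf(roots, 3));
        System.out.println("first nested doc of root 4: " + firstNestedOf(roots, 4));
    }
}
```

The point of the sketch is that locality alone is enough to find the root of a nested document, which is why no field holding the root's id is needed.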
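import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.MatchAllQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

// "client" is an existing org.elasticsearch.client.Client instance.
MatchAllQueryBuilder maq = QueryBuilders.matchAllQuery();
SearchResponse sResponse = client.prepareSearch("my_index")
        .setSearchType(SearchType.SCAN)
        .setQuery(maq)
        .setScroll(new TimeValue(1)) // scroll keep-alive, in milliseconds
        .setSize(10)
        .execute()
        .actionGet();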
In our case, I scroll/scan "my_index" and only root docs are returned.
So my question is: how do the sub docs get filtered out?
Right, so as I said before, the fact that the nested objects are indexed as separate documents is an implementation detail and is hidden from the user. Unless you use a nested query or aggregation type, the nested documents will be ignored by the query.
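The behaviour can be mimicked with a small plain-Java simulation. This is only a sketch of the idea, not how Elasticsearch implements it. NestedVisibilitySketch, Doc, matchAll and nestedNameQuery are made-up names, and the segment contents are invented. A regular query skips anything flagged as nested; a nested-style query matches against the nested documents but still returns the root that follows them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy simulation: a segment is a list of docs, nested docs precede their root.
class NestedVisibilitySketch {

    static final class Doc {
        final boolean nested; // true for a nested (hidden) document
        final String label;   // commenter name for nested docs, id for root docs

        Doc(boolean nested, String label) {
            this.nested = nested;
            this.label = label;
        }
    }

    // Hypothetical segment with two blocks.
    static List<Doc> exampleSegment() {
        return Arrays.asList(
                new Doc(true, "john"), new Doc(true, "alice"), new Doc(false, "post-1"),
                new Doc(true, "bob"), new Doc(false, "post-2"));
    }

    // A regular query (here: match all) never returns nested documents.
    static List<String> matchAll(List<Doc> segment) {
        List<String> hits = new ArrayList<>();
        for (Doc doc : segment) {
            if (!doc.nested) {
                hits.add(doc.label);
            }
        }
        return hits;
    }

    // A nested-style query matches nested docs but reports the root of their block.
    static List<String> nestedNameQuery(List<Doc> segment, String name) {
        List<String> hits = new ArrayList<>();
        boolean blockMatched = false;
        for (Doc doc : segment) {
            if (doc.nested) {
                blockMatched = blockMatched || doc.label.equals(name);
            } else {
                if (blockMatched) {
                    hits.add(doc.label);
                }
                blockMatched = false; // a root closes the current block
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(matchAll(exampleSegment()));               // [post-1, post-2]
        System.out.println(nestedNameQuery(exampleSegment(), "bob")); // [post-2]
    }
}
```

Either way, the hits are always root documents, which matches what the scan above returns.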