What is the layout of nested documents inside Lucene?

  1. With nested objects, all entities live within the same document,
  2. but when I index the sample from the reference docs below, there are 3 docs.

So my question is: what is the layout of nested documents inside Lucene?

    PUT /my_index
    {
      "mappings": {
        "blogpost": {
          "properties": {
            "comments": {
              "type": "nested",
              "properties": {
                "name":    { "type": "string" },
                "comment": { "type": "string" },
                "age":     { "type": "short" },
                "stars":   { "type": "short" },
                "date":    { "type": "date" }
              }
            }
          }
        }
      }
    }

PUT /my_index/blogpost/1
{
  "title": "Nest eggs",
  "body": "Making your money work...",
  "tags": [ "cash", "shares" ],
  "comments": [
    {
      "name": "John Smith",
      "comment": "Great article",
      "age": 28,
      "stars": 4,
      "date": "2014-09-01"
    },
    {
      "name": "Alice White",
      "comment": "More like this please",
      "age": 31,
      "stars": 5,
      "date": "2014-10-22"
    }
  ]
}

I'd recommend reading this section of the definitive guide book: https://www.elastic.co/guide/en/elasticsearch/guide/2.x/nested-objects.html

Near the bottom it explains:

By mapping the comments field as type nested instead of type object, each nested object is indexed as a hidden separate document, ...

Exactly, I read it, and that is what prompted my question: how are the documents laid out inside Lucene?
What you shared doesn't answer my question.

As explained at the link, the nested documents are indexed as separate documents that reside in the segment next to the parent document. This way they can be identified as nested documents of that parent. Conceptually, it looks like the following in the index:

# First nested object
{ 
  "comments.name":    [ john, smith ],
  "comments.comment": [ article, great ],
  "comments.age":     [ 28 ],
  "comments.stars":   [ 4 ],
  "comments.date":    [ 2014-09-01 ]
}
# Second nested object
{ 
  "comments.name":    [ alice, white ],
  "comments.comment": [ like, more, please, this ],
  "comments.age":     [ 31 ],
  "comments.stars":   [ 5 ],
  "comments.date":    [ 2014-10-22 ]
}
# The root or parent document
{ 
  "title":            [ eggs, nest ],
  "body":             [ making, money, work, your ],
  "tags":             [ cash, shares ]
}

So in the index there are in fact 3 physical documents but only 1 logical document.
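To make the "3 physical documents, 1 logical document" idea concrete, here is a minimal sketch in plain Java (no Elasticsearch or Lucene classes involved). The class name `NestedBlockSketch` and the helper `toBlock` are hypothetical and only mimic the conceptual layout shown above: one flat document per nested object, followed by the root document. Placing the root last in the block is my understanding of how Lucene's block indexing orders things; treat it as an assumption rather than something this thread confirms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NestedBlockSketch {

    // Hypothetical helper: splits one logical document into a block of flat
    // documents. Each object under nestedPath becomes its own document with
    // prefixed field names; everything else stays in the root document.
    public static List<Map<String, Object>> toBlock(Map<String, Object> source, String nestedPath) {
        List<Map<String, Object>> block = new ArrayList<>();
        Map<String, Object> root = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : source.entrySet()) {
            if (entry.getKey().equals(nestedPath)) {
                @SuppressWarnings("unchecked")
                List<Map<String, Object>> nestedObjects = (List<Map<String, Object>>) entry.getValue();
                for (Map<String, Object> nested : nestedObjects) {
                    Map<String, Object> doc = new LinkedHashMap<>();
                    for (Map.Entry<String, Object> field : nested.entrySet()) {
                        doc.put(nestedPath + "." + field.getKey(), field.getValue());
                    }
                    block.add(doc);
                }
            } else {
                root.put(entry.getKey(), entry.getValue());
            }
        }
        // Assumption: the root document closes the block, after its nested documents.
        block.add(root);
        return block;
    }

    public static void main(String[] args) {
        Map<String, Object> john = new LinkedHashMap<>();
        john.put("name", "John Smith");
        john.put("stars", 4);
        Map<String, Object> alice = new LinkedHashMap<>();
        alice.put("name", "Alice White");
        alice.put("stars", 5);

        Map<String, Object> post = new LinkedHashMap<>();
        post.put("title", "Nest eggs");
        post.put("comments", Arrays.asList(john, alice));

        // Prints one line per physical document in the block.
        for (Map<String, Object> doc : toBlock(post, "comments")) {
            System.out.println(doc);
        }
    }
}
```

The point of the sketch is only the shape of the result: two nested objects plus the root give a block of three flat documents that are always written together.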

If you still have questions, maybe you could rephrase your question to be more specific about what you want to understand?

  1. If I enable the _source field, what is contained in the root doc's _source: the whole document or only the root part? And what is contained in a nested doc's _source?
  2. Which field in a nested doc contains the root doc's id?
  3. Since there are 3 docs, why not let nested docs be searched separately?
  4. What is the difference inside Lucene between parent-child and nested docs?

The _source of the root document contains all of the source. The nested documents do not have a _source field at all.

There is not a field that contains the doc id for the root document. The nested documents are physically located in the Lucene segment next to the root document. The locality of the nested documents alongside the root document is what makes nested operations faster than parent/child operations but also why you have to reindex the root document and all nested documents when you want to change only a single nested document.

The fact that internally we split the original document into separate Lucene documents is an implementation detail and shouldn't be exposed to the user. The nested objects are logically part of the root document, so it's simpler to understand if the nested objects are only considered in the context of their root document. Also, since the _source is not stored for the nested documents, you would not be able to retrieve fields if you were able to search them independently.

See https://www.elastic.co/guide/en/elasticsearch/guide/2.x/parent-child.html for more details, but in parent-child the parent and child documents are indexed completely independently, and the children contain a _parent field which holds the id of their parent document. The only locality constraint on parent-child documents is that they must reside on the same shard (whereas nested documents and their root document have to be sequentially next to each other in the segment). This means that you are able to update a child document without re-indexing the other children and the parent document, but it has a cost at query time since the locality of the documents is not as tight.


The nested documents are physically located in the Lucene segment next to the root document
where is the size of the sub doc stored?

Internally we produce a bitset at query time of where the root documents are located. This is done at the moment by marking the nested documents with a special _type value. You should not rely on this, however, as it is a deep internal implementation detail and not part of the user-facing side of the feature, so it could change at any time.
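As a rough illustration of why no per-document "size" or parent id needs to be stored, here is a sketch using only `java.util.BitSet`. It assumes the block layout described earlier in this thread (nested documents sit directly before their root document in the segment); the class `ParentBitSetSketch` and its methods are hypothetical names for this example and are not Elasticsearch or Lucene APIs.

```java
import java.util.BitSet;

public class ParentBitSetSketch {

    // Root document of a nested doc: the first root at or after its position.
    public static int rootOf(BitSet roots, int docId) {
        return roots.nextSetBit(docId);
    }

    // First nested doc of a root: one past the previous root, or 0 if none.
    // previousSetBit(-1) returns -1, so a root at position 0 is handled too.
    public static int firstChildOf(BitSet roots, int rootDoc) {
        return roots.previousSetBit(rootDoc - 1) + 1;
    }

    // Number of nested docs belonging to a root, derived purely from positions.
    public static int childCount(BitSet roots, int rootDoc) {
        return rootDoc - firstChildOf(roots, rootDoc);
    }

    public static void main(String[] args) {
        // Toy segment: docs 0 and 1 are nested docs of root 2; doc 3 is a nested doc of root 4.
        BitSet roots = new BitSet();
        roots.set(2);
        roots.set(4);

        System.out.println("root of doc 0: " + rootOf(roots, 0));
        System.out.println("root of doc 3: " + rootOf(roots, 3));
        System.out.println("children of root 2: " + childCount(roots, 2));
        System.out.println("children of root 4: " + childCount(roots, 4));
    }
}
```

Under that assumption, both "which root does this nested doc belong to" and "how many nested docs does this root have" fall out of the positions of the root bits, which is also why the whole block has to be rewritten when a single nested object changes.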


So in this case, the special _type is _comments, right?

And one more question: how do you handle scroll/scan requests? Do you filter out nested docs based on type?

It would be __comments in this case

I don't understand the question here. Could you explain in a bit more detail what you mean?


MatchAllQueryBuilder maq = QueryBuilders.matchAllQuery();
SearchResponse sResponse = client.prepareSearch("my_index")
    .setSearchType(SearchType.SCAN)
    .setQuery(maq)
    .setScroll(new TimeValue(1))   // scroll keep-alive; TimeValue(long) takes milliseconds
    .setSize(10)
    .execute()
    .actionGet();

In our case, I scroll/scan "my_index" and only root docs are returned.
So my question is: how are the sub docs filtered out?

Right, so as I said before, the fact that the nested objects are indexed as separate documents is an implementation detail and is hidden from the user. Unless you use a nested query or aggregation type, the nested documents will be ignored by the query.
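For intuition only, here is a toy model in plain Java of what "ignored by the query" could look like. It assumes, based on the `__comments` value mentioned above, that nested documents can be recognised by a `_type` starting with `__`; that prefix rule and the class `NonNestedFilterSketch` are assumptions made for this sketch, not a description of the actual internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NonNestedFilterSketch {

    // Returns the ids of documents a plain match-all would see in this toy model:
    // every document whose _type does not carry the (assumed) "__" nested marker.
    public static List<Integer> matchAllRootDocs(List<String> typeByDocId) {
        List<Integer> hits = new ArrayList<>();
        for (int docId = 0; docId < typeByDocId.size(); docId++) {
            if (!typeByDocId.get(docId).startsWith("__")) {
                hits.add(docId);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Toy segment for the blogpost example: two nested comment docs, then the root.
        List<String> types = Arrays.asList("__comments", "__comments", "blogpost");
        System.out.println(matchAllRootDocs(types));
    }
}
```

In this model a scroll/scan with a match-all query only ever walks the root documents, which matches what you observed; the nested documents only come into play when a nested query or aggregation asks for them.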


got it, thanks.
this is very helpful.