What is the layout of nested documents inside Lucene?

makeyang · September 7, 2016, 6:48am

With nested objects, all entities are supposed to live within the same document, but when I use the sample from the reference docs below, it ends up as 3 docs.
So my question is: what is the layout of nested documents inside Lucene?
PUT /my_index
{
  "mappings": {
    "blogpost": {
      "properties": {
        "comments": {
          "type": "nested",
          "properties": {
            "name":    { "type": "string" },
            "comment": { "type": "string" },
            "age":     { "type": "short" },
            "stars":   { "type": "short" },
            "date":    { "type": "date" }
          }
        }
      }
    }
  }
}

PUT /my_index/blogpost/1
{
  "title": "Nest eggs",
  "body": "Making your money work...",
  "tags": [ "cash", "shares" ],
  "comments": [
    {
      "name": "John Smith",
      "comment": "Great article",
      "age": 28,
      "stars": 4,
      "date": "2014-09-01"
    },
    {
      "name": "Alice White",
      "comment": "More like this please",
      "age": 31,
      "stars": 5,
      "date": "2014-10-22"
    }
  ]
}

colings86 · September 7, 2016, 8:04am

I'd recommend reading this section of the definitive guide book: https://www.elastic.co/guide/en/elasticsearch/guide/2.x/nested-objects.html

Near the bottom it explains:

By mapping the comments field as type nested instead of type object, each nested object is indexed as a hidden separate document, ...

makeyang · September 7, 2016, 8:21am

I read exactly that section, and it is what raised my question about how documents are laid out inside Lucene.
What you shared doesn't answer my question.

colings86 · September 7, 2016, 8:34am

As explained on the link, the nested documents are indexed as separate documents that reside in the segment next to the parent document. This way they can be identified as nested documents of that parent so it conceptually looks like the following in the index:

# First nested object
{ 
  "comments.name":    [ john, smith ],
  "comments.comment": [ article, great ],
  "comments.age":     [ 28 ],
  "comments.stars":   [ 4 ],
  "comments.date":    [ 2014-09-01 ]
}
# Second nested object
{ 
  "comments.name":    [ alice, white ],
  "comments.comment": [ like, more, please, this ],
  "comments.age":     [ 31 ],
  "comments.stars":   [ 5 ],
  "comments.date":    [ 2014-10-22 ]
}
# The root or parent document
{ 
  "title":            [ eggs, nest ],
  "body":             [ making, money, work, your ],
  "tags":             [ cash, shares ]
}

So in the index there are in fact 3 physical documents but only 1 logical document.
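
The split described above can be mimicked with a small Python sketch. It is only a conceptual model, not Elasticsearch or Lucene code: the function name `to_lucene_block`, the `_marker` key, and the dict-based "documents" are hypothetical, and analysis (tokenizing, lowercasing) is skipped.

```python
def to_lucene_block(doc, nested_field):
    """Split one logical document into a block of physical documents.

    Conceptual model only: nested objects come first and the root
    document comes last, mirroring the order shown in the thread.
    """
    block = []
    for obj in doc.get(nested_field, []):
        # Each nested object becomes its own physical document, with
        # field names prefixed by the nested path (e.g. "comments.name").
        nested_doc = {f"{nested_field}.{key}": value for key, value in obj.items()}
        nested_doc["_marker"] = "nested"  # hypothetical marker, for illustration
        block.append(nested_doc)

    # The root document keeps every field except the nested one.
    root = {key: value for key, value in doc.items() if key != nested_field}
    root["_marker"] = "root"
    block.append(root)
    return block


blogpost = {
    "title": "Nest eggs",
    "body": "Making your money work...",
    "tags": ["cash", "shares"],
    "comments": [
        {"name": "John Smith", "comment": "Great article",
         "age": 28, "stars": 4, "date": "2014-09-01"},
        {"name": "Alice White", "comment": "More like this please",
         "age": 31, "stars": 5, "date": "2014-10-22"},
    ],
}

block = to_lucene_block(blogpost, "comments")
print(len(block))  # 3 physical documents for 1 logical document
```

The root is deliberately appended last so that the nested documents sit directly in front of it, which is the adjacency the answer refers to.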

If you still have questions, maybe you could rephrase your question to be more specific about what you want to understand?

makeyang · September 7, 2016, 9:13am

If I enable the _source field, what is contained in the root doc's _source: the whole document or only the root part? What is contained in a nested doc's _source?
Which field in a nested doc contains the root doc's id?
Since there are 3 docs, why not let nested docs be searched separately?
What is the difference inside Lucene between parent-child and nested docs?

colings86 · September 7, 2016, 9:29am

_source of the root document contains all of the source. The nested documents do not have a _source field at all.

There is not a field that contains the doc id for the root document. The nested documents are physically located in the Lucene segment next to the root document. The locality of the nested documents alongside the root document is what makes nested operations faster than parent/child operations but also why you have to reindex the root document and all nested documents when you want to change only a single nested document.

The fact that internally we split the original document into separate Lucene documents is an implementation detail and shouldn't be exposed to the user. The nested objects are logically part of the root document, so it's simpler to understand if the nested objects are only considered in the context of their root document. Also, since the _source is not stored for the nested documents, you would not be able to retrieve fields if you were able to search them independently.

See the "Parent-Child Relationship" chapter of Elasticsearch: The Definitive Guide [2.x] for more details, but in parent-child the parent and child documents are indexed completely independently, and the children contain a _parent field which holds the id of their parent document. The only locality constraint on parent-child documents is that they must reside on the same shard (whereas nested documents and their root document have to be sequentially next to each other in the segment). This means that you are able to update a child document without re-indexing the other children and the parent document, but it has a cost at query time since the locality of the documents is not as tight.
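
The update trade-off between the two models can be illustrated with a rough Python sketch. Everything here is a simplified assumption for illustration: the `kind` key, the ids `c1`/`c2`, and both helper functions are hypothetical, and real indexing involves far more than copying dicts.

```python
# Nested: one contiguous block per logical document, root last.
nested_block = [
    {"kind": "nested", "comments.stars": 4},
    {"kind": "nested", "comments.stars": 5},
    {"kind": "root", "title": "Nest eggs"},
]


def update_nested(block, index, changes):
    """Changing one nested object means rewriting the whole block."""
    new_block = [dict(doc) for doc in block]  # every document is rewritten
    new_block[index].update(changes)
    return new_block, len(new_block)          # (new block, docs reindexed)


# Parent-child: independent documents; each child carries its parent's id.
parent_child_docs = {
    "1":  {"title": "Nest eggs"},
    "c1": {"_parent": "1", "stars": 4},
    "c2": {"_parent": "1", "stars": 5},
}


def update_child(docs, child_id, changes):
    """Only the one child document is reindexed."""
    docs[child_id] = {**docs[child_id], **changes}
    return 1                                  # docs reindexed


new_block, nested_cost = update_nested(nested_block, 0, {"comments.stars": 3})
child_cost = update_child(parent_child_docs, "c1", {"stars": 3})
print(nested_cost, child_cost)  # 3 1
```

The sketch only models the write side; the query-time cost of parent-child (looking up parents through the _parent value instead of relying on adjacency) is not modelled.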

makeyang · September 7, 2016, 9:59am

"The nested documents are physically located in the Lucene segment next to the root document"
Where is the size of the sub doc stored?

colings86 · September 7, 2016, 10:39am

Internally we produce a bitset at query time of where the root documents are located. This is done at the moment by marking the nested document with a special _type value. You should not rely on this however as this is a deep internal implementation detail and not part of the user facing bit of the feature so could change at any time.
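
One way to read this answer is that no per-document count needs to be stored: if the nested documents sit directly in front of their root, the set of root positions is enough to recover each block. The Python sketch below illustrates that idea under stated assumptions. The boolean `is_nested` flags stand in for the special _type marker, and the function names are hypothetical; this is not the actual Elasticsearch or Lucene code.

```python
# One flag per physical document in a segment, in index order.
# True = nested document, False = root document (stand-in for the marker).
is_nested = [True, True, False,   # block 1: two nested docs, then their root
             True, False]         # block 2: one nested doc, then its root


def root_bitset(flags):
    """Bit i is set when physical document i is a root document."""
    return [not flag for flag in flags]


def nested_docs_of(root_doc_id, bits):
    """Return the ids of the nested docs that belong to root_doc_id.

    Walk backwards until the previous root (or the segment start) is hit;
    everything in between belongs to this root.
    """
    if not bits[root_doc_id]:
        raise ValueError("not a root document")
    start = root_doc_id
    while start > 0 and not bits[start - 1]:
        start -= 1
    return list(range(start, root_doc_id))


bits = root_bitset(is_nested)
print(nested_docs_of(2, bits))  # [0, 1]
print(nested_docs_of(4, bits))  # [3]
```

The number of nested documents for a root then falls out as the length of the returned list, which only works because the block is contiguous.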

makeyang · September 8, 2016, 7:15am

so in this case, the special _type is _comments, right?

makeyang · September 8, 2016, 7:30am

And one more question: how do you handle scroll/scan requests? Do you filter out nested docs based on type?

colings86 · September 8, 2016, 8:11am

It would be __comments in this case

I don't understand the question here. Could you explain in a bit more detail what you mean?

makeyang · September 8, 2016, 8:16am

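import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.MatchAllQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

// Scan/scroll over every document in "my_index" ("client" is an existing Client instance).
MatchAllQueryBuilder maq = QueryBuilders.matchAllQuery();
SearchResponse sResponse = client.prepareSearch("my_index")
        .setSearchType(SearchType.SCAN)
        .setQuery(maq)
        .setScroll(new TimeValue(1))   // scroll keep-alive; TimeValue(long) is in milliseconds
        .setSize(10)
        .execute()
        .actionGet();
In our case, I scroll/scan "my_index" and only root docs are returned.
So my question is: how do the sub docs get filtered out?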

colings86 · September 8, 2016, 8:22am

Right, so as I said before, the fact that the nested objects are indexed as separate documents is an implementation detail and is hidden from the user. Unless you use a nested query or aggregation type, the nested documents will be ignored by the query.
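
As a rough illustration of this behaviour, the Python sketch below models an ordinary query that never returns nested documents and a nested-style query that inspects them but still returns the root. It assumes the `__comments` marker mentioned earlier in the thread; the functions `match_all` and `nested_match` are hypothetical, and the real filtering logic inside Elasticsearch is not shown in this thread and may differ.

```python
segment = [
    {"_type": "__comments", "comments.name": "John Smith"},
    {"_type": "__comments", "comments.name": "Alice White"},
    {"_type": "blogpost", "title": "Nest eggs"},
]


def match_all(docs):
    """Ordinary query: documents marked as nested never reach the hit list."""
    return [doc for doc in docs if not doc["_type"].startswith("__")]


def nested_match(docs, path, predicate):
    """Nested-style query: test the nested docs, but return their root."""
    hits = []
    matched = False
    for doc in docs:
        if doc["_type"] == "__" + path:
            matched = matched or predicate(doc)
        elif not doc["_type"].startswith("__"):
            # A root document closes the current block.
            if matched:
                hits.append(doc)
            matched = False
    return hits


print(len(match_all(segment)))  # 1
alice_hits = nested_match(
    segment, "comments", lambda doc: doc["comments.name"] == "Alice White"
)
print(alice_hits[0]["title"])  # Nest eggs
```

In both cases the caller only ever sees root documents, which matches what the scan/scroll request above returned.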

makeyang · September 8, 2016, 8:35am

got it, thanks.
this is very helpful.
