Tuning nested documents

I have been A/B testing two indices that are similar except that one of them
contains nested documents. The nested documents are not used in the query
and are ignored for the time being on the client side.

The index with the nested documents is about 20% bigger, and each parent
document contains just over one nested document on average, although the
number of nested documents can reach into the hundreds. None of the fields
(around 30) in the nested documents are indexed, so the increase in index
size should come purely from the larger source. So far, nothing out of the
ordinary.

The only non-default setting is include_in_all set to false.
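For concreteness, a minimal mapping sketch along the lines described above. The field and type names here are hypothetical, not from the actual index; the relevant parts are the nested type, the unindexed fields, and include_in_all disabled:

```json
{
  "parent_doc": {
    "properties": {
      "title": {"type": "string"},
      "children": {
        "type": "nested",
        "include_in_all": false,
        "properties": {
          "note": {"type": "string", "index": "no"}
        }
      }
    }
  }
}
```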

Searching the index with nested documents is slower even though the nested
documents are not indexed. I expected off-heap cache usage to increase
because of the larger source documents, but it is the field cache usage
that has doubled, causing GCs along the way.

My questions are: why has the field cache usage increased, and are the
nested documents still loaded into the Lucene field cache even though they
are not used?

Using 0.20.0.RC1 + mmapfs + mlockall

Cheers,

Ivan

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

The cluster has become unstable after the addition of the unused nested
documents. The field cache usage has more than doubled. Are there any
non-index settings for nested documents?

--
Ivan


Hi Ivan,

The field data uses an ordinal array (mapping each Lucene docId to a
pointer into the values array). If nested objects are enabled, one ES
document spans more than one Lucene document. The length of this ordinal
array is based on maxDoc (the total number of Lucene docs per segment,
including documents marked as deleted). Even if none of the fields inside
your nested objects are indexed, the hidden Lucene documents are still
added to the index, and all your other field data entries use more memory
because maxDoc is higher.
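This effect can be illustrated with a toy back-of-the-envelope sketch. This is not actual Elasticsearch or Lucene code, and the document counts are hypothetical; it only models the point that ordinal arrays are sized by maxDoc, which counts hidden nested documents:

```python
# Toy illustration (NOT actual Elasticsearch/Lucene internals): per segment,
# field data keeps one ordinal entry per Lucene document, sized by maxDoc.
# Hidden nested documents count toward maxDoc, so every field's ordinal
# array grows even though the nested fields themselves are not indexed.

def ordinal_array_entries(parent_docs, avg_nested_per_parent):
    """Approximate maxDoc (= ordinal entries per field) in a segment."""
    nested_docs = int(parent_docs * avg_nested_per_parent)
    return parent_docs + nested_docs

flat = ordinal_array_entries(1_000_000, 0)      # index without nested docs
nested = ordinal_array_entries(1_000_000, 1.1)  # ~1 nested doc per parent

# With roughly one hidden doc per parent, every field's ordinal array
# (and with it the field cache) roughly doubles.
print(flat, nested)
```

Under these assumed numbers, maxDoc goes from 1,000,000 to 2,100,000, which would roughly double the per-field ordinal arrays, consistent with the doubled field cache usage observed.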

By the way, are you faceting or sorting on fields that have more than one
token per document? That can lead to very high memory usage in versions
before 0.90.x, and the nested documents can make it even worse.

Martijn


--
Kind regards,

Martijn van Groningen


Thanks Martijn. It now somewhat makes sense that doubling the number of
documents accounts for a doubling of the field cache. Even with no indexed
fields, the nested documents are still allocated a spot in the cache if
the parent is loaded. I was hoping this only occurred if a JoinQuery was
used.

The data definitely has high cardinality on the faceted fields, and I know
that 0.90+ has a better memory pattern. The application is also heavily
tied to Lucene, so an upgrade to 0.90+ requires a Lucene upgrade, which is
a long process. My goal was to wait for the 1.0 release before upgrading,
but that is another issue. :)

Cheers,

Ivan


Ivan, I don't understand why upgrading Lucene is such a big issue. Can you
elaborate on that a bit? The index format is compatible, though. Are you
using a plugin or something?

simon


The rest of the codebase uses Lucene indexes that are not controlled by
Elasticsearch; Elasticsearch only hosts the main index, which is fairly
large.

Upgrading Lucene is a chore when you have numerous custom analyzers,
similarities, etc. I actually started the port yesterday after putting it
off for what seems like quite a long time. TokenStreamComponents,
FieldTypes, moved packages, and so on. I wouldn't be surprised if my
application has more Lucene-dependent code than Elasticsearch does!

Cheers,

Ivan
