Parent ID cache unexpectedly large and growing

Howdy,

We've got an Elasticsearch 0.90.2 index where we're trying to use a
grandparent -> parent -> child mapping. The ratio of documents is roughly
1:10:100 and the total number is around 25,000,000. We've found that the
parent ID cache is unexpectedly large for this use case. The _id field is a
32-character string, so naively we'd expect the ID cache to be about 2 GB.
Right now the statistics are showing it closer to 13 GB and growing. The
entire index itself is only 22 GB, so having the ID cache be this big is
odd. Even stranger, nothing is actually running has_child/has_parent
queries against the child mappings yet, so I wouldn't expect the cache to
be populated at all.
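For reference, here's the back-of-envelope math behind that 2 GB figure, as
a Python sketch (assuming ~2 bytes per character for in-heap strings and
ignoring per-object overhead, which in practice adds more):

```python
# Naive id cache estimate: every _id held once, 2 bytes per char.
DOCS = 25_000_000       # total documents across all three levels
ID_CHARS = 32           # length of each _id string
BYTES_PER_CHAR = 2      # Java strings store chars as UTF-16

naive_bytes = DOCS * ID_CHARS * BYTES_PER_CHAR
naive_gb = naive_bytes / 1024 ** 3
print(f"naive id cache estimate: {naive_gb:.1f} GB")  # about 1.5 GB
```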

A lot of the child documents are short-lived, which may be having an
impact, since IDs are not reused.

There's a "reuse" flag in the SimpleIdCache module source that seems like
it might help, but it defaults to off. Does anyone know exactly what that
flag does?

Also, the ID cache doesn't seem to be clearable. Calling
_cache/clear?id_cache=true through the REST API doesn't appear to actually
clear the cache, at least according to the statistics. Then again, I'm not
sure I trust the ID cache size stats, as currently they indicate the ID
cache is taking nearly the entire 16 GB heap allocation.

Has anyone else experienced this? Is there something I'm misreading here? I
know there's a performance improvement for has_child queries in 0.90.3 but
we're mostly expecting to use has_parent.

Cheers,
Dan

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

We've got an Elasticsearch 0.90.2 index where we're trying to use a
grandparent -> parent -> child mapping. The ratio of documents is roughly
1:10:100 and the total number is around 25,000,000. We've found that the
parent ID cache is unexpectedly large for this use case. The _id field is a
32-character string, so naively we'd expect the ID cache to be about 2 GB.
Right now the statistics are showing it closer to 13 GB and growing. The
entire index itself is only 22 GB, so having the ID cache be this big is
odd. Even stranger, nothing is actually running has_child/has_parent
queries against the child mappings yet, so I wouldn't expect the cache to
be populated at all.

The id_cache shouldn't be populated if you're not using the parent/child
queries/filters (has_child, has_parent, top_children), so that is
unexpected. Maybe a parent/child query is used in a warmer?

There's a "reuse" flag in the SimpleIdCache module source that seems like
it might help, but it defaults to off. Does anyone know exactly what that
flag does?

It tries to reuse parent IDs between segments and shards on the same node.
The reason it is turned off by default is that it slows down the loading
of the id_cache, depending on how many shards and segments per shard there
are on a node. You can try enabling it and see if the loading time is
acceptable for you. I do expect it to reduce memory usage, because in
almost all cases the same IDs appear across segments and shards.
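To illustrate what reuse buys, here's a hypothetical Python sketch (not the
actual SimpleIdCache code): with reuse on, the loader interns ID values in
a shared pool, so segments holding the same IDs point at one copy instead
of each holding their own.

```python
def make_ids(nums):
    # Build fresh 32-char ID strings, as each segment load would.
    return [f"{n:032d}" for n in nums]

def load_segments(segments, reuse=False):
    pool = {}      # shared intern pool, only consulted when reuse is on
    loaded = []
    for seg in segments:
        ids = []
        for pid in seg:
            if reuse:
                pid = pool.setdefault(pid, pid)  # reuse first-seen copy
            ids.append(pid)
        loaded.append(ids)
    return loaded

def distinct(loaded):
    # Count distinct string *objects* actually held in memory.
    return len({id(x) for seg in loaded for x in seg})

# Three segments that mostly contain the same three parent IDs.
segments = [make_ids([1, 2]), make_ids([1, 3]), make_ids([1, 2, 3])]
print(distinct(load_segments(segments)))              # 7 copies held
print(distinct(load_segments(segments, reuse=True)))  # 3 copies held
```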

Also, the ID cache doesn't seem to be clearable. Calling
_cache/clear?id_cache=true through the REST API doesn't appear to actually
clear the cache, at least according to the statistics. Then again, I'm not
sure I trust the ID cache size stats, as currently they indicate the ID
cache is taking nearly the entire 16 GB heap allocation.

Odd. I'd expect the id_cache size in the node stats API to be zero after
the id_cache is cleared via the clear cache API.

Has anyone else experienced this? Is there something I'm misreading here?
I know there's a performance improvement for has_child queries in 0.90.3
but we're mostly expecting to use has_parent.

The size of the id_cache depends on the amount of parent documents in your
index: https://github.com/elasticsearch/elasticsearch/issues/3516


On Thursday, August 22, 2013 9:05:04 PM UTC+10, Martijn v Groningen wrote:

The id_cache shouldn't be populated if you're not using the parent/child
queries/filters (has_child, has_parent, top_children), so that is
unexpected. Maybe a parent/child query is used in a warmer?

This is correct. It wasn't a warmer, but it turns out we did have one
system doing parent/child queries that we weren't aware of.

There's a "reuse" flag in the SimpleIdCache module source that seems like
it might help, but it defaults to off. Does anyone know exactly what that
flag does?

It tries to reuse parent IDs between segments and shards on the same node.
The reason it is turned off by default is that it slows down the loading
of the id_cache, depending on how many shards and segments per shard there
are on a node. You can try enabling it and see if the loading time is
acceptable for you. I do expect it to reduce memory usage, because in
almost all cases the same IDs appear across segments and shards.

Okay, we'll try this out and see if it helps.

Odd. I'd expect the id_cache size in the node stats API to be zero after
the id_cache is cleared via the clear cache API.

I'll try to dig into this a bit further, though I think I've found the
bug. It looks like the onRemoval listener is never called,
see https://github.com/elasticsearch/elasticsearch/pull/3561

The size of the id_cache depends on the amount of parent documents in
your index: https://github.com/elasticsearch/elasticsearch/issues/3516

Okay, so this is the bit that's confusing me. With the proportion of
parent to child documents that we have in this index, there are about
55,000 parent documents at the first level and then about 250,000 at the
second level. The third level has about 23,000,000 documents, but if I
understand the way the ID cache works, these shouldn't affect the size of
the cache. The set of parent documents doesn't change frequently, so I
would not expect the apparently unbounded growth in the ID cache that
we're seeing. I also wouldn't expect it to be using as much heap as the
stats are claiming.
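To put numbers on it, the naive expectation from parents alone (same ~2
bytes per character assumption as before, ignoring overhead) comes out
hundreds of times smaller than what the stats report:

```python
# Expected id cache size from parent documents only, vs. observed.
LEVEL1 = 55_000          # first-level parent docs
LEVEL2 = 250_000         # second-level parent docs
ID_BYTES = 32 * 2        # 32-char _id at ~2 bytes per char

parents = LEVEL1 + LEVEL2
expected_mb = parents * ID_BYTES / 1024 ** 2
factor = 13 * 1024 ** 3 / (parents * ID_BYTES)   # observed 13 GB
print(f"expected ~{expected_mb:.0f} MB, observed 13 GB (~{factor:.0f}x)")
```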

Given the numbers, does it seem reasonable that the ID cache should be as
large as we're seeing? Is the parent ID cache solely affected by the number
of parents?

Cheers,
Dan


The id cache is per segment; for each segment we load all the parent
values into memory. If you see that you have many segments (via the
_segments API) then you can try running an optimize to reduce the number
of segments. An optimize is an expensive operation (it just forcefully
merges segments into bigger segments) and you need to check whether you
can run it in your environment. An optimize may take a long time to
complete (from a few minutes to several hours) and requires additional
disk space while it runs. Also, if you have very regular writes to your
index, the result of the optimize can easily be lost.
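As a rough illustration of why segment count matters, here's a hypothetical
calculation assuming the worst case with reuse off, where every segment
ends up holding its own copy of every parent ID value (the segment counts
here are made up):

```python
PARENTS = 305_000        # first- plus second-level parents in this index
ID_BYTES = 32 * 2        # 32-char _id at ~2 bytes per char

# With reuse off, memory for parent ID values scales with segment count.
for segments in (1, 10, 50):
    mb = PARENTS * ID_BYTES * segments / 1024 ** 2
    print(f"{segments:3d} segments -> {mb:7.1f} MB of parent ID values")
```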

The child docs do have an impact on the id cache. There is a child_docid
-> parent_id lookup inside the id cache. Its size depends on the number of
child docs and the number of unique parent ID values the child docs point
to (this is also per segment).
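A hypothetical sketch of that layout (not the real SimpleIdCache data
structures): per segment, the cache holds the parent ID values plus a
child-docid -> parent-ID map, so memory grows with the number of child
docs as well as the number of parents.

```python
def build_segment_cache(parent_ids, child_parent_pairs):
    """parent_ids: _id values of the parent docs in this segment.
    child_parent_pairs: (child docid, parent _id) for the child docs."""
    id_pool = {pid: pid for pid in parent_ids}   # one copy per ID value
    child_to_parent = {
        docid: id_pool.setdefault(pid, pid)      # point at the shared copy
        for docid, pid in child_parent_pairs
    }
    return id_pool, child_to_parent

pool, lookup = build_segment_cache(
    ["p1", "p2"],
    [(10, "p1"), (11, "p1"), (12, "p2")],
)
print(len(pool), len(lookup))   # 2 parent values, 3 child entries
```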


--
Kind regards,

Martijn van Groningen
