We store Marvel-style timeseries data in Elasticsearch and make very heavy
use of aggregations (all of our queries are effectively aggregations).
We've been experimenting with the shard query cache and have a question.
Is there a reason the shard query cache defaults to such a small share of
the JVM heap? 1% seems awfully low, unless ES assumes most people aren't
making heavy use of aggregations. Is there any harm in significantly
boosting this from 1% to, say, 15% of heap? Most of our machines have 30GB
of RAM with heap at 50% of that (15GB), so the query cache is 150MB by
default. We'd like to experiment with growing that to at least 10% of
heap, which would put roughly 1.5GB in use for this cache.
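For concreteness, this is the knob we believe we're looking at, going by
the 1.x docs (happy to be corrected if we've misread them). In
elasticsearch.yml:

    # Node-level size of the shard query cache (defaults to 1% of heap).
    indices.cache.query.size: 10%

And the per-index switch, since the cache is disabled by default:

    # Dynamic index setting; can also be set at index creation.
    index.cache.query.enable: true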
It depends on how likely you are to run the same aggregation again. Note
that this cache is fully invalidated at every refresh (meaning either every
second by default, or every time you update/add/remove documents if you
perform fewer than one operation per second). So this cache will only be
used if you are likely to run the exact same request twice or more within a
short period of time.
We assumed that this situation is not common, so a small cache would be
enough for the maybe 4 or 5 requests that would be run again and again. You
can increase it if you think it will be helpful in your case, although I
would advise caution: the memory might be better spent on, e.g., the
filesystem cache.
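If you do experiment, it is worth checking first whether the cache is
getting any hits at all. Something along these lines should show the
counters (endpoint and field names as of the 1.x node stats API, so
double-check against your version):

    curl -s 'localhost:9200/_nodes/stats/indices/query_cache?pretty'

If hit_count stays near zero while evictions grow, entries are being
invalidated or evicted faster than they are reused, and a bigger cache
would mostly waste heap.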
Hi, I am a little confused by your response. Are you saying that the
query/filter caches are invalidated across all data in a shard every time
the refresh interval ticks over?
I was under the impression that all fielddata and caching-related
operations were performed at the Lucene index segment level, and that the
caches would only be invalidated for a given segment if that segment had
changed since the last refresh. Since most data is stored in large segments
that don't take fresh writes and seldom merge, this would mean that most
caches stay valid for long periods of time, even if the shard is under
constant indexing load. Am I mistaken?
Thanks,
James
Sorry for the confusion:
- The query cache caches entire requests per index, and is completely
invalidated across all data every time the refresh interval ticks over AND
there have been changes since the last refresh.
- The filter cache caches matching documents per segment; it is invalidated
per segment only when a segment goes away (typically because it has been
merged into a larger segment), which is infrequent for large segments.
- The fielddata cache caches the document->value mapping per segment and
has the same invalidation rules as the filter cache.
Your understanding is right for the fielddata and filter caches, but not
for the query cache.
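If you want to compare all three caches side by side, the indices stats API
exposes each of them (metric names as of 1.x, so verify against your
version):

    curl -s 'localhost:9200/_stats/query_cache,filter_cache,fielddata?pretty'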
Thanks for the responses, guys. Our ES setup has hot, warm, and cold nodes.
The hot nodes are the only ones receiving realtime updates, and their
indices have fairly low refresh intervals, which makes the query cache
pretty useless for that data.
Indices on the warm nodes, on the other hand, are only updated nightly, and
indices on the cold nodes are similar. Assuming we do have repetitive
aggregation queries, it sounds like bumping up the query cache on the
warm/cold tiers could significantly speed up our more expensive
aggregations.
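Here is a sketch of what we plan to try on a warm index (the index name is
made up, and the setting name is taken from the 1.x docs):

    # Enable the shard query cache on an existing, read-mostly index.
    curl -XPUT 'localhost:9200/logs-2015.04/_settings' -d '{
      "index.cache.query.enable": true
    }'

If we've read the docs correctly, in 1.x only search_type=count requests
are cached, which suits us since our aggregations don't need hits back
anyway.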
Hi Adrien, thanks very much for the clarification. I am always trying to
learn more about how Elasticsearch works, and this was very helpful.
James