Optimizing filter bitsets

We're storing Kibana-style time series documents across three indexes on a
10-node cluster (i2.xlarges). These indexes hold between 20M and 500M docs
at peak, and we use bool filters extensively while querying. Query volumes
are pretty low (maybe around 100 searches/sec at peak) versus index ops
(4K/sec).

Recently, I've been noticing a lot of churn in our filter cache and I'm
wondering if our bitsets are optimized or maybe if we're just hitting
memory limits because of too many documents.

I understand that the result of the bool is the bitset that's cached as
opposed to the individual term filters themselves. This had me concerned
that for certain complex bool filters (where we have >10 or so term filters
inside a "must" clause), we're creating bitsets that have far too narrow an
application (basically the one query they were used for).

If we have certain terms (say customer ID) which update fairly
infrequently (only with new docs) and others that update fairly frequently
(say time-based fields), is there a way to optimize our bool queries to
create reusable bitsets for the infrequent term filters while also having
the benefit of caching the result of the entire bool filter?

Is it as simple as adding _cache: true to the terms filters that are fairly
static?

Anything else we can look at to help understand how to optimize our filter
cache?

Mike

--
Mike Sukmanowsky
Aspiring Digital Carpenter

e: mike.sukmanowsky@gmail.com

facebook http://facebook.com/mike.sukmanowsky | twitter
http://twitter.com/msukmanowsky | LinkedIn
http://www.linkedin.com/profile/view?id=10897143 | github
https://github.com/msukmanowsky

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAOH6cu5Xz8i9iV80onEN2R2yXA%3Dddk7uXqWCBYTo7X1dfOCvYw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

On Mon, Jan 26, 2015 at 11:05 PM, Mike Sukmanowsky <
mike.sukmanowsky@gmail.com> wrote:

> I understand that the result of the bool is the bitset that's cached as
> opposed to the individual term filters themselves. This had me concerned
> that for certain complex bool filters (where we have >10 or so term filters
> inside a "must" clause), we're creating bitsets that have far too narrow an
> application (basically the one query they were used for).

Actually with today's defaults, you would create and cache one bit set for
each clause of the bool filter, and then the bool filter would just merge
the bit sets. The resulting bit set from a bool filter is not cached by
default. FYI we have plans to change this in 2.0 (elastic/elasticsearch
pull request #8573, "Filter cache: add a `_cache: auto` option and make it
the default") by keeping statistics about filter usage and only caching
filters that are both costly and reused. So even a compound bool filter
could be cached if it keeps being reused with the same clauses.
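
To make the default behaviour described above concrete, here is a toy sketch in Python. It is not Elasticsearch code: plain integers stand in for bit sets, and the documents, field names (`customer_id`, `status`) and helper functions are all made up for illustration. It only shows the shape of the idea: each term clause gets its own cached bit set, and the bool "must" merges them without caching the merged result.

```python
# Toy model only: Python ints stand in for bit sets.
# Not Elasticsearch code; documents, fields and helpers are hypothetical.

docs = [
    {"customer_id": "acme", "status": "ok"},
    {"customer_id": "acme", "status": "error"},
    {"customer_id": "globex", "status": "ok"},
    {"customer_id": "acme", "status": "ok"},
]

clause_cache = {}  # (field, value) -> bit set, one entry per term clause


def term_bitset(field, value):
    """Return the cached bit set for one term clause, building it on a miss."""
    key = (field, value)
    if key not in clause_cache:
        bits = 0
        for doc_id, doc in enumerate(docs):
            if doc.get(field) == value:
                bits |= 1 << doc_id
        clause_cache[key] = bits
    return clause_cache[key]


def bool_must(clauses):
    """AND the per-clause bit sets; the merged result itself is not cached."""
    result = (1 << len(docs)) - 1  # start with every doc matching
    for field, value in clauses:
        result &= term_bitset(field, value)
    return result


def matching_ids(bits):
    """List the doc ids whose bit is set."""
    return [i for i in range(len(docs)) if (bits >> i) & 1]


first = bool_must([("customer_id", "acme"), ("status", "ok")])
second = bool_must([("customer_id", "acme"), ("status", "error")])

print(matching_ids(first))   # [0, 3]
print(matching_ids(second))  # [1]
print(len(clause_cache))     # 3: the customer_id clause was built once and reused
```

In this model the two bool filters differ, but the `customer_id` clause they share is only computed once, so a bool with many term clauses does not by itself produce a single narrowly applicable cache entry.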

> If we have certain terms (say customer ID) which update fairly
> infrequently (only with new docs) and others that update fairly frequently
> (say time-based fields), is there a way to optimize our bool queries to
> create reusable bitsets for the infrequent term filters while also having
> the benefit of caching the result of the entire bool filter?

Since the filter cache works per segment, making different choices based on
how frequently some fields are being updated would not help.
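
A rough way to picture the per-segment point, again as a toy Python model rather than anything Elasticsearch actually runs (segment names, the `customer_id` field and the helpers are hypothetical): cache entries are keyed by segment, new documents land in a new segment, and entries for existing segments stay valid no matter which fields the new documents carry.

```python
# Toy model only, not Elasticsearch code; all names are hypothetical.

segments = {
    "seg_0": ({"customer_id": "acme"}, {"customer_id": "globex"}),
}
cache = {}   # (segment name, field, value) -> tuple of matching doc offsets
builds = []  # records which segments had to be scanned to fill the cache


def cached_term(segment, field, value):
    """Return matching offsets for one segment, scanning it only on a miss."""
    key = (segment, field, value)
    if key not in cache:
        builds.append(segment)
        cache[key] = tuple(
            i for i, doc in enumerate(segments[segment]) if doc.get(field) == value
        )
    return cache[key]


def search(field, value):
    """Evaluate the term filter segment by segment."""
    return {seg: cached_term(seg, field, value) for seg in segments}


before = search("customer_id", "acme")          # scans seg_0
segments["seg_1"] = ({"customer_id": "acme"},)  # new docs arrive as a new segment
after = search("customer_id", "acme")           # scans only seg_1, reuses seg_0

print(builds)  # ['seg_0', 'seg_1']
```

The sketch ignores segment merges, which replace existing segments with new ones and so discard the cache entries that belonged to the old segments.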

> Is it as simple as adding _cache: true to the terms filters that are
> fairly static?

This might be a good idea. And since caching filters has a cost, it might
also be worth setting _cache: false on filters that you know are unlikely
to be reused. When filters are cached, Elasticsearch unfortunately needs to
evaluate all docs from the index against this filter, which can be slow. So
not caching filters which are not reused can make things faster.
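
As a minimal sketch of what that could look like, the Python below builds a 1.x-style `filtered` query body with a per-clause `_cache` hint on each filter. The field names (`customer_id`, `status`, `@timestamp`), the values and the `build_search_body` helper are assumptions for illustration, not taken from the cluster discussed here.

```python
import json


def build_search_body(customer_id, status, time_from):
    """Build a filtered-query body with explicit per-clause _cache hints.

    Field names are hypothetical; adjust them to the actual mapping.
    """
    return {
        "query": {
            "filtered": {
                "filter": {
                    "bool": {
                        "must": [
                            # Likely to be reused across many searches: cache.
                            {"term": {"customer_id": customer_id, "_cache": True}},
                            {"term": {"status": status, "_cache": True}},
                            # Time window that changes per search: do not cache.
                            {
                                "range": {
                                    "@timestamp": {"gte": time_from},
                                    "_cache": False,
                                }
                            },
                        ]
                    }
                }
            }
        }
    }


body = build_search_body("acme", "ok", "2015-01-26T00:00:00Z")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a search request. Whether each hint pays off depends on how often that exact clause is actually repeated.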

--
Adrien Grand
