Filter cache - based on full set or result of previous filters?

Hi,

I have a search request that uses a couple of filters. I'm using bool+must,
and I'm trying to optimize the request as much as possible.

  • Some filters are used by all users of my platform, but aren't very
    selective.
  • Some filters are very specific to individual users, and are highly
    selective.

I've read that I should use the most selective filters first, to ease the
work performed by the subsequent filters.

However, one thing that's not 100% clear is how the filter cache bitmaps
work. Does each one store the result of a filter evaluated across the entire
dataset, or the result of applying the filter to the previous filter's
output?

Example. Querying the paid invoices of an account:

{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "status": "paid" } },
            { "term": { "account": "123456" } }
          ]
        }
      }
    }
  }
}

(All users use the "status" filter, but it's not very selective.)

Following the advice of using the most highly selective filter first, I
should place the "account" filter first. On the other hand I want to be
sure that all users will re-use the cached output of the "status" filter.

Question: will the "status" filter cache contain all paid invoices of all
accounts, no matter in which order I use the filters?

The above code is just an example. I'm trying to optimize requests against a
dataset of 1B+ documents, so please take this into consideration.

Thanks,
Lasse

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/7ea47711-38c1-4bc7-bc7c-41d85fb5cf81%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

The status filter cache will indeed contain all entries, regardless of the
order of the filters. Technically, the cache is built per segment rather than
across all documents, but this should be transparent.
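
To make the per-segment point concrete, here is a toy simulation in Python. It is only a sketch of the idea, not Elasticsearch's actual implementation: the SegmentFilterCache class, its methods, and the sample documents are all made up for illustration. Each term filter's bitset is computed against every document in a segment, so the cached "status" bitset is the same whichever filter comes first.

```python
from typing import Dict, List, Tuple

Doc = Dict[str, str]
Segment = List[Doc]


class SegmentFilterCache:
    """Hypothetical toy cache: one bitset per (segment, field, value)."""

    def __init__(self, segments: List[Segment]) -> None:
        self.segments = segments
        self.bitsets: Dict[Tuple[int, str, str], List[bool]] = {}

    def term_bitset(self, seg_id: int, field: str, value: str) -> List[bool]:
        key = (seg_id, field, value)
        if key not in self.bitsets:
            # Evaluated against every document in the segment,
            # never against the output of an earlier filter.
            self.bitsets[key] = [
                doc.get(field) == value for doc in self.segments[seg_id]
            ]
        return self.bitsets[key]

    def bool_must(self, terms: List[Tuple[str, str]]) -> List[List[int]]:
        """AND the cached bitsets per segment; return matching positions."""
        hits = []
        for seg_id, segment in enumerate(self.segments):
            combined = [True] * len(segment)
            for field, value in terms:
                bits = self.term_bitset(seg_id, field, value)
                combined = [a and b for a, b in zip(combined, bits)]
            hits.append([i for i, keep in enumerate(combined) if keep])
        return hits


# Two tiny made-up segments of invoices.
segments = [
    [{"status": "paid", "account": "123456"},
     {"status": "paid", "account": "999"}],
    [{"status": "unpaid", "account": "123456"},
     {"status": "paid", "account": "123456"}],
]
cache = SegmentFilterCache(segments)
account_first = cache.bool_must([("account", "123456"), ("status", "paid")])
status_first = cache.bool_must([("status", "paid"), ("account", "123456")])
```

Both orderings return the same hits, and the cached "status" bitset for the first segment still covers the invoice of account "999", which is what lets other accounts reuse it.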

Caching is enabled by default for the term filters, but disabled for the
bool filter. You can enable it if you think users will be reusing the
filter.
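
As a rough sketch of what enabling it could look like, the snippet below builds the request body from the question as a Python dict. It assumes the "_cache" flag of the Elasticsearch 1.x filter DSL, so check it against the version you run. The paid_invoices_query helper and its cache_bool parameter are hypothetical names used only for this example.

```python
import json


def paid_invoices_query(account_id: str, cache_bool: bool = False) -> dict:
    """Build the filtered query from the question (hypothetical helper)."""
    bool_filter = {
        "must": [
            # Most selective filter first, as the asker planned.
            {"term": {"account": account_id}},
            {"term": {"status": "paid"}},
        ]
    }
    if cache_bool:
        # Assumption: "_cache" is the 1.x-era flag for caching a filter's
        # result; the bool filter is not cached unless asked to.
        bool_filter["_cache"] = True
    return {"query": {"filtered": {"filter": {"bool": bool_filter}}}}


body = paid_invoices_query("123456", cache_bool=True)
print(json.dumps(body, indent=2))
```

Whether caching the combined bool filter pays off depends on reuse: a cached entry for one account's paid invoices only helps later requests for that same account.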

--
Ivan

Thanks for the explanation.

A follow-up question. If the filter is cached for a specific value, say
{ "term": { "status": "paid" } }, will this somehow magically speed up the
query when searching for "status": "unpaid"? I'm not talking about a "not"
operation, but simply replacing the value with something else (like when
creating an index in an RDBMS).

Term filters already use Lucene's term dictionary as an index. Almost
everything Elasticsearch does uses it. In fact, term queries are so fast
that Elasticsearch switched them from being cached by default to uncached
by default (I don't have the version number handy). For the most part I
wouldn't worry about them. When I have a request that is slow, I tend to
remove parts of it until I find the slow bit. That works well if the speed
issue is CPU-related, which most seem to be.
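
To illustrate why a different value is just as cheap to look up, here is a heavily simplified sketch of a term dictionary as a plain Python dict mapping each term to its postings list. Lucene's real structures are far more compact and sophisticated; the build_postings helper and the sample documents are invented for this example.

```python
from collections import defaultdict
from typing import Dict, List


def build_postings(docs: List[Dict[str, str]], field: str) -> Dict[str, List[int]]:
    """Map each distinct value of `field` to the ids of the docs holding it."""
    postings = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        postings[doc[field]].append(doc_id)
    return dict(postings)


docs = [{"status": "paid"}, {"status": "unpaid"}, {"status": "paid"}]
status_index = build_postings(docs, "status")

# Each value is a direct lookup; "unpaid" does not depend on whether
# "paid" was ever queried or cached.
paid = status_index.get("paid", [])
unpaid = status_index.get("unpaid", [])
```

So a cached bitset for "paid" does nothing for "unpaid", but the "unpaid" lookup goes through the same index and is fast on its own.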

Nik