I have a search request that uses a couple of filters. I'm using bool+must,
and I'm trying to optimize the request as much as possible.
Some filters are used by all users of my platform, but aren't very
selective.
Some filters are very specific to individual users, and are highly
selective.
I've read that I should use the most selective filters first, to ease the
work performed by the subsequent filters.
However, one thing that's not 100% clear is how the filter cache bitmaps
work. Do they store the result of a filter applied to the entire dataset,
or the result of applying it to the previous filter's output?
Example: querying the paid invoices of an account:
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "status": "paid" } },
            { "term": { "account": "123456" } }
          ]
        }
      }
    }
  }
}
(All users use the "status" filter, but it's not very selective.)
Following the advice of using the most highly selective filter first, I
should place the "account" filter first. On the other hand I want to be
sure that all users will re-use the cached output of the "status" filter.
Question: will the "status" filter cache contain all paid invoices of all
accounts, no matter in which order I use the filters?
The above code is just an example. I'm trying to optimize the request for a
dataset of 1B+ documents, so please take this into consideration.
The status filter cache will indeed contain all entries, that is, all paid
invoices of all accounts, regardless of the order of the filters.
Technically, the cache is per segment rather than across all documents, but
this should be transparent.
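To make that concrete, here is a toy model in Python. It is only a sketch of the idea, not how Elasticsearch or Lucene is implemented: the class ToyFilterCache, the function bool_must and the sample documents are all made up for illustration. Each term filter gets one cached set of doc ids per segment, computed over the whole segment, and bool+must just intersects those sets.

```python
from typing import Dict, List, Set, Tuple

Segment = List[dict]  # hypothetical: a segment is just a list of documents


class ToyFilterCache:
    """Toy model: one cached doc-id set per (segment, field, value)."""

    def __init__(self) -> None:
        self._cache: Dict[Tuple[int, str, str], Set[int]] = {}

    def term(self, seg_id: int, segment: Segment, field: str, value: str) -> Set[int]:
        key = (seg_id, field, value)
        if key not in self._cache:
            # Computed over every document in the segment,
            # never over the output of another filter.
            self._cache[key] = {
                doc_id for doc_id, doc in enumerate(segment) if doc.get(field) == value
            }
        return self._cache[key]


def bool_must(cache: ToyFilterCache, segments: List[Segment],
              terms: List[Tuple[str, str]]) -> List[Tuple[int, int]]:
    """Intersect the cached per-segment sets of all term filters."""
    hits: List[Tuple[int, int]] = []
    for seg_id, segment in enumerate(segments):
        matching = set(range(len(segment)))
        for field, value in terms:
            matching &= cache.term(seg_id, segment, field, value)
        hits.extend((seg_id, doc_id) for doc_id in sorted(matching))
    return hits


# Made-up sample data: two segments with invoices of two accounts.
segments: List[Segment] = [
    [{"status": "paid", "account": "123456"}, {"status": "paid", "account": "999"}],
    [{"status": "unpaid", "account": "123456"}, {"status": "paid", "account": "123456"}],
]

cache = ToyFilterCache()
account_first = bool_must(cache, segments, [("account", "123456"), ("status", "paid")])
status_first = bool_must(cache, segments, [("status", "paid"), ("account", "123456")])
print(account_first == status_first)
print(cache.term(0, segments[0], "status", "paid"))
```

In this model the cached "status" set for the first segment also contains the paid invoice of account 999, even when the "account" filter comes first, so any other account can reuse it.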
Caching is enabled by default for the term filters, but disabled for the
bool filter. You can enable it if you think users will be reusing the
filter.
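For reference, a small Python sketch that builds the request body with caching switched on for the bool filter. It assumes the "_cache" flag of the 1.x filter DSL (the same generation as the "filtered" query in the question), so please check the documentation for your version. The helper paid_invoices_query is a made-up name.

```python
import json


def paid_invoices_query(account: str, cache_bool: bool = False) -> dict:
    """Build the filtered query from the question as a Python dict."""
    bool_filter = {
        "must": [
            {"term": {"account": account}},
            {"term": {"status": "paid"}},
        ]
    }
    if cache_bool:
        # Assumption: "_cache" is the per-filter caching flag in the 1.x filter DSL.
        bool_filter["_cache"] = True
    return {"query": {"filtered": {"filter": {"bool": bool_filter}}}}


print(json.dumps(paid_invoices_query("123456", cache_bool=True), indent=2))
```

Caching the whole bool filter stores one entry per combination of clauses, so it probably only pays off when the same account and status pair is requested repeatedly.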
A follow-up question: if the filter is cached for a specific value, say
{ "term": { "status": "paid" } }, will this somehow magically speed up the
query when searching for "status": "unpaid"? I'm not talking about a "not"
operation, but simply replacing the value with something else (like when
creating an index in an RDBMS).
Term filters already use Lucene's term dictionary as an index. Almost
everything Elasticsearch does uses it. In fact, term queries are so fast
that Elasticsearch switched them from being cached by default to uncached
by default (I don't have the version number handy). For the most part I
wouldn't worry about them. When I have a request that is slow, I tend to
remove parts of it until I find the slow bit. That works well if the speed
issue is CPU related, which most stuff seems to be.
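A rough sketch of that "remove parts until you find the slow bit" approach, in Python. Everything here is hypothetical: run_search stands for whatever function sends a request body to your cluster, and the helper names are made up.

```python
import time
from typing import Callable, Dict, List


def variants_without_one_clause(clauses: List[dict]) -> Dict[str, dict]:
    """Build the full request plus one request per dropped must clause."""
    def body(must: List[dict]) -> dict:
        return {"query": {"filtered": {"filter": {"bool": {"must": must}}}}}

    variants = {"all clauses": body(list(clauses))}
    for i in range(len(clauses)):
        variants[f"without clause {i}"] = body(clauses[:i] + clauses[i + 1:])
    return variants


def time_variants(variants: Dict[str, dict],
                  run_search: Callable[[dict], object],
                  repeats: int = 5) -> Dict[str, float]:
    """Return the best wall-clock time in seconds for each variant."""
    timings: Dict[str, float] = {}
    for name, request in variants.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run_search(request)  # hypothetical: sends the request to the cluster
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings


clauses = [
    {"term": {"status": "paid"}},
    {"term": {"account": "123456"}},
]
variants = variants_without_one_clause(clauses)
```

If dropping one clause makes the request much faster than the others, that clause is the one to look at. Keep in mind that repeated runs will hit warm caches, so the first run of each variant may look quite different from the rest.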