Just Pushed: Numeric Range Filter


(Shay Banon) #1

Hi,

Just pushed support for the numeric_range filter. It's exactly like the range
filter in syntax, but uses the field data cache to perform the range filter
instead of the regular (Lucene) range filter.
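
To make "same syntax" concrete, here is a minimal sketch that builds both
filter bodies as Python dicts. The helper name `range_style_filter` and the
field `age` are hypothetical, and the `from` / `to` / `include_lower` /
`include_upper` parameter names are an assumption based on the range filter
syntax of this era; check the documentation for the version in use.

```python
import json


def range_style_filter(filter_name, field, lower, upper,
                       include_lower=True, include_upper=True):
    """Build a range-style filter body.

    Hypothetical helper for illustration only; it is not part of any client.
    """
    return {
        filter_name: {
            field: {
                "from": lower,
                "to": upper,
                "include_lower": include_lower,
                "include_upper": include_upper,
            }
        }
    }


# Same body, different filter name: only the execution strategy changes.
regular = range_style_filter("range", "age", 10, 20, False, False)
field_data = range_style_filter("numeric_range", "age", 10, 20, False, False)

print(json.dumps(field_data, indent=2))
```

Under these assumptions the two requests differ only in the filter name.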

How does it work? The regular range filter (which also handles numeric
values) uses the structure of how numeric data is indexed (Trie based) to
fetch all the matching docs and create a bitset for it (each position maps
to a doc_id, bit with value 1 means a hit). This is always computed,
regardless of the query executed. The result of the filter is cached by
elasticsearch, so any subsequent calls using the same range values will be
really fast as they don't have to be computed again. The reason this is
always cached is because it is already in the form (bitset) of a cached
filter result.
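
A toy sketch of that behaviour, assuming a sorted list of (value, doc_id)
pairs as a stand-in for the trie-based index structure (which is not modelled
here). The class name `ToyRangeFilter` and the sample ages are made up for
illustration.

```python
from bisect import bisect_left, bisect_right


class ToyRangeFilter:
    """Toy model of the regular range filter with a per-range result cache."""

    def __init__(self, values_by_doc):
        self.num_docs = len(values_by_doc)
        # Sorted (value, doc_id) pairs stand in for the indexed numeric terms.
        self.sorted_terms = sorted((v, d) for d, v in enumerate(values_by_doc))
        self.sorted_values = [v for v, _ in self.sorted_terms]
        self.cache = {}          # (lower, upper) -> bitset
        self.computations = 0    # how many times a bitset was actually built

    def bitset(self, lower, upper):
        """Return a bitset of docs with lower < value < upper."""
        key = (lower, upper)
        if key not in self.cache:
            self.computations += 1
            bits = [0] * self.num_docs   # one position per doc id in the index
            lo = bisect_right(self.sorted_values, lower)  # exclusive lower bound
            hi = bisect_left(self.sorted_values, upper)   # exclusive upper bound
            for _, doc_id in self.sorted_terms[lo:hi]:
                bits[doc_id] = 1
            self.cache[key] = bits
        return self.cache[key]


ages = [8, 13, 19, 25, 15, 40]   # position = doc id
f = ToyRangeFilter(ages)
teens = f.bitset(10, 20)         # computed over the whole index
again = f.bitset(10, 20)         # same range values: served from the cache
```

Note that the bitset covers every doc in the index, whatever the query
matched, which is why the result is directly cacheable.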

The numeric_range filter uses the field data cache in order to do the
filtering. The field data cache basically uninverts the index, and stores
the value(s) of a field indexed by doc id. The field data cache is used when
sorting, or when using facets. This will usually be much faster than the
regular range filter, as it will only compute and filter per doc (that the
master query matches on) and the computation is really fast. This comes at
the cost of loading all the field values to memory, which might be ok if
already using it for faceting / sorting. This filter result is not cached
by default (as caching requires passing all docs and computing against it).
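
A minimal sketch of the idea, assuming a single-valued field and a plain list
as the "uninverted" structure (position = doc id, entry = field value). The
function name and sample data are hypothetical; the real field data cache
also handles multi-valued fields, which this does not model.

```python
def numeric_range_filter(field_data, query_hits, lower, upper):
    """Keep only the query hits whose field value lies in (lower, upper).

    Only docs the query already matched are checked; no bitset over the
    whole index is built, so there is nothing ready-made to cache.
    """
    return [doc_id for doc_id in query_hits
            if lower < field_data[doc_id] < upper]


# Field values loaded into memory, indexed by doc id.
age_field_data = [8, 13, 19, 25, 15, 40]

# Doc ids matched by the main query (made-up example).
query_hits = [1, 3, 4]

matching = numeric_range_filter(age_field_data, query_hits, 10, 20)
```

The per-doc check is a simple array lookup plus two comparisons, but the whole
`age_field_data` list has to be in memory first.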

When do you use which? If you have an age filter for "teens" (>10, <20), then
using the regular range filter is a great choice. This filter is going to be
repeated a lot in different search operations, and the range filter caches
the results. No need to load the age field into the data cache.

If, on the other hand, you have a range filter that can't be cached easily
(since it does not have repetitive from/to values), then numeric_range is a
great candidate to give that query a boost.

As a side note, the best solution is for things to be automatic and not
exposed to the user (at least the defaults should be). I am working on
trying to write something that will automatically use the numeric_range
filter if the field is already loaded on the field data cache.

-shay.banon


(Otis Gospodnetić) #2

Hi,

Q: why is this called the numeric_range filter when it sounds like
(I could be wrong, of course), that the main distinction between this
and the regular range filters (which also handle numerical ranges) is
that the new one is cached?

Thanks,
Otis

On Oct 16, 1:09 pm, Shay Banon shay.ba...@elasticsearch.com wrote:



(Shay Banon) #3

Hey,

The filter itself (the result of executing it) is not cached, that's not
the difference. The difference is that the numeric_range filter works with a
construct similar to Lucene's FieldCache to execute the filter.

The results of both filters can be cached as well. The range filter result
is cached by default, while the numeric_range result is not.

To be honest, I'm not too crazy about the name itself, it's just that
something like field_data_cache_numeric_range is a bit too much ;)

-shay.banon

On Sun, Oct 17, 2010 at 2:58 PM, Otis otis.gospodnetic@gmail.com wrote:


