Filter bitsets

I'm trying to optimize filter queries for performance and am slightly
confused by the online docs. Looking at:

  1. https://www.elastic.co/blog/all-about-elasticsearch-filter-bitsets
  2. http://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-and-filter.html
  3. http://www.elastic.co/guide/en/elasticsearch/guide/current/_filter_order.html

#1 says that Bool filter uses bitsets, while And/Or/Not does doc-by-doc
matching.
#2 says that And result is optionally cacheable (implying that it uses
bitsets).
#3 says that Bool does doc-by-doc matching if the inner filters are not
cacheable.

This is confusing. Is there a clear guideline on when bitsets are used?

Let's say I have two high-cardinality fields, x and y. Field data for y is
loaded into memory, while x is not. What is the optimal way to structure
this query?

  "filter": {
    "and": [
      {
        "term": {
          "x": "F828477AF7",
          "_cache": false  // Don't want to cache since query will not be repeated
        }
      },
      {
        "range": {
          "y": {
            "gt": "CB70V63BD8AE"  // String range query, should only be executed on result of previous filters
          }
        }
      }
    ]
  }
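
For comparison, link #1 contrasts the "and" filter with the "bool" filter. A minimal sketch, assuming the ES 1.x filter DSL, of the same two clauses in both shapes, built as Python dicts so they can be serialized and sent as a request body. The variable names are made up for illustration, and nothing here settles which shape is faster for this particular case.

```python
import json

# The two inner clauses from the question, unchanged.
term_x = {"term": {"x": "F828477AF7", "_cache": False}}
range_y = {"range": {"y": {"gt": "CB70V63BD8AE"}}}

# Shape used in the question: clauses are applied in the listed order.
and_version = {"filter": {"and": [term_x, range_y]}}

# Same clauses expressed as "must" clauses of a bool filter.
bool_version = {"filter": {"bool": {"must": [term_x, range_y]}}}

print(json.dumps(bool_version, indent=2))
```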

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/52dd306b-d229-462b-8b3c-b9cb2fff8c5f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

There are several concepts:

  • filter operation (bool, range/geo/script)
  • filter composition (composable or not, composable means bitsets are used)
  • filter caching (ES stores filter results or not, if not cached, ES must
    walk doc-by-doc to apply filter)

#1 says you should pay attention to what kind of inner filters the
and/or/not filter wraps, and then arrange the filters in the right order to
avoid unnecessary work.
#2 Most filters are cacheable, but not by default. The docs try to
explain that the "and" filter consists of inner filter clauses, and what
happens because caching is off by default. I cannot see how this implies
bitsets.
#3 is a correct interpretation.

The use of bitsets is an indicator of composable filters: the
should/must/must_not filters use an internal Lucene bitset implementation
for efficient computation.
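
A minimal Python sketch of what "composable" means here, using plain integers as stand-in bitsets (bit i set means document i matches). The function names, the sample data, and the rule that at least one "should" clause must match are assumptions made for illustration; this is not Elasticsearch or Lucene code.

```python
def to_bitset(doc_ids):
    """Pack an iterable of document ids into an int used as a bitset."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits


def from_bitset(bits):
    """Return the sorted document ids whose bit is set."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


def compose_bool(must=(), should=(), must_not=(), num_docs=0):
    """Combine per-clause bitsets using bitwise operations only."""
    result = (1 << num_docs) - 1      # start with every document
    for bits in must:
        result &= bits                # must: intersection
    if should:
        any_should = 0
        for bits in should:
            any_should |= bits        # should: union
        result &= any_should
    for bits in must_not:
        result &= ~bits               # must_not: subtraction
    return result


x_matches = to_bitset([1, 3, 4, 7])
y_matches = to_bitset([0, 3, 4, 5])
excluded = to_bitset([4])

hits = compose_bool(must=[x_matches, y_matches], must_not=[excluded], num_docs=8)
print(from_bitset(hits))  # prints [3]
```

No per-document predicate is called during composition; each clause contributes one precomputed set and the combination is a handful of word-level operations.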

Jörg

On Thu, Mar 19, 2015 at 5:58 AM, Ashish Mishra laughingbuddha@gmail.com
wrote:

I'm trying to optimize filter queries for performance and am slightly
confused by the online docs. [...]

Not sure I understand the difference between composable vs. cacheable. Can
filters be cached without using bitsets? What format are the results
stored in, if not as bitsets?

In the example below, would the string range field "y" filter be evaluated
on every document in the index, or just on the documents matching the
previous field "x" filter?

Also, will "y" field data be loaded for all documents in the index, or just
for the documents matching the previous filter?

On Thursday, March 19, 2015 at 3:21:12 AM UTC-7, Jörg Prante wrote:

There are several concepts: [...]

Caching filters are implemented in ES, not in Lucene. E.g.
org.elasticsearch.common.lucene.search.CachedFilter is a class that
implements cached filters on top of the Lucene Filter class.

The "format" is not only bitsets. The Lucene filter instance is cached, no
matter whether it is backed by doc sets, bit sets, or something else. ES
code extends Lucene filters with several methods for fast evaluation and
traversal.
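
A rough Python sketch of the idea that caching is a wrapper around a filter and does not depend on how the result is represented. The classes TermFilter and CachingFilter, the cache key, and the per-segment layout are all hypothetical and only illustrate the principle; they do not mirror the actual ES CachedFilter class.

```python
class TermFilter:
    """Toy filter: matches documents whose field equals a value."""

    def __init__(self, field, value):
        self.field = field
        self.value = value
        self.evaluations = 0  # how many times a segment was really scanned

    def cache_key(self):
        return ("term", self.field, self.value)

    def doc_id_set(self, segment):
        self.evaluations += 1
        return frozenset(
            doc_id for doc_id, doc in enumerate(segment)
            if doc.get(self.field) == self.value
        )


class CachingFilter:
    """Wrap any filter and remember its result per segment."""

    def __init__(self, inner):
        self.inner = inner
        self._cache = {}

    def doc_id_set(self, segment_name, segment):
        key = (segment_name, self.inner.cache_key())
        if key not in self._cache:
            # The cached value is whatever the inner filter returned;
            # the wrapper does not care about its representation.
            self._cache[key] = self.inner.doc_id_set(segment)
        return self._cache[key]


segment = [{"x": "a"}, {"x": "b"}, {"x": "a"}]
inner = TermFilter("x", "a")
cached = CachingFilter(inner)

first = cached.doc_id_set("seg0", segment)
second = cached.doc_id_set("seg0", segment)
print(sorted(first), inner.evaluations)  # prints [0, 2] 1
```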

ES evaluates the filter in the given filter chain order, from outer to
inner (also called "top down").
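
To make the effect of ordering concrete, here is a small Python sketch of a doc-by-doc "and" that checks its clauses in the given order and stops at the first miss. The function and the sample documents are invented for illustration, and the sketch assumes short-circuit evaluation; it is not a description of the actual ES execution path.

```python
def and_filter(docs, predicates):
    """Doc-by-doc AND: check predicates in order, stop at the first miss."""
    calls = [0] * len(predicates)   # how often each predicate was evaluated
    matches = []
    for doc_id, doc in enumerate(docs):
        for i, predicate in enumerate(predicates):
            calls[i] += 1
            if not predicate(doc):
                break
        else:
            matches.append(doc_id)
    return matches, calls


def term_x(doc):
    return doc["x"] == "F828477AF7"


def range_y(doc):
    return doc["y"] > "CB70V63BD8AE"   # lexicographic string comparison


docs = [
    {"x": "F828477AF7", "y": "CB80"},
    {"x": "0000000000", "y": "ZZ00"},
    {"x": "F828477AF7", "y": "AA00"},
    {"x": "1111111111", "y": "DD00"},
]

print(and_filter(docs, [term_x, range_y]))  # prints ([0], [4, 2])
print(and_filter(docs, [range_y, term_x]))  # prints ([0], [4, 3])
```

Both orders return the same match, but the clause placed second is evaluated only for documents that passed the first, so putting the more selective clause first reduces the total work.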

When a series of boolean filters (i.e. should/must/must_not) is used, they
can be evaluated efficiently by composition. See
org.elasticsearch.common.lucene.search.XBooleanFilter for the composition
algorithm.

Field data will be loaded when a field is used for operations like filter
or sort. The higher the cardinality, the more effort is needed. This is
because the index is inverted.
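
A toy Python sketch of why an inverted index makes field data loading costly: the index maps each term to the documents containing it, so producing a per-document value means walking every distinct term. The function name and data are made up, and the sketch assumes a single-valued field; real field data structures are more compact.

```python
def uninvert(inverted_index, num_docs):
    """Turn term -> doc ids into a doc id -> term array (toy field data)."""
    field_data = [None] * num_docs
    for term, doc_ids in inverted_index.items():   # one pass per distinct term
        for doc_id in doc_ids:
            field_data[doc_id] = term
    return field_data


# High cardinality: almost every document has its own term.
inverted_y = {
    "AA00": [2],
    "CB80": [0],
    "DD00": [3],
    "ZZ00": [1],
}

field_data_y = uninvert(inverted_y, num_docs=4)
print(field_data_y)  # prints ['CB80', 'ZZ00', 'AA00', 'DD00']
```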

Jörg

On Fri, Mar 20, 2015 at 3:30 AM, Ashish Mishra laughingbuddha@gmail.com
wrote:

Not sure I understand the difference between composable vs. cacheable. [...]

Thanks for the clarification - this was very helpful.

On Friday, March 20, 2015 at 5:45:59 AM UTC-7, Jörg Prante wrote:

Caching filters are implemented in ES, not in Lucene. [...]