Advice needed for searching: filters vs. queries

brian_yoder · March 13, 2013, 6:44pm

I currently implement all my production application client queries directly
in Java, and use a BoolQueryBuilder to wrap all of my indexed field
queries. I currently only use a filter for geo distance queries. The
toString method creates a very nice pretty-printed JSON form of the search
that the Java API can accept for testing and demonstration purposes.

http://jontai.me/blog/2012/10/using-elasticsearch-to-speed-up-filtering/ is
an interesting article. I'm not using MongoDB; when using ElasticSearch it
is my one and only DB. So keeping _source enabled is necessary. And I've
already disabled the _all field and seen the greatly improved build results
he sees.

But the migration from queries to filters is what caught my attention. I
had already been looking at this, and have some questions:

Instead of using the static QueryBuilders.boolQuery method to create a
BoolQueryBuilder, I was considering using a FilterBuilders.boolFilter
method to create a BoolFilterBuilder. It seems to have the andFilter,
notFilter, and orFilter counterparts to the BoolQueryBuilder's must,
mustNot, and should methods. Is the only difference between queries and
filters really just scoring?

Do I really need to create a QueryBuilders.matchAll query builder and then
add filters to it?

Of course, there doesn't seem to be a counterpart for phrase matching in
the filter query world. So when I detect a blank inside a term string, I
create a phrase match query as follows:

MatchQueryBuilder mqb = matchPhraseQuery(field, qterm.getValue());
mqb.slop(qterm.getSlop());
return mqb;

But by default, I use the fieldQueryBuilder, since it automatically
recognizes strings such as A+B as a phrase, and it also recognizes certain
Chinese characters as individual words of a phrase. Very nice, and fully
compatible with values of one term or a phrase.

FieldQueryBuilder fqb = fieldQuery(field, qterm.getValue());
fqb.defaultOperator(FieldQueryBuilder.Operator.AND);
fqb.autoGeneratePhraseQueries(true);
fqb.enablePositionIncrements(true);
fqb.phraseSlop(qterm.getSlop());
return fqb;

Is there some requirement or benefit to constructing a search using a
top-level QueryBuilders.matchAll and adding the complex tree of filter
builders to it? Or can I bypass the query builders? Or is phrase matching
something that makes it impossible to generically throw either a single
term or a phrase into the query (as I can easily do with query builders)?

The caching isn't all that interesting: Ad-hoc queries that are complex
vary widely, and are rarely the same from call to call and across clients.
So once the search engine is warmed up, the non-cached steady state
response times are the most interesting. (For example, a cached query can
return in a few milliseconds, but the first instance of that query took 35
seconds and it only returned 2 matches across those 78M documents.)

Or do I really need to wait until I throw enough machines at this to wring
out the best performance from ad-hoc complex searches?

In the meantime, my application's most commonly used query is get-by-ID
(index.type.id) and that performs brilliantly fast when not cached, even
for databases that approach 100M documents. So I have some time to research
and experiment.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Clinton_Gormley · March 14, 2013, 11:17am

Hiya

> But the migration from queries to filters is what caught my attention.
> I had already been looking at this, and have some questions:
>
> Instead of using the static QueryBuilders.boolQuery method to create a
> BoolQueryBuilder, I was considering using a FilterBuilders.boolFilter
> method to create a BoolFilterBuilder. It seems to have the
> andFilter, notFilter, and orFilter counterparts to the
> BoolQueryBuilder's must, mustNot, and should methods. Is the only
> difference between queries and filters really just scoring?

Filters are faster, because they are simpler. They don't have to do any
scoring. On top of that, most filters can be cached in a compact bitset,
making them even faster when you reuse them.

Filters don't do full text analysis, and don't do scoring (although they
can be combined with the custom_filters_score query to influence
scoring).

So yes. Use filters wherever you can.

Bool filter vs and/or/not:

The bool filter consumes bitsets. Most filters produce bitsets, eg a
filter like { term: { status: "active" }} will examine every document in
the index and create a bitset for the entire index (one bit per
document) which contains '1' if the document matches, and '0' if it
doesn't.

Combining these bitsets is very efficient.
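
To make the bitset idea concrete, here is a minimal plain-Java sketch. It does not use the Elasticsearch or Lucene APIs; the class and method names (BitsetFilterSketch, termFilter, must) and the sample data are made up for illustration. It only shows why combining per-document bitsets is cheap: once each filter has produced its bits, a "must" is a bitwise AND.

```java
import java.util.BitSet;

// Plain-Java illustration only; this is not Elasticsearch or Lucene code.
class BitsetFilterSketch {

    // One bit per document: bit i is set when document i holds the wanted value.
    static BitSet termFilter(String[] fieldValues, String wanted) {
        BitSet bits = new BitSet(fieldValues.length);
        for (int doc = 0; doc < fieldValues.length; doc++) {
            if (wanted.equals(fieldValues[doc])) {
                bits.set(doc);
            }
        }
        return bits;
    }

    // "must" semantics: a document survives only if its bit is set in every bitset.
    static BitSet must(BitSet first, BitSet... rest) {
        BitSet result = (BitSet) first.clone(); // copy, so the inputs stay reusable
        for (BitSet other : rest) {
            result.and(other);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] status = {"active", "inactive", "active", "active"};
        String[] country = {"US", "US", "DE", "US"};
        BitSet active = termFilter(status, "active");
        BitSet us = termFilter(country, "US");
        System.out.println(must(active, us)); // prints {0, 3}
    }
}
```

Because the inputs are copied rather than modified, the same per-filter bitsets could be kept around and reused by later searches, which is the caching benefit mentioned above.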

However, certain filters (geo filters and numeric_range) don't create
bitsets. They examine each doc in turn. Running a geo-distance filter
on every doc in the index is heavy. You want to avoid that.

and/or/not filters don't demand bitsets. They work doc-by-doc, so
they're a good fit for geo filters. They also short-circuit. If a doc
has already been excluded by an earlier filter, it won't run the later
filters.

So to put it all together, combine the bitset filters with a bool
filter, and then combine the bool filter with the geo filter using an
'and' filter, listing the geo filter after the bool filter (see example
below).

> Do I really need to create a QueryBuilders.matchAll query builder and
> then add filters to it?

You can use a filtered query, or a constant score query:

curl -XGET 'http://127.0.0.1:9200/_all/_search?pretty=1' -d '
{
  "query" : {
    "constant_score" : {
      "filter" : {
        "and" : [
          {
            "bool" : {
              "must" : [
                {
                  "term" : {
                    "status" : "active"
                  }
                },
                {
                  "range" : {
                    "date" : {
                      "gte" : "2013-01-01"
                    }
                  }
                }
              ]
            }
          },
          {
            "geo_distance" : {
              "distance" : "10km",
              "location" : [
                0,
                0
              ]
            }
          }
        ]
      }
    }
  }
}
'
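
The ordering in the request above (cheap bitset filters first, the geo filter last) can be sketched in plain Java. This is an illustration under stated assumptions, not Elasticsearch code: AndFilterSketch, andFilter and withinDistance are hypothetical names, and the distance test is a simple planar one standing in for a real geo-distance calculation. The point is that the expensive check only runs for documents that survived the bitset.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Plain-Java illustration only; this is not Elasticsearch or Lucene code.
class AndFilterSketch {

    // Counts how often the expensive per-document check actually runs.
    static int distanceChecks = 0;

    // Hypothetical stand-in for a geo-distance test (planar distance, no real geodesy).
    static boolean withinDistance(double[] point, double[] center, double maxDistance) {
        distanceChecks++;
        double dx = point[0] - center[0];
        double dy = point[1] - center[1];
        return Math.sqrt(dx * dx + dy * dy) <= maxDistance;
    }

    // Walk only the documents already accepted by the bitset, then test distance.
    static List<Integer> andFilter(BitSet cheapMatches, double[][] locations,
                                   double[] center, double maxDistance) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = cheapMatches.nextSetBit(0); doc >= 0; doc = cheapMatches.nextSetBit(doc + 1)) {
            if (withinDistance(locations[doc], center, maxDistance)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        double[][] locations = {{0, 0}, {1, 1}, {50, 50}, {0.5, 0}};
        BitSet active = new BitSet(locations.length);
        active.set(0);
        active.set(2);
        active.set(3);
        List<Integer> hits = andFilter(active, locations, new double[] {0, 0}, 2.0);
        System.out.println(hits);           // prints [0, 3]
        System.out.println(distanceChecks); // prints 3: document 1 never reached the distance check
    }
}
```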

> Of course, there doesn't seem to be a counterpart for phrase matching
> in the filter query world. So when I detect a blank inside a term
> string, I create a phrase match query as follows:

Correct - any part of the query that relates to "full text search" is
better off handled by queries.

> But by default, I use the fieldQueryBuilder, since it automatically
> recognizes strings such as A+B as a phrase, and it also recognizes
> certain Chinese characters as individual words of a phrase. Very nice,
> and fully compatible with values of one term or a phrase.

The field/query_string query can be useful and powerful, but is
problematic. First, formatting the query correctly can be tricky - quite
often it is not obvious that it is not running the query that you expect
(it's a complicated syntax). Second, any syntax error will just cause
the query to fail - no results. Third, you're exposing your search to
abuse by very heavy queries, eg "a* b* c* d* e* ..." etc

I think that search keywords should be parsed by your application to
allow just the queries that you specifically want to allow.
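
One way to read that advice, sketched in plain Java: classify the user's input in the application and refuse anything that carries query-syntax operators, instead of handing raw text to the query_string parser. Everything here is an assumption made for illustration: the class name, the three categories, and the policy of rejecting every operator character are hypothetical, and a real application would choose its own rules.

```java
// Plain-Java illustration only; the policy below is a made-up example.
class KeywordParserSketch {

    enum Kind { TERM, PHRASE, REJECTED }

    // Characters that carry special meaning in Lucene-style query syntax.
    private static final String OPERATOR_CHARS = "+-!(){}[]^\"~*?:\\/&|";

    static boolean hasOperator(String text) {
        for (char c : text.toCharArray()) {
            if (OPERATOR_CHARS.indexOf(c) >= 0) {
                return true;
            }
        }
        return false;
    }

    // Decide what kind of query the application is willing to build for this input.
    static Kind classify(String input) {
        String text = input == null ? "" : input.trim();
        if (text.isEmpty() || hasOperator(text)) {
            return Kind.REJECTED; // e.g. "a* b* c*" never reaches the search engine
        }
        return text.contains(" ") ? Kind.PHRASE : Kind.TERM;
    }

    public static void main(String[] args) {
        System.out.println(classify("hartford"));     // prints TERM
        System.out.println(classify("New Hartford")); // prints PHRASE
        System.out.println(classify("a* b* c*"));     // prints REJECTED
    }
}
```

The application would then build a phrase query for PHRASE, a single-term query for TERM, and return an error (or strip the offending characters) for REJECTED.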

> Is there some requirement or benefit to constructing a search using a
> top-level QueryBuilders.matchAll and adding the complex tree of filter
> builders to it? Or can I bypass the query builders? Or is phrase
> matching something that makes it impossible to generically throw
> either a single term or a phrase into the query (as I can easily do
> with query builders)?

full text search -> use queries

use filters for the kind of thing you would normally express with SQL:

WHERE id IN (1,2,3)
AND ( date >= '2013-01-01' OR featured )
AND status = 'active'

that kind of thing.

> The caching isn't all that interesting: Ad-hoc queries that are
> complex vary widely, and are rarely the same from call to call and
> across clients. So once the search engine is warmed up, the non-cached
> steady state response times are the most interesting. (For example, a
> cached query can return in a few milliseconds, but the first instance
> of that query took 35 seconds and it only returned 2 matches across
> those 78M documents.)

Queries aren't cached, but a query is faster once the data required to
run the query is loaded into the kernel filesystem caches (which I think
is what you mean). Having lots of kernel cache space is good for
performance.

clint


brian_yoder · March 14, 2013, 6:50pm

Thank you so much for the carefully written and detailed explanation. It
will give me a lot of things to think about. And it would be an excellent
first draft for a tutorial on this subject!

Taking your suggestion to avoid the match_all query, I made a small change
to my test client. For this test, I was looking for all USA cities within
50km of San Jose, IL (there are two of them, including San Jose, IL
itself). The only geocoded data I have right now is the list of US and
Puerto Rico cities from the US Census. So I never noticed that my plain
distance query (find all geocoded things within 50km of a specified center
point) had a performance trap in it.

Here are the original and updated forms of that distance query. They're
actually implemented in the Java API, but my test client can emit the
pretty-printed JSON form that can then be pasted directly into a curl
command to make everyone on this newsgroup happy!

Original query:

curl -XGET 'http://localhost:9200/census/_search?pretty=true' -d'
{
  "from" : 0,
  "size" : 20,
  "query" : {
    "filtered" : {
      "query" : {
        "bool" : {
          "must" : {
            "match_all" : { }
          }
        }
      },
      "filter" : {
        "geo_distance" : {
          "location" : [ -89.604788, 40.303962 ],
          "distance" : "10.0km",
          "distance_type" : "arc"
        }
      }
    }
  },
  "version" : true,
  "explain" : false,
  "sort" : [ {
    "_geo_distance" : {
      "location" : [ -89.604788, 40.303962 ],
      "distance_type" : "arc"
    }
  } ]
}'

{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : null,
"hits" : [ {
"_index" : "census",
"_type" : "locality",
"_id" : "hr0GXt2ySYa_LUD2XdIPXQ",
"_version" : 1,
"_score" : null, "_source" : { "city" : "San Jose", "state" : "IL",
"location" : [ -89.604788, 40.303962 ] },
"sort" : [ 0.0 ]
}, {
"_index" : "census",
"_type" : "locality",
"_id" : "8OAy1qLVSLab8g48_m3QCg",
"_version" : 1,
"_score" : null, "_source" : { "city" : "Delavan", "state" : "IL",
"location" : [ -89.545651, 40.370835 ] },
"sort" : [ 8.967529897116258 ]
} ]
}
}

Updated query based on your suggestion:

curl -XGET 'http://localhost:9200/census/_search?pretty=true' -d'
{
  "from" : 0,
  "size" : 20,
  "query" : {
    "constant_score" : {
      "filter" : {
        "geo_distance" : {
          "location" : [ -89.604788, 40.303962 ],
          "distance" : "10.0km",
          "distance_type" : "arc"
        }
      }
    }
  },
  "version" : true,
  "explain" : false,
  "sort" : [ {
    "_geo_distance" : {
      "location" : [ -89.604788, 40.303962 ],
      "distance_type" : "arc"
    }
  } ]
}'

{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : null,
"hits" : [ {
"_index" : "census",
"_type" : "locality",
"_id" : "hr0GXt2ySYa_LUD2XdIPXQ",
"_version" : 1,
"_score" : null, "_source" : { "city" : "San Jose", "state" : "IL",
"location" : [ -89.604788, 40.303962 ] },
"sort" : [ 0.0 ]
}, {
"_index" : "census",
"_type" : "locality",
"_id" : "8OAy1qLVSLab8g48_m3QCg",
"_version" : 1,
"_score" : null, "_source" : { "city" : "Delavan", "state" : "IL",
"location" : [ -89.545651, 40.370835 ] },
"sort" : [ 8.967529897116258 ]
} ]
}
}

They both give the same results, and after a couple of runs each, they both
return in about 1 ms. But the second query was faster the first time.
(Small data set makes it difficult to truly measure the performance of this
particular type of query).

It'll take some time to digest the rest. Since I first wrote the
client, I've finally (with some help from this newsgroup!) mastered the
index settings and mappings, and have even created a tool that converts a
very simple high-level schema into the settings and mappings with all the
trimmings (character filters, token filters, custom analyzers, with all the
right languages, as needed). This makes it very nice to change mappings on
a whim during experimentation and testing. So now that I have your guidance,
I can take the next steps much more easily.


Clinton_Gormley · March 15, 2013, 11:15am

> They both give the same results, and after a couple of runs each, they
> both return in about 1 ms. But the second query was faster the first
> time. (Small data set makes it difficult to truly measure the
> performance of this particular type of query).

Actually, the difference between a filtered query with match_all and a
filter, and just a constant_score query with a filter should be minimal.
Any differences you saw were probably due to file system caching.

clint


shlomivaknin · March 17, 2013, 2:25pm

Hey,

you are right, I don't see a difference between match_all + filter and
constant_score + filter, but searching this way is always slower than a
term query.

here is my example:
"query": {
  "query_string": {
    "query": "+test +another"
  }
}

comes back after 2ms (average after many attempts) :
{:took 2, :timed_out false, :_shards {:total 5, :successful 5, :failed 0},
:hits {:total 1, :max_score 18.649187, :hits [{:_index test-index, :_type
test, :_id OG0xbcF-TEuNWCSGlSENhw, :_score 18.649187, :_source {...}}]}}

while both:
{:constant_score {:filter {:bool {:must [{:term {:gram "test"}} {:term
{:gram "another"}}]}}}}
and
{query: match_all} {:bool {:must [{:term {:gram "the_thing_is"}} {:term
{:gram "what_if_the"}}]}}

comes back after 23ms (average after many attempts):
{:took 23, :timed_out false, :_shards {:total 5, :successful 5, :failed 0},
:hits {:total 1, :max_score 1.0, :hits [{:_index test-index, :_type test,
:_id OG0xbcF-TEuNWCSGlSENhw, :_score 1.0, :_source {...}}]}}

how does that make sense?


Clinton_Gormley · March 18, 2013, 9:18am

> while both:
>
> {:constant_score {:filter {:bool {:must [{:term {:gram "test"}} {:term
> {:gram "another"}}]}}}}
>
> and
> {query: match_all} {:bool {:must [{:term {:gram "the_thing_is"}}
> {:term {:gram "what_if_the"}}]}}

Can you post the full search command that you use, in curl style - from
the above I don't know what phase you are using to run these particular
clauses

clint


shlomivaknin · March 18, 2013, 2:06pm

sure, here are both queries:
curl -XGET http://es-test:9200/test-index/test-type/_search/ -d
'{"filter":{"bool":{"must":[{"term":{"gram":"test"}},{"term":{"gram":"another"}}]}},"query":{"match_all":{}}}'
curl -XGET http://es-test:9200/test-index/test-type/_search/ -d '{"query":
{"constant_score": {"filter": {"bool": {"must": [{"term": {"gram": "test"}},
{"term": {"gram": "another"}}]}}}}}'

they both get me >20 ms,

while the following:
curl -XGET http://es-test:9200/test-index/test-type/_search/ -d
'{"query":{"field":{"gram":"+test +another"}}}'

returns in 2ms


Clinton_Gormley · March 18, 2013, 3:02pm

On Mon, 2013-03-18 at 07:06 -0700, Shlomi wrote:

> sure, here are both queries:
> curl -XGET http://es-test:9200/test-index/test-type/_search/ -d
> '{"filter":{"bool":{"must":[{"term":{"gram":"test"}},{"term":{"gram":"another"}}]}},"query":{"match_all":{}}}'

Using the top-level filter param means:

return all documents in the query (match all in this case)
calculate facets, if any
then filter results
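
The three steps above can be sketched in plain Java. This is only an illustration of the ordering, not Elasticsearch code; the class and method names and the "status" data are hypothetical. It shows the consequence of that order: the facet counts still cover every document the query matched, and the filter only narrows the hits that are returned afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; this is not Elasticsearch code.
class TopLevelFilterSketch {

    // Step 2: facets are counted over everything the query matched.
    static Map<String, Integer> facetCounts(List<String> queryMatches) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String status : queryMatches) {
            counts.merge(status, 1, Integer::sum);
        }
        return counts;
    }

    // Step 3: only now is the filter applied to produce the returned hits.
    static List<String> applyFilter(List<String> queryMatches, String wantedStatus) {
        List<String> hits = new ArrayList<>();
        for (String status : queryMatches) {
            if (status.equals(wantedStatus)) {
                hits.add(status);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Step 1: a match_all query matches every document (each represented by its status).
        List<String> queryMatches = Arrays.asList("active", "inactive", "active");
        System.out.println(facetCounts(queryMatches));           // prints {active=2, inactive=1}
        System.out.println(applyFilter(queryMatches, "active")); // prints [active, active]
    }
}
```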

> curl -XGET http://es-test:9200/test-index/test-type/_search/ -d
> '{"query": {"constant_score": {"filter": {"bool": {"must": [{"term":
> {"gram": "test"}},{"term": {"gram": "another"}}]}}}}}'

This looks good - I'm very surprised it is taking 20ms, as I'd expect
this to return in the 2-3 ms range

> they both get me >20 ms,
>
> while the following:
> curl -XGET http://es-test:9200/test-index/test-type/_search/ -d
> '{"query":{"field":{"gram":"+test +another"}}}'

Queries are fast and efficient, but the performance also depends on how
many docs match. In this case, you only have one matching doc, so they
get to show off their speed. But if you have 10 million docs which
match, all of which need to be scored, then it'd be best to apply
filters to them before scoring.

clint


shlomivaknin · March 18, 2013, 3:24pm

> Queries are fast and efficient, but the performance also depends on how
> many docs match. In this case, you only have one matching doc, so they
> get to show off their speed. But if you have 10 million docs which
> match, all of which need to be scored, then it'd be best to apply
> filters to them before scoring.

oh I see, that makes sense now, thanks!


phill · March 18, 2013, 6:45pm

On 3/18/2013 8:02 AM, Clinton Gormley wrote:

> But if you have 10 million docs which match, all of which need to be
> scored, then it'd be best to apply filters to them before scoring.

I'm confused on how to apply filters BEFORE queries. I always wanted to
apply filters BEFORE queries, but I don't see how to do that in the same
way that filters after queries work.

The places to put a filter are:

filter in the search request - as Clinton said, that really is the
last thing applied.
filtered query "A query that applies a filter to the results of
another query"

That sounds like post query filtering to me.
Therefore, I do NOT see where I can do filtering before scoring.

What is a normal pattern for creating queries for doing filtering BEFORE
scoring?
Is there something better than what I have below?
Wouldn't the following combine (i.e. coordination, as it is called) the
score for the filtering in the 1st "must" (even if boosted or const)
with other "should"s and "must"s?
Is there any way around a pre-filtering that doesn't affect the score in
some way?
Am I overly worried about this tweaking of the score by a pre-filter
sub-expression?

Why would I want to preserve the scoring and not have it affected by the
filtering? When a user writes a search to boost a term or phrase etc.,
it seems messy to have this other pre-filter expression
going into scoring, particularly when I am also embellishing the user's
query with "helpful" phrases and spans of my own based on the user's input.

My best try this morning at pre-filtering:
{
  "bool": {
    "must": [
      {
        "filtered": {
          "query": {
            "match_all": { }
          },
          "filter": {
            ... insert all of your pre-filtering here.
          }
        }
      },
      ... insert all of your other "must"s here
    ],
    "should": [
      ... insert all of your "should"s here
    ]
  }
}

-Paul


Clinton_Gormley · March 18, 2013, 7:38pm

Hi Paul

On Mon, 2013-03-18 at 11:47 -0700, P. Hill wrote:

> On 3/18/2013 8:02 AM, Clinton Gormley wrote:
>
> > But if you have 10 million docs which match, all of which need to be
> > scored, then it'd be best to apply filters to them before scoring.
>
> I'm confused on how to apply filters BEFORE queries. I always wanted to
> apply filters BEFORE queries, but I don't see how to do that in the same
> way that filters after queries work.
>
> The places to put a filter are:
>
> filter in the search request - as Clinton said, that really is the
> last thing applied.
>
> filtered query "A query that applies a filter to the results of
> another query"

Don't believe the docs

Pre 0.90 I believe the filter and query were executed in tandem, with
both the filter and the query advancing one doc at a time (the
"leapfrog" approach).

From 0.90 onwards, you can specify a "strategy" in the filtered query,
which can be set to:
query_first
random_access_always
leap_frog
random_access_THRESHOLD
leap_frog_query_first
leap_frog_filter_first

The default is to use random access where possible, and to fall back to
leap_frog_filter_first where not.

The random_access_THRESHOLD allows you to specify an integer THRESHOLD.
Not entirely sure what happens in this case, but hopefully will be
documented forthwith
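
As a rough mental model only (the details of each strategy are left open above), "random access" can be pictured as follows: the query produces its matching documents, and for each one the filter is consulted with a direct bit lookup rather than by iterating the filter. The plain-Java sketch below illustrates just that picture. It is not the XFilteredQuery implementation, the names are hypothetical, and it says nothing about how the THRESHOLD variant behaves.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Plain-Java illustration only; not the Elasticsearch or Lucene implementation.
class RandomAccessSketch {

    // For every document the query matched, ask the filter bitset directly.
    static List<Integer> filterByBitLookup(int[] queryDocs, BitSet filterBits) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : queryDocs) {
            if (filterBits.get(doc)) { // constant-time lookup, no walk over the filter
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] queryDocs = {1, 4, 7, 9};
        BitSet filterBits = new BitSet();
        filterBits.set(0);
        filterBits.set(4);
        filterBits.set(5);
        filterBits.set(9);
        System.out.println(filterByBitLookup(queryDocs, filterBits)); // prints [4, 9]
    }
}
```

This only works when the filter can answer "is document N accepted?" directly, which is why a fallback is needed for filters that can only be iterated.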

clint


brian_yoder · March 18, 2013, 7:57pm

It would seem that for each indexed term, the matching documents
(hopefully, the _id references to them) should be sub-ordered by _id.

Then, a doc-at-a-time would fetch and process a single document from each.
It would honor a minimum _id value and skip over any _id values less than
this minimum. If no matches from a must clause, the minimum _id could then
be advanced for all of the clauses. This would effectively skip huge swaths
of documents that would otherwise match each clause.
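
The skipping scheme described above can be sketched as an intersection of two sorted ID lists in plain Java. This is a minimal illustration, not Lucene code: the names are hypothetical, the IDs are plain integers, and advance does a simple linear scan where a real engine would presumably use skip data to jump ahead faster.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; not how Lucene stores or walks its postings.
class LeapfrogSketch {

    // Smallest index >= from whose value is >= target, or docs.length if there is none.
    static int advance(int[] docs, int from, int target) {
        int i = from;
        while (i < docs.length && docs[i] < target) {
            i++;
        }
        return i;
    }

    // Both arrays must be sorted ascending. Each side jumps to the other's current ID.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                hits.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i = advance(a, i, b[j]); // skip every ID in a below b's current ID
            } else {
                j = advance(b, j, a[i]); // skip every ID in b below a's current ID
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] clauseA = {1, 3, 5, 8, 13, 21};
        int[] clauseB = {2, 3, 8, 9, 21, 34};
        System.out.println(intersect(clauseA, clauseB)); // prints [3, 8, 21]
    }
}
```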

I don't know if Lucene could be readily taught to do this. But I know it
works from another "NoSQL" engine I created once. But it wasn't Lucene, and
Lucene is where the crowd hangs out!


phill · March 18, 2013, 8:08pm

On 3/18/2013 12:38 PM, Clinton Gormley wrote:

> Don't believe the docs
>
> Pre 0.90 I believe the filter and query were executed in tandem, with
> both the filter and the query advancing one doc at a time (the
> "leapfrog" approach).

Oh that NEEDS to be documented SO BAD!
random_access? You mention it is used, but didn't suggest what it is.
I don't have a guess.
What is wrong with (entire) filter_first, particularly combined with a
cached filter?

-Paul


Clinton_Gormley · March 18, 2013, 8:20pm

> Oh that NEEDS to be documented SO BAD!
> random_access? You mention it is used, but didn't suggest what it is.
> I don't have a guess.
> What is wrong with (entire) filter_first, particularly combined with a
> cached filter?

This may shed some more light on it:

Changing Bits: Fast search filters using flex

clint


Clinton_Gormley · March 18, 2013, 8:22pm

On Mon, 2013-03-18 at 12:57 -0700, InquiringMind wrote:

> It would seem that for each indexed term, the matching documents
> (hopefully, the _id references to them) should be sub-ordered by _id.

The main reason I gave up trying to figure out what the code does, was
that it was at odds with the comments. So I opened this instead:

GitHub issue (github.com/elastic/elasticsearch): Filter strategy comments and code inconsistent
Opened 07:37PM - 18 Mar 13 UTC by clintongormley; closed 08:24PM - 03 Jul 14 UTC

The comments in XFilteredQuery claim that the strategy will fall back to
LEAP_FROG_FILTER_FIRST
https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/common/lucene/search/XFilteredQuery.java#L186
but the code refers to:
- LEAP_FROG_QUERY_FIRST https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/common/lucene/search/XFilteredQuery.java#L208
- QUERY_FIRST https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/common/lucene/search/XFilteredQuery.java#L216
Is this logic correct?

clint


brian_yoder · March 18, 2013, 10:11pm

Clint,

Thanks for the link and for all your patient help and valuable information.
I also found your excellent SlideShare tutorial on searching which tied
together the information from your recent posts.

One additional question: Is there a similar situation with filters in that
some types of filters expect their filter values to already be analyzed,
while others analyze them?

One additional comment: The link you provided discusses the concept of
leapfrogging which speeds the intersection of a query and a filter. Now if
Lucene could apply this to the BoolQueryBuilder in which all of the must
terms (at least) were processed by leapfrogging, then complex queries could
be much faster. For example: (city:"New Hartford" AND (state:NY OR
state:CT))

Just a thought... Probably somewhat naive based on my lack of experience
with anything deep inside Lucene.

On Monday, March 18, 2013 4:20:53 PM UTC-4, Clinton Gormley wrote:

This may shed some more light on it:

Changing Bits: Fast search filters using flex

clint


Clinton_Gormley · March 19, 2013, 11:29am

Hiya

> One additional question: Is there a similar situation with filters in
> that some types of filters expect their filter values to already be
> analyzed, while others analyze them?

Filters are not analyzed.

> One additional comment: The link you provided discusses the concept of
> leapfrogging which speeds the intersection of a query and a filter.
> Now if Lucene could apply this to the BoolQueryBuilder in which all of
> the must terms (at least) were processed by leapfrogging, then complex
> queries could be much faster. For example: (city:"New Hartford" AND
> (state:NY OR state:CT))
>
> Just a thought... Probably somewhat naive based on my lack of
> experience with anything deep inside Lucene.

I have no idea

That said, I think where we would like to get to is to be able to
calculate the cost of individual clauses, and run the cheaper clause
first. So eg searching for "to appendiculate" would process the
"appendiculate" clause before the "to" clause.

But we're not there yet
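
The idea of running the cheaper clause first can be sketched in plain Java. This is an illustration of the goal described above, not of anything Elasticsearch or Lucene is stated to do here: the names are hypothetical, and "cost" is simply assumed to be the number of documents a term matches. The rare term drives the loop, and the common term is only probed for the few candidates that remain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; "cost" here is just the posting-list length.
class CostOrderingSketch {

    // Order the terms so that the one matching the fewest documents comes first.
    static List<String> cheapestFirst(Map<String, int[]> postings) {
        List<String> terms = new ArrayList<>(postings.keySet());
        terms.sort(Comparator.comparingInt((String term) -> postings.get(term).length));
        return terms;
    }

    // Documents containing every term. Each posting array must be sorted ascending.
    static List<Integer> matchAllTerms(Map<String, int[]> postings) {
        List<String> order = cheapestFirst(postings);
        List<Integer> hits = new ArrayList<>();
        if (order.isEmpty()) {
            return hits;
        }
        for (int doc : postings.get(order.get(0))) { // iterate only the rarest term
            boolean inAll = true;
            for (int k = 1; k < order.size() && inAll; k++) {
                inAll = Arrays.binarySearch(postings.get(order.get(k)), doc) >= 0;
            }
            if (inAll) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new LinkedHashMap<>();
        postings.put("to", new int[] {0, 1, 2, 3, 4, 5});
        postings.put("appendiculate", new int[] {3});
        System.out.println(cheapestFirst(postings)); // prints [appendiculate, to]
        System.out.println(matchAllTerms(postings)); // prints [3]
    }
}
```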

