Advice needed for searching: filters vs. queries

Hiya

But the migration from queries to filters is what caught my attention.
I had already been looking at this, and have some questions:

Instead of using the static QueryBuilders.boolQuery method to create a
BoolQueryBuilder, I was considering using a FilterBuilders.boolFilter
method to create a BoolFilterBuilder. It seems to have the
andFilter, notFilter, and orFilter counterparts to the
BoolQueryBuilder's must, mustNot, and should methods. Is the only
difference between queries and filters really just scoring?

Filters are faster, because they are simpler. They don't have to do any
scoring. On top of that, most filters can be cached in a compact bitset,
making them even faster when you reuse them.

Filters don't do full text analysis, and don't do scoring (although they
can be combined with the custom_filters_score query to influence
scoring).

So yes. Use filters wherever you can.
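
For example, in the Java API you mentioned, a cached, non-scoring
filter wrapped in a constant_score query might look like the sketch
below (a rough sketch against the FilterBuilders/QueryBuilders
classes - double-check method names such as cache() against your
client version):

import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class CachedFilterSketch {
    public static void main(String[] args) {
        // A term filter: no analysis, no scoring, and its bitset can be
        // cached so the next search reusing the same filter is cheap.
        QueryBuilder query = QueryBuilders.constantScoreQuery(
                FilterBuilders.termFilter("status", "active").cache(true));
        // pass 'query' to client.prepareSearch(...).setQuery(query)
    }
}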

Bool filter vs and/or/not:

The bool filter consumes bitsets. Most filters produce bitsets, eg a
filter like { term: { status: "active" }} will examine every document in
the index and create a bitset for the entire index (one bit per
document) which contains '1' if the document matches, and '0' if it
doesn't.

Combining these bitsets is very efficient.

However, certain filters (geo filters and numeric_range) don't create
bitsets. They examine each doc in turn. Running a geo-distance filter
on every doc in the index is heavy. You want to avoid that.

and/or/not filters don't demand bitsets. They work doc-by-doc, so
they're a good fit for geo filters. They also short-circuit. If a doc
has already been excluded by an earlier filter, it won't run the later
filters.

So to put it all together: combine the bitset filters with a bool
filter, then combine that bool filter with the geo filter using an
'and' clause, with the geo filter listed last (see the example below).

Do I really need to create a QueryBuilders.matchAll query builder and
then add filters to it?

You can use a filtered query, or a constant score query:

curl -XGET 'http://127.0.0.1:9200/_all/_search?pretty=1' -d '
{
   "query" : {
      "constant_score" : {
         "filter" : {
            "and" : [
               {
                  "bool" : {
                     "must" : [
                        {
                           "term" : {
                              "status" : "active"
                           }
                        },
                        {
                           "range" : {
                              "date" : {
                                 "gte" : "2013-01-01"
                              }
                           }
                        }
                     ]
                  }
               },
               {
                  "geo_distance" : {
                     "distance" : "10km",
                     "location" : [ 0, 0 ]
                  }
               }
            ]
         }
      }
   }
}
'
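
For reference, a rough Java equivalent of that request, using the
FilterBuilders/QueryBuilders classes you mentioned (just a sketch -
exact method names may differ in your client version):

import org.elasticsearch.index.query.FilterBuilder;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class ConstantScoreSketch {
    public static void main(String[] args) {
        // Bitset-producing filters (term, range) go into a bool filter.
        FilterBuilder bitsetFilters = FilterBuilders.boolFilter()
                .must(FilterBuilders.termFilter("status", "active"))
                .must(FilterBuilders.rangeFilter("date").gte("2013-01-01"));

        // The doc-by-doc geo filter is combined with it via an 'and'
        // filter, listed last so it only sees docs that passed the bool.
        FilterBuilder combined = FilterBuilders.andFilter(
                bitsetFilters,
                FilterBuilders.geoDistanceFilter("location")
                        .point(0, 0)
                        .distance("10km"));

        // No match_all needed: wrap the filter in a constant_score query.
        QueryBuilder query = QueryBuilders.constantScoreQuery(combined);
        // Or use a filtered query if you also have a scoring query:
        // QueryBuilders.filteredQuery(QueryBuilders.matchAllQuery(), combined);
    }
}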

Of course, there doesn't seem to be a counterpart for phrase matching
in the filter world. So when I detect a space inside a term string, I
create a phrase match query as follows:

Correct - any part of the query that relates to "full text search" is
better off handled by queries.
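
For example, something like the following sketch, which combines a
match_phrase query (analysed, scored) with a term filter (not
analysed, not scored, cacheable) via a filtered query - the "title"
field here is made up:

import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class PhrasePlusFilterSketch {
    public static void main(String[] args) {
        // The full text part is a query, so it gets analysed and scored.
        QueryBuilder phrase =
                QueryBuilders.matchPhraseQuery("title", "quick brown fox");

        // The structured part stays a filter, glued on with a filtered query.
        QueryBuilder query = QueryBuilders.filteredQuery(
                phrase,
                FilterBuilders.termFilter("status", "active"));
    }
}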

But by default, I use the fieldQueryBuilder, since it automatically
recognizes strings such as A+B as a phrase, and it also recognizes
certain Chinese characters as individual words of a phrase. Very nice,
and it works equally well whether the value is a single term or a
phrase.

The field/query_string query can be useful and powerful, but it is
problematic. First, formatting the query correctly can be tricky - the
syntax is complicated, and quite often it is not obvious that the query
actually being run is not the one you expect. Second, any syntax error
will just cause the query to fail - no results. Third, you're exposing
your search to abuse by very heavy queries, eg "a* b* c* d* e* ..." etc

I think that search keywords should be parsed by your application to
allow just the queries that you specifically want to allow.
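
For example (a purely hypothetical parse step - the "title" field and
the whitespace rule are placeholders for whatever your application
decides to allow):

import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class UserInputSketch {
    public static void main(String[] args) {
        QueryBuilder single = parseUserInput("elasticsearch");   // match
        QueryBuilder phrase = parseUserInput("quick brown fox"); // phrase
    }

    // Treat input containing whitespace as a phrase, otherwise as a
    // single-term match, rather than feeding raw input to query_string.
    static QueryBuilder parseUserInput(String input) {
        String trimmed = input.trim();
        if (trimmed.matches(".*\\s.*")) {
            return QueryBuilders.matchPhraseQuery("title", trimmed);
        }
        return QueryBuilders.matchQuery("title", trimmed);
    }
}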

Is there some requirement or benefit to constructing a search using a
top-level QueryBuilders.matchAll and adding the complex tree of filter
builders to it? Or can I bypass the query builders? Or is phrase
matching something that makes it impossible to generically throw
either a single term or a phrase into the query (as I can easily do
with query builders)?

full text search -> use queries

use filters for the kind of thing you would normally express with SQL:

WHERE id IN (1,2,3)
AND ( date >= '2013-01-01' OR featured )
AND status = 'active'

that kind of thing.
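
In filter builders, that WHERE clause maps to something like this
sketch (field names come straight from the SQL above; as always, check
the exact method names against your client version):

import org.elasticsearch.index.query.FilterBuilder;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class SqlLikeFilterSketch {
    public static void main(String[] args) {
        // WHERE id IN (1,2,3)
        //   AND ( date >= '2013-01-01' OR featured )
        //   AND status = 'active'
        FilterBuilder where = FilterBuilders.boolFilter()
                .must(FilterBuilders.termsFilter("id", 1, 2, 3))
                .must(FilterBuilders.boolFilter()
                        .should(FilterBuilders.rangeFilter("date").gte("2013-01-01"))
                        .should(FilterBuilders.termFilter("featured", true)))
                .must(FilterBuilders.termFilter("status", "active"));

        QueryBuilder query = QueryBuilders.constantScoreQuery(where);
    }
}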

The caching isn't all that interesting to me: complex ad-hoc queries
vary widely, and are rarely the same from call to call or across
clients. So once the search engine is warmed up, the non-cached
steady-state response times are the most interesting. (For example, a
cached query can return in a few milliseconds, but the first instance
of that query took 35 seconds and it only returned 2 matches across
those 78M documents.)

Queries aren't cached, but a query is faster once the data required to
run the query is loaded into the kernel filesystem caches (which I think
is what you mean). Having lots of kernel cache space is good for
performance.

clint
