No noticeable change in response time with "filtered" when the query is simple

Consider the following simple query:

{
  "query": {
    "match_all": {}
  },
  "filter": {
    "bool": {
      "must": [
        { "terms": { "gender": ["m"] } }
      ]
    }
  },
  "sort": [
    { "sub": { "order": "desc" } }
  ],
  "from": 10,
  "size": 10
}

This was taking 800ms when run on 50 million records. I tried speeding
things up using "filtered", but the response time remains the same:

{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": [
            { "terms": { "gender": ["m"] } }
          ]
        }
      }
    }
  },
  "sort": [
    { "sub": { "order": "desc" } }
  ],
  "from": 10,
  "size": 10
}

Note that in these two queries, the "must" parameter is:

[{"terms":{"gender":["m","f",""]}}]

If I increase the "must" parameters to

[
{"term":{"sr_loc":"1"}},
{"range":{"birth_es_date":{"from":"19770101","to":"19970527"}}},
{"term":{"loc":"SA"}},
{"terms":{"gender":["f"]}}
]

Then there is a huge difference before and after the "filtered"
optimization (response time drops from 800ms to 30ms).

Is it because the simpler "must" parameter returns a much larger result set
which cannot be cached?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi

So first, the top level "filter" parameter should only be used when you
want to facet on unfiltered results but filter the search results. At
all other times, you should use a "filtered" query instead.
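
To make the two request shapes concrete, here is a minimal Python sketch that rewrites a body from the top-level "filter" form into the "filtered" query form. The helper name to_filtered_query is made up for illustration; it only rearranges a plain dict and never talks to a cluster.

```python
import copy


def to_filtered_query(body):
    """Move a top-level "filter" into a "filtered" query (hypothetical helper).

    Returns a new dict; the input body is left untouched.
    """
    new_body = copy.deepcopy(body)
    top_filter = new_body.pop("filter", None)
    if top_filter is None:
        # Nothing to move: the request has no top-level filter.
        return new_body
    # Fall back to match_all when the request has no query of its own.
    inner_query = new_body.pop("query", {"match_all": {}})
    new_body["query"] = {"filtered": {"query": inner_query, "filter": top_filter}}
    return new_body


# The request body from the original post, in top-level "filter" form.
original = {
    "query": {"match_all": {}},
    "filter": {"bool": {"must": [{"terms": {"gender": ["m"]}}]}},
    "sort": [{"sub": {"order": "desc"}}],
    "from": 10,
    "size": 10,
}
rewritten = to_filtered_query(original)
```

If facets still need to be computed over the unfiltered results, the top-level form is the one to keep.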

That said, your filter on gender probably matches about 25 million
documents, which you are then sorting. I'm guessing that the sort is
taking a disproportionate amount of time.

Normally, a filtered query will try to apply the filters before running the
query. In your example, where your filter matches lots of documents, if you
were to combine it with a simple query (e.g. { match: { name: "ezekiel" }}),
then the query may actually be faster than the filter, as "ezekiel" is
likely to appear in far fewer documents than gender "m". The filtered
query does try to detect these anomalies, but this can also be controlled
by the undocumented "strategy" parameter.

However, queries usually come from users, and it is difficult to know in
advance if they are going to be simple (and fast) or complex (and slower)
queries. Using the default strategy for filtered at least gives you some
consistency in response times, by reducing the total number of docs that
the query has to examine.

Your example where you include several must clauses is doing just that -
reducing the total number of docs that the query needs to examine by a much
larger percentage than your first query.
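
A rough way to picture this is a toy model in plain Python, assuming the cost of the sort grows with the number of docs that survive the filters. This is not how Elasticsearch executes anything; the doc count, the "loc" values and the range of "sub" are invented for the example.

```python
import heapq
import random


def filter_then_sort(docs, predicate, size):
    """Apply a filter, then pick the top `size` docs by "sub" descending."""
    matched = [d for d in docs if predicate(d)]
    top = heapq.nlargest(size, matched, key=lambda d: d["sub"])
    return len(matched), top


rng = random.Random(0)  # fixed seed so runs are repeatable
docs = [
    {
        "gender": rng.choice("mf"),
        "loc": rng.choice(["SA", "US", "UK", "DE"]),
        "sub": rng.randrange(1000),
    }
    for _ in range(100_000)
]

# One broad clause versus two clauses combined with "must".
broad_count, _ = filter_then_sort(docs, lambda d: d["gender"] == "m", 20)
narrow_count, _ = filter_then_sort(
    docs, lambda d: d["gender"] == "f" and d["loc"] == "SA", 20
)
print(broad_count, narrow_count)
```

The second call hands far fewer docs to the sort step, which is the effect the extra must clauses have.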

Note: all cached filters are the same size. The size doesn't depend on how
many docs match. Each one uses a bitset to represent every doc in the index,
with each bit set to either 1 or 0.
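
As a back-of-the-envelope sketch of that point, assuming exactly one bit per doc and nothing else (the real cache layout may differ, and the function names here are made up):

```python
def bitset_bytes(num_docs):
    """Bytes needed for one bit per document, rounded up to whole bytes."""
    return (num_docs + 7) // 8


def make_bitset(num_docs, matching_ids):
    """Build a bitset with the bit for each matching doc id set to 1."""
    bits = bytearray(bitset_bytes(num_docs))
    for doc_id in matching_ids:
        bits[doc_id // 8] |= 1 << (doc_id % 8)
    return bits


# 50 million docs, as in the original post.
size_for_50m = bitset_bytes(50_000_000)
print(size_for_50m)  # 6250000 bytes, roughly 6 MB

# Same length whether one doc matches or half of them do.
few = make_bitset(1000, [3])
many = make_bitset(1000, range(0, 1000, 2))
print(len(few), len(many))  # 125 125
```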

Clint

(The above replies to Martin Konecny's message of 10 July 2013 20:56.)

There actually aren't any "text searches" being done by the user, hence the
match_all:{} parameter. We are simply providing a UI for the user to
filter all users based on some criteria, with no text input for searching.

Is there some optimization I could do in this case? I understand your point
that the filtered result set is very large, and therefore the query
takes a long time on it, but in this case elasticsearch should not be
applying a query to the result set (there is no query!).

M

(The above replies to Clinton Gormley's message of Wednesday, 10 July 2013 15:39:09 UTC-4.)

Hi Martin

On 10 July 2013 22:07, Martin Konecny martin.konecny@gmail.com wrote:

> There actually aren't any "text searches" being done by the user, hence the
> match_all:{} parameter. We are simply providing a UI for the user to
> filter all users based on some criteria, with no text input for searching.
>
> Is there some optimization I could do in this case? I understand your point
> that the filtered result set is very large, and therefore the query
> takes a long time on it, but in this case elasticsearch should not be
> applying a query to the result set (there is no query!).

Sure - the match_all query is optimized for such cases already. So
essentially just the filter is being applied. As I said, I reckon that
most of the time is being consumed by sorting 25 million documents.

In order to optimize this, you want to reduce the number of documents that
match. You're sorting on the "sub" field. I have no idea what this field
contains, but I'll pretend that it has values 1..100. If you know that you
have, say, more than 1000 results where gender=m and sub > 90, then just add
in another filter on sub, e.g.:

{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            { "range": { "sub": { "gte": 90 }}},
            { "term": { "gender": "m" }}
          ]
        }
      }
    }
  },
  "sort": { "sub": { "order": "desc" }}
}
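
The reasoning behind this trick can be checked with a small simulation. The sketch below is plain Python over invented data (50,000 docs, "sub" uniform in 1..100, as pretended above); it only shows that the top of the sorted list is unchanged when the threshold still leaves enough candidates.

```python
import heapq
import random


def top_by_sub(docs, size, min_sub=None):
    """Return (candidate count, top `size` docs by "sub" desc) for gender "m"."""
    candidates = [
        d for d in docs
        if d["gender"] == "m" and (min_sub is None or d["sub"] >= min_sub)
    ]
    top = heapq.nlargest(size, candidates, key=lambda d: d["sub"])
    return len(candidates), top


rng = random.Random(1)  # fixed seed so runs are repeatable
docs = [
    {"id": i, "gender": rng.choice("mf"), "sub": rng.randint(1, 100)}
    for i in range(50_000)
]

full_count, full_top = top_by_sub(docs, size=20)
cut_count, cut_top = top_by_sub(docs, size=20, min_sub=90)

# The sort values of the first page are identical; only the work differs.
same_sort_values = [d["sub"] for d in full_top] == [d["sub"] for d in cut_top]
print(full_count, cut_count, same_sort_values)
```

The shortcut is only safe while at least from + size docs clear the threshold; deeper pages would need a lower cut-off.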

Hi Clinton,

On Wed, Jul 10, 2013 at 12:39 PM, Clinton Gormley clint@traveljury.com wrote:

> So first, the top level "filter" parameter should only be used when you
> want to facet on unfiltered results but filter the search results.
> At all other times, you should use a "filtered" query instead.

I am curious about your statement regarding facets on unfiltered results.
In the past, I have seen no result differences with facets using
query+filter versus filtered queries. What differences should occur?
Ultimately I use facet filters (using the same filter) so the type of query
shouldn't matter, but I haven't done much testing. Wondering if there is
something that I might have missed.

Ivan

I just ran some tests, and there is a difference between the queries. It has
been a while since I compared different queries. I might have to change
my logic around. Thanks to Martin for starting this conversation.
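
For anyone following along, the kind of difference in question can be shown with a toy model. This is plain Python over four invented docs, assuming (as described earlier in the thread) that a top-level filter narrows the hits but not the facets, while a filtered query narrows both. The function names are made up.

```python
from collections import Counter

docs = [
    {"gender": "m", "loc": "SA"},
    {"gender": "m", "loc": "US"},
    {"gender": "f", "loc": "SA"},
    {"gender": "f", "loc": "SA"},
]


def is_male(doc):
    return doc["gender"] == "m"


def search_with_top_level_filter(docs, flt, facet_field):
    """Facets see every doc the query matched; only the hits are filtered."""
    facet = Counter(d[facet_field] for d in docs)
    hits = [d for d in docs if flt(d)]
    return hits, facet


def search_with_filtered_query(docs, flt, facet_field):
    """The filter is part of the query, so facets see only the filtered docs."""
    hits = [d for d in docs if flt(d)]
    facet = Counter(d[facet_field] for d in hits)
    return hits, facet


hits_a, facet_a = search_with_top_level_filter(docs, is_male, "loc")
hits_b, facet_b = search_with_filtered_query(docs, is_male, "loc")
print(dict(facet_a))  # {'SA': 3, 'US': 1}
print(dict(facet_b))  # {'SA': 1, 'US': 1}
```

The hits are the same in both cases; only the facet counts differ.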

--
Ivan

(The above replies to Ivan Brusic's message of Thu, Jul 11, 2013 at 12:03 PM.)