Speed of query with many filters


(Michael Korbakov) #1

Hi everybody.

We have a large index, "contacts", of about 3.5 GB (I mean "primary_size") holding 6,448,782 documents. We have performance problems with particular queries: their execution time is over 5 seconds.

Our index is configured with 2 replicas and 3 shards. It runs on 3 EC2 m1.large servers.
All fields in the index documents are not analyzed, and _source and _all are disabled.
Each document has 2 big fields: "fields" and "reverse_fields". The latter is a reversed version of "fields", designed to match words from the end.
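
To illustrate the "reverse_fields" idea described above, here is a minimal sketch of how a suffix match can be expressed as a prefix match on a reversed copy of the value. The helper names are hypothetical and are not taken from the poster's code.

```python
def reverse_value(value):
    """Reverse a string so that a suffix of the original becomes a prefix."""
    return value[::-1]


def ends_with_via_reversed_prefix(stored_value, suffix):
    # A prefix match on the reversed value is equivalent to a suffix match
    # on the original value.
    return reverse_value(stored_value).startswith(reverse_value(suffix))


print(reverse_value("alexander"))
print(ends_with_via_reversed_prefix("alexander", "ander"))
```

This is why every prefix filter in the query below comes in pairs: one on the original field and one on the reversed field with the reversed search text.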

Here is part of the "fields" mapping:
...
"fields": {
    "type": "object",
    "dynamic": False,
    "properties": {
        .....
        "city": {
            "type": "string",
            "index": "not_analyzed",
            "omit_term_freq_and_positions": "true"
        },
        "state": {
            "type": "string",
            "index": "not_analyzed",
            "omit_term_freq_and_positions": "true"
        },
        "zip": {
            "type": "string",
            "index": "not_analyzed",
            "omit_term_freq_and_positions": "true"
        },
        ....
    }
}
.....
}

We are looking for ways to speed up our "contains" query over all document fields. The first 2 filter terms (company_id and is_visible) match ~43k documents.
Example of the query:

es_q = {'sort':
[{'_score': 'desc'}], 'query': {'filtered': {'filter': {'and': {
'filters': [{'term': {'company_id': '4b619ddffa5bd81b71000002'}}, {'term': {'is_visible': True}}, {
'or': [{'prefix': {'fields.last_name': 'alexander'}}, {'prefix': {'reverse_fields.last_name': 'rednaxela'}},
{'prefix': {'fields.twitter.profile': 'alexander'}},
{'prefix': {'reverse_fields.twitter.profile': 'rednaxela'}},
{'prefix': {'fields.twitter.user_name': 'alexander'}},
{'prefix': {'reverse_fields.twitter.user_name': 'rednaxela'}},
{'prefix': {'fields.twitter.user_id': 'Alexander'}},
{'prefix': {'reverse_fields.twitter.user_id': 'rednaxelA'}},
{'prefix': {'fields.linkedin.profile': 'alexander'}},
{'prefix': {'reverse_fields.linkedin.profile': 'rednaxela'}},
{'prefix': {'fields.linkedin.user_name': 'alexander'}},
{'prefix': {'reverse_fields.linkedin.user_name': 'rednaxela'}},
{'prefix': {'fields.linkedin.user_id': 'Alexander'}},
{'prefix': {'reverse_fields.linkedin.user_id': 'rednaxelA'}}, {'prefix': {'fields.street': 'alexander'}}
, {'prefix': {'reverse_fields.street': 'rednaxela'}},
{'prefix': {'fields.skype.profile': 'alexander'}},
{'prefix': {'reverse_fields.skype.profile': 'rednaxela'}},
{'prefix': {'fields.skype.user_name': 'alexander'}},
{'prefix': {'reverse_fields.skype.user_name': 'rednaxela'}},
{'prefix': {'fields.skype.user_id': 'Alexander'}},
{'prefix': {'reverse_fields.skype.user_id': 'rednaxelA'}}, {'prefix': {'fields.city': 'alexander'}},
{'prefix': {'reverse_fields.city': 'rednaxela'}}, {'prefix': {'fields.first_name': 'alexander'}},
{'prefix': {'reverse_fields.first_name': 'rednaxela'}}, {'prefix': {'fields.zip': 'alexander'}},
{'prefix': {'reverse_fields.zip': 'rednaxela'}}, {'prefix': {'fields.title': 'alexander'}},
{'prefix': {'reverse_fields.title': 'rednaxela'}}, {'prefix': {'fields.state': 'alexander'}},
{'prefix': {'reverse_fields.state': 'rednaxela'}}, {'prefix': {'fields.leadSource': 'alexander'}},
{'prefix': {'reverse_fields.leadSource': 'rednaxela'}}, {'prefix': {'fields.company_name': 'alexander'}}
, {'prefix': {'reverse_fields.company_name': 'rednaxela'}},
{'prefix': {'fields.department': 'alexander'}}, {'prefix': {'reverse_fields.department': 'rednaxela'}},
{'prefix': {'fields.email.profile': 'alexander'}}, {'prefix':
{
'reverse_fields.email.profile': 'rednaxela'}}
, {'prefix': {'fields.email.user_name': 'alexander'}},
{'prefix': {'reverse_fields.email.user_name': 'rednaxela'}},
{'prefix': {'fields.email.user_id': 'Alexander'}},
{'prefix': {'reverse_fields.email.user_id': 'rednaxelA'}},
{'prefix': {'fields.website': 'alexander'}}, {'prefix': {'reverse_fields.website': 'rednaxela'}}, {
'prefix': {'fields.description': 'alexander'}}, {'prefix': {'reverse_fields.description': 'rednaxela'}},
{
'prefix': {'fields.accountNumber': 'alexander'}},
{'prefix': {'reverse_fields.accountNumber': 'rednaxela'}}, {
'prefix': {'fields.assistant': 'alexander'}}, {'prefix': {'reverse_fields.assistant': 'rednaxela'}}, {
'prefix': {'fields.phone': 'alexander'}}, {'prefix': {'reverse_fields.phone': 'rednaxela'}}, {
'prefix': {'fields.facebook.profile': 'alexander'}},
{'prefix': {'reverse_fields.facebook.profile': 'rednaxela'}}, {
'prefix': {'fields.facebook.user_name': 'alexander'}}, {
'prefix': {'reverse_fields.facebook.user_name': 'rednaxela'}}, {
'prefix': {'fields.facebook.user_id': 'Alexander'}},
{'prefix': {'reverse_fields.facebook.user_id': 'rednaxelA'}}, {
'prefix': {'fields.leadType': 'alexander'}}, {'prefix': {'reverse_fields.leadType': 'rednaxela'}}, {
'prefix': {'fields.dates': 'alexander'}}, {'prefix': {'reverse_fields.dates': 'rednaxela'}}, {
'prefix': {'fields.name': 'alexander'}}, {'prefix': {'reverse_fields.name': 'rednaxela'}}, {
'prefix': {'fields.country': 'alexander'}}, {'prefix': {'reverse_fields.country': 'rednaxela'}}, {
'prefix': {'fields.assistantPhone': 'alexander'}}, {
'prefix': {'reverse_fields.assistantPhone': 'rednaxela'}}]}]}}, 'query': {'bool': {'must': [{
'dis_max': {'tie_breaker': 0.7,
'queries': [{'constant_score': {'filter': {'term': {'fields.first_name': 'alexander'}}, 'boost': 20.0}},
{'constant_score': {'filter': {'prefix': {'fields.first_name': 'alexander'}}, 'boost': 11.0}}, {
'constant_score': {'filter': {'prefix': {'reverse_fields.first_name': 'rednaxela'}},
'boost': 11.0}},
{'constant_score': {'filter': {'term': {'fields.last_name': 'alexander'}}, 'boost': 40.0}},
{'constant_score': {'filter': {'prefix': {'fields.last_name': 'alexander'}}, 'boost': 17.0}}, {
'constant_score': {'filter': {'prefix': {'reverse_fields.last_name': 'rednaxela'}},
'boost': 17.0}},
{'constant_score': {'filter': {'term': {'fields.name': 'alexander'}}, 'boost': 18.0}},
{'constant_score': {'filter': {'prefix': {'fields.name': 'alexander'}}, 'boost': 5.0}},
{'constant_score': {'filter': {'prefix': {'reverse_fields.name': 'rednaxela'}}, 'boost': 5.0}},
{'match_all': {}}]}}], 'should': []}}}}, 'explain': False}

Looking for any help :)

-- Michael Korbakov


(Clinton Gormley) #2

Hi Michael

            {'prefix': {'fields.twitter.profile': 'alexander'}},
            {'prefix': {'reverse_fields.twitter.profile': 'rednaxela'}},
            {'prefix': {'fields.twitter.user_name': 'alexander'}},
            {'prefix': {'reverse_fields.twitter.user_name': 'rednaxela'}},

Prefix filters can be expensive - first they have to find all terms that
begin with your prefix, then add separate clauses for each of those
terms.
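
As a rough illustration of that cost, the sketch below mimics prefix expansion over a sorted term dictionary in plain Python. It is a simplified model, not Lucene's actual implementation; the term list and the function name are made up for the example.

```python
from bisect import bisect_left


def expand_prefix(sorted_terms, prefix):
    """Return every term in a sorted term list that starts with `prefix`."""
    # All terms sharing a prefix are contiguous in sorted order, so seek to
    # the first candidate and walk forward until the prefix stops matching.
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


# Hypothetical terms indexed for one field.
terms = sorted(["alex", "alexander", "alexandra", "alice", "bob"])

# One clause per expanded term; a short prefix on a large field yields many.
clauses = [{"term": {"fields.last_name": t}} for t in expand_prefix(terms, "alex")]
print(len(clauses))
```

With roughly a hundred prefix filters in one request, this expansion is repeated for every field, which is where the time goes.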

I think what you're looking for would be more efficiently achieved using
edge ngrams

http://www.elasticsearch.org/guide/reference/index-modules/analysis/edgengram-tokenizer.html

I've gisted an example of how you would create your index, and search
for your data:

https://gist.github.com/1037563

clint


(Michael Korbakov) #3

Thank you!

We're going to try it today. I'm a little concerned about general
filter/query performance. I was under the impression that any query would
be slower than any filter. I guess that isn't the case here :). BTW, is
it beneficial to wrap these query_string queries in query filters?

-- Michael Korbakov

On Tue, Jun 21, 2011 at 3:07 AM, Clinton Gormley wrote:

> I've gisted an example of how you would create your index, and search
> for your data:
>
> https://gist.github.com/1037563


(Clinton Gormley) #4

Hiya

> We're going to try it today. I'm a little concerned about general
> filter/query performance. I was under the impression that any query would
> be slower than any filter. I guess that isn't the case here :).

Well, queries have a second phase that filters don't have: calculating
the score/relevance, so yes, they generally don't perform as well.
Also, filters can be cached, while queries can't.

> BTW, is
> it beneficial to wrap these query_string into query filters?

Not sure what you mean here.

One thing I should have thought of yesterday was that this might not do
exactly what you want. In your example, you were searching for
'alexander' in its entirety.

Because I set the 'analyzer' for those fields to use edge-ngrams, it
uses them both at index time and at search time.

So a search for 'alexander' against the twitter profile field actually
becomes a search for a|al|ale|alex|etc
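
A small sketch of what that means in practice, assuming a front edge n-gram analyzer with a minimum gram length of 1 and that the search terms are combined with OR. The function is a plain-Python stand-in, not the Elasticsearch tokenizer itself.

```python
def edge_ngrams(term, min_gram=1, max_gram=20):
    """Return the leading substrings of `term`, shortest first."""
    upper = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, upper + 1)]


# Index time: the stored value is expanded into its prefixes.
indexed = set(edge_ngrams("alan"))

# Search time with the SAME analyzer: the search text is expanded too.
searched = set(edge_ngrams("alexander"))

# Any shared gram counts as a hit, so "alexander" would match "alan".
print(sorted(indexed & searched))
```

Analyzing only at index time avoids this: the search text stays whole, and it matches only documents whose grams include that exact string.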

Two options here:

  1. you can set the index_analyzer to "left"|"right" and the
    search_analyzer to "default"

  2. you can use a term filter, e.g.:
    { "and": [
        { "term": { "twitter.profile": "alexander" }},
        { "term": { "twitter.profile.reverse_profile": "alexander" }}
    ]}

Option (2) will be faster, but option (1) has the advantage that it
handles the analysis of the search term for you.
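
For option (1), a mapping along the following lines is one way it could look. This is a sketch only: the gist's actual definitions of the "left" and "right" analyzers are not reproduced in this thread, so the filter names, gram sizes, the keyword tokenizer and the "contact" type name are all assumptions. It uses the mapping syntax of Elasticsearch releases from that period (`index_analyzer`, `edgeNGram`).

```python
# Assumed analysis settings: "left" grams from the front, "right" from the back.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "left_ngrams": {"type": "edgeNGram", "side": "front",
                                "min_gram": 1, "max_gram": 20},
                "right_ngrams": {"type": "edgeNGram", "side": "back",
                                 "min_gram": 1, "max_gram": 20},
            },
            "analyzer": {
                "left": {"type": "custom", "tokenizer": "keyword",
                         "filter": ["lowercase", "left_ngrams"]},
                "right": {"type": "custom", "tokenizer": "keyword",
                          "filter": ["lowercase", "right_ngrams"]},
            },
        }
    },
    "mappings": {
        "contact": {  # hypothetical type name
            "properties": {
                "last_name": {
                    "type": "string",
                    # n-grams are produced at index time only ...
                    "index_analyzer": "left",
                    # ... while the search text is analyzed normally.
                    "search_analyzer": "default",
                }
            }
        }
    },
}
```

The point of the split is that the index holds every prefix of each value, while the search side contributes a single whole term to look up.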

For instance, "foo-bar" would actually be broken down into "foo" and "bar",
so if you do a term query for "foo-bar" it won't be found, because that
term doesn't exist in the index.

So if you are sure that, in your app, you are converting the query text
(alexander) into the correct terms that are stored in ES, then method 2
would be preferred. If not, then it may be better to rely on the query
instead.
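
The mismatch can be sketched in a few lines. The analyzer below is a crude stand-in (lowercase, then split on non-alphanumeric characters) and only approximates what a real analyzer does; the function names are hypothetical.

```python
import re


def simple_analyze(text):
    """Rough analyzer stand-in: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


# What ends up in the index for an analyzed field holding "Foo-Bar".
indexed_terms = set(simple_analyze("Foo-Bar"))


def term_lookup(value):
    # A term filter looks up the value exactly as given, with no analysis.
    return value in indexed_terms


def analyzed_lookup(text):
    # A query analyzes the text first, then looks up each resulting term.
    return all(tok in indexed_terms for tok in simple_analyze(text))


print(term_lookup("foo-bar"), analyzed_lookup("foo-bar"))
```

So a term filter is only safe when the application produces exactly the terms that were indexed.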

clint


(Michael Korbakov) #5

>> BTW, is
>> it beneficial to wrap these query_string into query filters?
>
> Not sure what you mean here.

I was meaning this filter:
http://www.elasticsearch.org/guide/reference/query-dsl/query-filter.html

> Two options here:
>
>   1. you can set the index_analyzer to "left"|"right" and the
>     search_analyzer to "default"
>
>   2. you can use a term filter eg:
>     { and: [
>     { term: { "twitter.profile": "alexander" }},
>     { term: { "twitter.profile.reverse_profile": "alexander" }}
>     ]}
>
> Option (2) will be faster, but option (1) has the advantage that it
> handles the analysis of the search term for you.

We're stopped on option (2). However, this prefix substitution hasn't shown any significant speed improvement. We're still getting timeouts. Now we are trying to combine all the fields we search on into a single one (a pseudo _all) and run the search against it. Hope that will help.
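
A minimal sketch of that "pseudo _all" idea, assuming the combined values are built in the application before indexing. The field names `search_all` and `reverse_search_all`, the helper names and the lowercasing step are all hypothetical.

```python
def flatten_values(value):
    """Yield every leaf value of a nested document as a lowercase string."""
    if isinstance(value, dict):
        for child in value.values():
            yield from flatten_values(child)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from flatten_values(item)
    elif value is not None:
        yield str(value).lower()


def build_search_fields(fields):
    """Collect all values into one multi-valued field plus a reversed copy."""
    values = list(flatten_values(fields))
    return {
        "search_all": values,
        "reverse_search_all": [v[::-1] for v in values],
    }


def contains_filter(text):
    # Two prefix filters replace the per-field pairs of the original query.
    return {"or": [
        {"prefix": {"search_all": text}},
        {"prefix": {"reverse_search_all": text[::-1]}},
    ]}


doc = {"first_name": "Alexander", "twitter": {"user_name": "alex"}, "zip": None}
print(build_search_fields(doc))
print(contains_filter("alexander"))
```

The trade-off is that the combined field no longer says which field matched, so per-field boosts like the ones in the original dis_max query would still need the individual fields.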


(Clinton Gormley) #6

On Wed, 2011-06-22 at 22:16 -0700, Michael Korbakov wrote:

> Clinton Gormley wrote:
>
>>> BTW, is
>>> it beneficial to wrap these query_string into query filters?
>>
>> Not sure what you mean here.
>
> I was meaning this filter:
> http://www.elasticsearch.org/guide/reference/query-dsl/query-filter.html

I don't know, to be honest. Not sure if wrapping a query in a
query-filter disables the _score calculation phase or not.
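
For reference, wrapping a query in a query filter looks roughly like this in the request body. It is a sketch of the shape only, mirroring the `and`/`filters` layout of the query in the first post; the helper name is hypothetical, and whether the wrapper skips scoring is exactly the open question above.

```python
def as_query_filter(query):
    """Wrap a query so it can be used where a filter is expected."""
    return {"query": query}


search_body = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {
                "and": {
                    "filters": [
                        {"term": {"is_visible": True}},
                        # The query_string query, used in filter position.
                        as_query_filter({"query_string": {"query": "alexander"}}),
                    ]
                }
            },
        }
    }
}
```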

>>   2. you can use a term filter eg:
>>     { and: [
>>     { term: { "twitter.profile": "alexander" }},
>>     { term: { "twitter.profile.reverse_profile": "alexander" }}
>>     ]}
>>
>> Option (2) will be faster, but option (1) has the advantage that it
>> handles the analysis of the search term for you.
>
> We're stopped on option (2). However, this prefix substitution hasn't
> shown any significant speed improvement. We're still getting timeouts. Now
> we are trying to combine all the fields we search on into a single one
> (a pseudo _all) and run the search against it. Hope that will help.

By "stopped" do you mean you're using option 2, or you have decided
against using option 2?

Filters should be fast, even if there are many of them. However, you
need to have enough memory to hold all of the terms, and the initial
query will be slow as it needs to load all of those terms the first
time.

After the first run, it should be significantly faster.

clint

