Why does a range filter in Elasticsearch take much more CPU than a full-text search?


#1

I have a filtered query.

In the first case I use full-text search:

{ 'query': {'filtered': {'filter': {'bool': {'must': [{'term': {'status': 4}}]}},
                    'query': {'bool': {'should': [{'match': {'name': {'operator': 'and',
                                                                      'query': 'dog'}}},
                                                  {'match': {'description': {'boost': 0.9,
                                                                             'operator': 'and',
                                                                             'query': 'dog'}}},
                                                  {'match': {'author': {'boost': 0.8,
                                                                        'operator': 'and',
                                                                        'query': 'dog'}}},
                                                  {'match': {'tags': {'boost': 0.7,
                                                                      'operator': 'and',
                                                                      'query': 'dog'}}}]}}}}}

In the second case I use a range filter:

{'query': {'filtered': {'filter': {'bool': {'must': [{'term': {'status': 4}},
                                                 {'range': {'year_to': {'lte': '1946'}}}]}}}}

I was very surprised, because the second request takes twice as much CPU as the first one.

What is going on?

My mapping:

"properties": {
    "id": {
        "type": "integer"
    },
    "name": {
        "analyzer": "russian_morphology",
        "type": "string"
    },
    "description": {
        "analyzer": "russian_morphology",
        "type": "string"
    },
    "status": {
        "type": "integer"
    },
    "tags": {
        "analyzer": "russian_morphology",
        "type": "string"
    },
    "year_from": {
        "type": "integer"
    },
    "year_to": {
        "type": "integer"
    }
}

(Nik Everett) #2

The first query is ultimately translated into a boolean combination of 5 term queries which can use the terms dictionary to jump directly to the documents that they need. The term queries are fast and the bool queries are fast.

The second query is ultimately translated into a term query (fast again) and a numeric range query. The numeric range query has to walk the terms dictionary to find its matches. Part of the work that it does is proportional to the number of distinct values less than or equal to 1946. I don't know if that is the term that dominates the runtime; it could be that there are lots of hits lte 1946. You could certainly try taking stack traces to figure out what exactly is up here: just spam the query with ab and then use jstack and look for stuff like TermRangeFilter (or TermRangeQuery post 2.0). But if you are just looking for an intuitive explanation of why complex-looking queries can be faster, what I wrote above might be good enough.
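To make the intuition above concrete, here is a toy sketch in Python. It is not how Lucene is implemented (real numeric fields use a trie encoding, and the data here is made up); it only shows that a term lookup touches one entry of the terms dictionary, while a range has to visit every distinct term in the range and merge all of their posting lists.

```python
import bisect

def build_index(values):
    """Toy inverted index: sorted distinct terms plus a posting list per term."""
    postings = {}
    for doc_id, value in enumerate(values):
        postings.setdefault(value, []).append(doc_id)
    return sorted(postings), postings

def term_lookup(postings, value):
    """Exact match: one dictionary seek, one posting list."""
    return postings.get(value, []), 1

def range_lte(terms, postings, upper):
    """Range match: visit every distinct term <= upper and merge the postings."""
    end = bisect.bisect_right(terms, upper)
    hits = []
    for term in terms[:end]:
        hits.extend(postings[term])
    return sorted(hits), end

# Made-up data: 60,000 docs with years spread evenly over 1900..1999.
years = [1900 + (i % 100) for i in range(60000)]
terms, postings = build_index(years)

term_hits, terms_visited_by_term = term_lookup(postings, 1946)
range_hits, terms_visited_by_range = range_lte(terms, postings, 1946)
```

With this made-up distribution the range visits 47 distinct terms and has to collect close to half of the index, so either the number of terms or the number of hits could plausibly be the expensive part.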


#3

Thanks for the reply.

So it seems {'range': {'lte': 1946}} is transformed into a terms filter with about 80 values.

Is it possible to rewrite the second query to make it faster? I tried the "numeric_range" filter, but performance seems the same.


(Nik Everett) #4

I suppose it depends on your mapping - numeric_range should automatically kick in for numbers. There is a precision_step you can play with but I don't know much about it. What kind of performance are you seeing - like how long is the query taking and how many documents is it hitting? Beyond that I'm not sure I can be much help. Usually this is where I'd break out ab and jstack to figure out what is going on.
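For reference, a sketch of what playing with precision_step could look like, written as a Python dict like the queries above. The value 4 is purely illustrative, not a recommendation, and this assumes an older Elasticsearch version where numeric fields still accept precision_step. A smaller step generally means more terms in the index and fewer terms visited per range query, but with only about 80 distinct values it may make little difference. A change like this would presumably require reindexing.

```python
# Hypothetical mapping tweak for the two year fields.
# precision_step=4 is an illustrative value, not a measured optimum.
year_mapping = {
    'properties': {
        'year_from': {'type': 'integer', 'precision_step': 4},
        'year_to': {'type': 'integer', 'precision_step': 4},
    }
}
```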


#5

The mapping for this field is integer.
The query takes 20 ms.
I have 60k documents in the index, and I get 20 documents in the response. The "total" attribute in the response is 35k. Unfortunately, I don't know much about Java and jstack.
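For readers who are also new to jstack, here is a minimal Python sketch of the "take stack traces while the query is running" idea. The helper names dump_threads and hot_classes are hypothetical, and the sketch assumes the jstack tool from the JDK is on the PATH and that you know the process ID of the Elasticsearch node. The idea is to run the query in a loop from another terminal, take a few dumps, and see which classes show up most often.

```python
import collections
import re
import subprocess

# Matches stack frame lines such as "at some.package.Class.method(File.java:1)".
# Frames with a module prefix (newer JVMs) are simply skipped by this pattern.
FRAME = re.compile(r"^\s*at\s+([\w.$]+)\.[\w$<>]+\(")

def dump_threads(pid):
    """Run `jstack <pid>` and return the thread dump as text."""
    result = subprocess.run(['jstack', str(pid)],
                            capture_output=True, text=True, check=True)
    return result.stdout

def hot_classes(dump_text, limit=10):
    """Count how often each class appears in stack frames of a thread dump."""
    counts = collections.Counter()
    for line in dump_text.splitlines():
        match = FRAME.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(limit)
```

Calling hot_classes(dump_threads(pid)) a few times under load gives a rough picture; if range-related Lucene classes dominate the list, the range filter is likely where the CPU goes.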

