Speed up filtered multi_match bool phrase query


#1

Hi guys,

I'm trying to speed up a complex query and currently I have the following:

{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "should": [
            { "multi_match": { "query": "term1", "type": "phrase", "fields": [ "field1", "field2", "field3", "field4", "field5", "field6" ] } },
            { "multi_match": { "query": "term2", "type": "phrase", "fields": [ "field1", "field2", "field3", "field4", "field5", "field6" ] } },
            { "multi_match": { "query": "term3", "type": "phrase", "fields": [ "field1", "field2", "field3", "field4", "field5", "field6" ] } },
            { "multi_match": { "query": "term4", "type": "phrase", "fields": [ "field1", "field2", "field3", "field4", "field5", "field6" ] } },
            { "multi_match": { "query": "term5", "type": "phrase", "fields": [ "field1", "field2", "field3", "field4", "field5", "field6" ] } },
            { "multi_match": { "query": "term6", "type": "phrase", "fields": [ "field1", "field2", "field3", "field4", "field5", "field6" ] } },
            { "multi_match": { "query": "term7", "type": "phrase", "fields": [ "field1", "field2", "field3", "field4", "field5", "field6" ] } }
          ]
        }
      },
      "filter": {
        "range": {
          "createDate": {
            "gte": "2015-01-01 00:00:01",
            "lte": "2015-01-05 23:59:59",
            "format": "yyyy-MM-dd HH:mm:ss"
          }
        }
      }
    }
  }
}

It's really slow. I have 15 million documents, and field1 through field6 are all analyzed.

Note that if I run the following:
{
  "query": {
    "multi_match": {
      "query": "term",
      "type": "phrase",
      "fields": [ "field1", "field2", "field3", "field4", "field5", "field6" ]
    }
  },
  "filter": {
    "range": {
      "createDate": {
        "gte": "2015-01-01 00:00:01",
        "lte": "2015-01-05 23:59:59",
        "format": "yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}

it's quite fast. I know that the two queries are not the same, but I have to search several analyzed fields for several keywords. How can I speed up the first query?
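For a rough sense of the difference in work between the two queries: as far as I understand it, a `multi_match` of type `phrase` runs one `match_phrase` per listed field, so the number of phrase clauses grows as keywords times fields. Below is a minimal Python sketch of that arithmetic. The helper name `count_phrase_clauses` is made up for illustration and is not part of any Elasticsearch client.

```python
def count_phrase_clauses(query_body):
    """Count the per-field phrase clauses implied by a filtered bool/should query.

    Assumption: each multi_match of type "phrase" expands to one
    match_phrase per entry in its "fields" list.
    """
    should = query_body["query"]["filtered"]["query"]["bool"]["should"]
    return sum(
        len(clause["multi_match"]["fields"])
        for clause in should
        if clause.get("multi_match", {}).get("type") == "phrase"
    )


fields = ["field1", "field2", "field3", "field4", "field5", "field6"]

# Same shape as the slow query: seven phrase keywords over six fields.
first_query = {
    "query": {
        "filtered": {
            "query": {
                "bool": {
                    "should": [
                        {"multi_match": {"query": "term%d" % n,
                                         "type": "phrase",
                                         "fields": fields}}
                        for n in range(1, 8)
                    ]
                }
            },
            "filter": {
                "range": {
                    "createDate": {
                        "gte": "2015-01-01 00:00:01",
                        "lte": "2015-01-05 23:59:59",
                        "format": "yyyy-MM-dd HH:mm:ss",
                    }
                }
            },
        }
    }
}

print(count_phrase_clauses(first_query))  # 7 keywords x 6 fields = 42
```

The single-keyword query, by the same count, involves only 6 phrase clauses.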


(Tanguy) #2

Are you searching for only 1 term in every multi_match query?


#3

Well, I've started working out how to speed this up. It turns out that a single search has to cover both phrases and single words, so a typical query would look something like this:
{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "should": [
            { "multi_match": { "query": "phrase1", "type": "phrase", "fields": [ "field1", "field2", "field3", "field4", "field5", "field6" ] } },
            { "multi_match": { "query": "phrase2", "type": "phrase", "fields": [ "field1", "field2", "field3", "field4", "field5", "field6" ] } },
            { "multi_match": { "query": "word1 word2 word3", "type": "most_fields", "fields": [ "field1", "field2", "field3", "field4", "field5", "field6" ] } }
          ]
        }
      },
      "filter": {
        "range": {
          "createDate": {
            "gte": "2015-01-01 00:00:01",
            "lte": "2015-01-05 23:59:59",
            "format": "yyyy-MM-dd HH:mm:ss"
          }
        }
      }
    }
  }
}

The number of single words and phrases is dynamic, so in the worst case there can be a few dozen different phrases.
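Since the phrases and words vary per search, the request body is presumably assembled in code. Here is a minimal Python sketch of how that could look, assuming one `phrase` clause per phrase and a single `most_fields` clause for all the single words, as in the query above. The names `build_query` and `FIELDS` are hypothetical.

```python
import json

FIELDS = ["field1", "field2", "field3", "field4", "field5", "field6"]


def build_query(phrases, words, gte, lte, fields=None):
    """Build a filtered bool/should query from dynamic phrases and words."""
    fields = list(fields or FIELDS)
    # One phrase-type multi_match per phrase.
    should = [
        {"multi_match": {"query": phrase, "type": "phrase", "fields": fields}}
        for phrase in phrases
    ]
    # All single words share one most_fields multi_match.
    if words:
        should.append({
            "multi_match": {
                "query": " ".join(words),
                "type": "most_fields",
                "fields": fields,
            }
        })
    return {
        "query": {
            "filtered": {
                "query": {"bool": {"should": should}},
                "filter": {
                    "range": {
                        "createDate": {
                            "gte": gte,
                            "lte": lte,
                            "format": "yyyy-MM-dd HH:mm:ss",
                        }
                    }
                },
            }
        }
    }


body = build_query(
    ["phrase1", "phrase2"],
    ["word1", "word2", "word3"],
    "2015-01-01 00:00:01",
    "2015-01-05 23:59:59",
)
print(json.dumps(body, indent=2))
```

Serializing with `json.dumps` also rules out hand-written JSON slips such as a stray trailing comma or an unbalanced brace.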


#4

I made some improvements: I now index my data as single terms plus shingles (min 2, max 3), using the following settings and mapping:
{
  "myIndex": {
    "mappings": {
      "myType": {
        "properties": {
          "_id": { "type": "string", "store": true },
          "createDate": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
          "field1": { "type": "string", "index": "not_analyzed" },
          "field2": { "type": "string", "analyzer": "hu", "fields": { "shingles": { "type": "string", "analyzer": "hu" } } },
          "field3": { "type": "string", "analyzer": "hu", "fields": { "shingles": { "type": "string", "analyzer": "hu" } } },
          "field4": { "type": "string", "index": "not_analyzed" }
        }
      }
    },
    "settings": {
      "index": {
        "analysis": {
          "filter": {
            "hu_HU": { "type": "hunspell", "locale": "hu_HU", "language": "hu_HU" },
            "shingle": { "type": "shingle", "min_shingle_size": "2", "max_shingle_size": "3", "output_unigrams": "true" }
          },
          "analyzer": {
            "hu": { "filter": [ "lowercase", "hu_HU", "shingle" ], "tokenizer": "standard" }
          }
        }
      }
    }
  }
}
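To make the shingle settings concrete, here is a small Python sketch that approximates which terms a shingle filter with `min_shingle_size` 2, `max_shingle_size` 3 and `output_unigrams` true would produce for a list of tokens. It is only an approximation: it skips the `lowercase` and `hunspell` steps, ignores token positions, and assumes tokens are joined with a single space. The function name `shingles` is made up for illustration.

```python
def shingles(tokens, min_size=2, max_size=3, output_unigrams=True):
    """Approximate the terms a shingle token filter emits for a token list."""
    out = []
    for i, token in enumerate(tokens):
        if output_unigrams:
            out.append(token)  # the single term itself
        for size in range(min_size, max_size + 1):
            # Only emit a shingle if enough tokens remain from position i.
            if i + size <= len(tokens):
                out.append(" ".join(tokens[i:i + size]))
    return out


print(shingles(["quick", "brown", "fox"]))
# ['quick', 'quick brown', 'quick brown fox', 'brown', 'brown fox', 'fox']
```

Note that in the mapping as posted, the main fields and their `.shingles` sub-fields both use the same `hu` analyzer, so both would be indexed with the same set of terms.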

With this in place, and based on this article, I changed my query to the following:

{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "should": [
            { "multi_match": { "query": "my phrase1", "type": "most_fields", "fields": [ "field2.shingles", "field3.shingles" ] } },
            { "multi_match": { "query": "my phrase2", "type": "most_fields", "fields": [ "field2.shingles", "field3.shingles" ] } },
            { "multi_match": { "query": "word1 word2 word3", "type": "most_fields", "fields": [ "field2", "field3" ] } }
          ]
        }
      },
      "filter": {
        "range": {
          "createDate": {
            "gte": "2015-01-01 00:00:01",
            "lte": "2015-01-05 23:59:59",
            "format": "yyyy-MM-dd HH:mm:ss"
          }
        }
      }
    }
  }
}

My question is: can this be made more efficient? Does this query make sense from a query optimization standpoint?

