Elasticsearch top_hits performance using shingle filter
We're using Elasticsearch to return distinct search term suggestions from roughly a dozen different fields across a fairly large data set. To accomplish this, we currently use 'terms' and 'top_hits' aggregations, with the terms aggregation restricted by a wildcard-style 'include' pattern. We also use a shingle filter (min_shingle_size: 2, max_shingle_size: 3) in a custom analyzer, since one of the project requirements is to return suggestions for multi-word search terms.
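To make the shingle setup concrete, here is a rough sketch of the tokens such an analyzer would produce for one value. It only emulates a whitespace tokenizer, lowercasing, and shingles of 2 to 3 words; the function name `shingles` is made up for illustration, and it assumes the filter's `output_unigrams` option is left at its default of true, so single words are emitted as well. The real filter may order or position tokens differently.

```javascript
// Rough emulation of: whitespace tokenizer -> lowercase -> shingle filter.
// `shingles` is a hypothetical helper, not part of any Elasticsearch client.
// outputUnigrams mirrors the filter's output_unigrams option (assumed default: true).
function shingles(text, minSize = 2, maxSize = 3, outputUnigrams = true) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const tokens = [];
  for (let start = 0; start < words.length; start++) {
    if (outputUnigrams) tokens.push(words[start]);
    for (let size = minSize; size <= maxSize; size++) {
      // Stop once the shingle would run past the last word.
      if (start + size > words.length) break;
      tokens.push(words.slice(start, start + size).join(' '));
    }
  }
  return tokens;
}

console.log(shingles('Red Leather Sofa'));
// [ 'red', 'red leather', 'red leather sofa', 'leather', 'leather sofa', 'sofa' ]
```

Every field value therefore contributes several terms to the index, and a terms aggregation over that field has to consider all of them.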
I've tried several different approaches, none of which performs well enough.
Approach 1 - Suggestion Criteria in _all
All criteria on which we want to return suggestions are copied into the _all field, which uses a custom analyzer with the shingle filter:
    'settings' : {
        'analysis' : {
            'analyzer' : {
                'autocomplete_analyzer' : {
                    'type' : 'custom',
                    'tokenizer' : 'suggestion_tokenizer',
                    'filter' : [
                        'lowercase',
                        'shingle_filter'
                    ]
                }
            },
            'tokenizer' : {
                'suggestion_tokenizer' : {
                    'type' : 'whitespace'
                }
            },
            'filter' : {
                'shingle_filter' : {
                    'type' : 'shingle',
                    'min_shingle_size' : 2,
                    'max_shingle_size' : 3
                }
            }
        }
    },
    'mappings' : {
        'core' : {
            '_all' : {
                'enabled' : 'yes',
                'index' : 'analyzed',
                'analyzer' : 'autocomplete_analyzer'
            },
            'properties' : {
                'suggestion_criteria_1': {
                    'type' : 'multi_field',
                    'fields' : {
                        'analyzed' : {
                            'type' : 'string',
                            'index' : 'analyzed'
                        },
                        'suggestion_criteria_1': {
                            'type' : 'string',
                            'index' : 'not_analyzed',
                            'include_in_all' : 'yes'
                        }
                    }
                },...
                'filter_criteria_1': {
                    'type' : 'string',
                    'include_in_all' : 'no',
                    'index' : 'not_analyzed'
                },...
            }
        }
    }
The query/aggregation uses filters and passes the array of suggestion criteria to top_hits, since we need to know which field each suggestion match came from:
    {
        'from' : 0,
        'size' : 0,
        'query' : {
            'filtered' : {
                'filter' : {
                    'and' : [
                        {search filter array / optional}
                    ]
                }
            }
        },
        'aggs' : {
            'suggestions' : {
                'terms' : {
                    'field' : '_all',
                    'include' : '.*{search_term}.*'
                },
                'aggs' : {
                    'field_matches' : {
                        'top_hits' : {
                            '_source' : {
                                'include' : {criteria_array}
                            },
                            'size' : 1
                        }
                    }
                }
            }
        }
    };
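For reference, here is a minimal sketch of how the '.*{search_term}.*' pattern might be built from raw user input. The helper name `buildIncludePattern` and the list of escaped characters are assumptions for illustration (the list is meant to cover the operators of Lucene's regular expression syntax, which the 'include' option uses). The term is lowercased and its whitespace collapsed because the analyzer above lowercases tokens and, as far as I know, joins shingles with a single space.

```javascript
// Characters assumed to act as operators in Lucene regular expressions.
const LUCENE_REGEX_SPECIALS = /[.?+*|{}\[\]()"\\#@&<>~]/g;

// Hypothetical helper: turns what the user typed into an 'include' pattern.
function buildIncludePattern(searchTerm) {
  // Match the analyzer: lowercase, single spaces between words.
  const normalized = searchTerm.trim().toLowerCase().replace(/\s+/g, ' ');
  // Escape operators so the input is matched literally.
  const escaped = normalized.replace(LUCENE_REGEX_SPECIALS, '\\$&');
  return '.*' + escaped + '.*';
}

console.log(buildIncludePattern('  Red  Leather '));
// .*red leather.*
```

Without the escaping, input such as "c++" would be interpreted as regex operators rather than literal text.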
After the filters are applied, we're dealing with a set of about 100k documents, and the result comes back in over 500ms, which is far longer than ideal given search suggestions need to occur on every keystroke.
Approach 2 - Include All Suggestion Criteria in Aggregation / Drop _all
For brevity, I'll just describe the changes to the index structure and query/aggregation above.
I disabled the _all field and instead applied the "autocomplete_analyzer" (which includes the shingle_filter) to each of the suggestion criteria themselves (of which there are about a dozen) in the mapping.
All suggestion terms were then added to the query/aggregation...
        'aggs' : {
            'suggestion_term_1' : {
                'terms' : {
                    'field' : 'suggestion_term_1',
                    'include' : '.*{search_term}.*'
                },
                'aggs' : {
                    'field_matches' : {
                        'top_hits' : {
                            '_source' : {
                                'include' : 'suggestion_term_1'
                            },
                            'size' : 1
                        }
                    }
                }
            },
            'suggestion_term_2' : {
                'terms' : {
                    'field' : 'suggestion_term_2',
                    'include' : '.*{search_term}.*'
                },
                'aggs' : {
                    'field_matches' : {
                        'top_hits' : {
                            '_source' : {
                                'include' : 'suggestion_term_2'
                            },
                            'size' : 1
                        }
                    }
                }
            },
            etc...
        }
    };
This also takes over 500ms once the filters are applied. Still not ideal.
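Since the per-field aggregations above differ only in the field name, they can be generated rather than written out by hand. The following is a sketch only; `buildSuggestionAggs` is a made-up helper name, and the field names passed in are placeholders following the naming used above.

```javascript
// Hypothetical helper: builds one terms + top_hits aggregation per suggestion field.
function buildSuggestionAggs(fields, includePattern) {
  const aggs = {};
  for (const field of fields) {
    aggs[field] = {
      terms: { field: field, include: includePattern },
      aggs: {
        field_matches: {
          // Only the matched field is returned from _source, one hit per bucket.
          top_hits: { _source: { include: field }, size: 1 }
        }
      }
    };
  }
  return aggs;
}

const aggs = buildSuggestionAggs(
  ['suggestion_term_1', 'suggestion_term_2'],
  '.*red leather.*'
);
console.log(JSON.stringify(aggs, null, 2));
```

The same function called with a single-element array produces the aggregation for one field on its own.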
Approach 3 - Perform Multiple Elasticsearch Queries - Iterating Over Criteria
This is similar to approach 2, but instead of including all suggestion terms in the aggregation, I include only one per request. I then iterate over the suggestion terms and send a separate Elasticsearch aggregation request for each of the dozen or so criteria.
Most individual requests came back in 20-30ms, but summed over the dozen or so sequential requests, the total request time is still 300-400ms or more.
### Edge nGrams ###
I should note that as an alternative to using the wildcard search term, I tried to apply an Edge nGram filter to the analyzer as well. That, however, typically increased the total response time by 50-70% and ballooned the index size for no apparent performance benefit, so I opted to stick with the wildcard approach.
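As a rough illustration of why the index grows, the sketch below lists the prefixes an edge n-gram filter would emit for a single token. The helper name `edgeNGrams` and the gram sizes (1 to 20) are assumptions chosen for the example, not the settings actually used.

```javascript
// Hypothetical helper: prefixes of a token from minGram to maxGram characters.
// The default sizes are illustrative assumptions, not the real filter settings.
function edgeNGrams(token, minGram = 1, maxGram = 20) {
  const grams = [];
  const limit = Math.min(maxGram, token.length);
  for (let len = minGram; len <= limit; len++) {
    grams.push(token.slice(0, len));
  }
  return grams;
}

console.log(edgeNGrams('sofa'));
// [ 's', 'so', 'sof', 'sofa' ]
console.log(edgeNGrams('red leather sofa').length);
// 16
```

With these example sizes, every three-word shingle like the one above would be stored as 16 separate terms instead of one.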
### Removing Shingle Filter ###
I should also note that I see dramatic performance improvements when I remove the shingle filter, but unfortunately multi-word queries are a requirement for the project.
I suspect there may be an approach or two that I've not yet tried that would significantly improve response times, but at this point I'm basically grasping at straws. Any suggestions would be greatly appreciated.