Here's the problem: for the search terms 'cats' and 'cat', I am trying to get similar results, i.e. 'cats' should search for 'cat' internally. To solve this, the minimal_english stemmer
seems like a good choice, and I am using it as part of my other settings (see below):
$params['body']['settings'] = [
    'analysis' => [
        'analyzer' => [
            'shingle_analyzer' => [
                'tokenizer' => 'standard',
                'filter' => ['standard', 'lowercase', 'filter_stop', 'filter_shingle']
            ],
            'ngram_analyzer' => [
                'tokenizer' => 'ngram_tokenizer',
                'filter' => ['standard', 'lowercase']
            ],
            'stemmer_analyzer' => [
                'tokenizer' => 'standard',
                'filter' => ['standard', 'lowercase', 'filter_english_stemmer']
            ]
        ],
        'tokenizer' => [
            'ngram_tokenizer' => [
                'type' => 'edge_ngram',
                'min_gram' => 3,
                'max_gram' => 10,
                'token_chars' => ['letter', 'digit']
            ]
        ],
        'filter' => [
            'filter_stop' => [
                'type' => 'stop'
            ],
            'filter_shingle' => [
                'type' => 'shingle',
                'min_shingle_size' => 2,
                'max_shingle_size' => 3,
                'output_unigrams' => true,
                'filler_token' => ''
            ],
            'filter_english_stemmer' => [
                'type' => 'stemmer',
                'name' => 'minimal_english'
            ]
        ]
    ]
];
Here's the query being built:
"query" => [
    "bool" => [
        "must" => [
            "multi_match" => [
                "query" => $queryTerm,
                'type' => 'most_fields',
                'fields' => [
                    'animal.name.shinglefield',
                    'animal.name.ngramfield',
                ],
                'analyzer' => 'stemmer_analyzer'
            ]
        ],
    ]
],
I tried 3 different scenarios:
- filter_english_stemmer as part of the filter in shingle_analyzer. With this in place, a search for 'cats' still returns matches for 'cats' based on ngrams, which makes sense since ngram_analyzer doesn't have the filter_english_stemmer filter.
- To overcome the previous issue, I placed filter_english_stemmer as part of the ngram_analyzer filter. But doing so resulted in no matches for 'cats'. Again, this makes sense since there's no shingle/ngram for either 'cats' or 'ats'.
- As an alternative approach (and with the settings shared above), I used stemmer_analyzer as part of the multi_match query ('analyzer' => 'stemmer_analyzer'), and that query gave me similar results for 'cats' and 'cat'.
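To compare what each analyzer actually emits for 'cats' in the scenarios above, the _analyze API can be called once per analyzer. This is only a minimal sketch: the index name 'animals' is a placeholder (the real index name isn't shown here), and $client is assumed to be an elasticsearch-php client.

```php
<?php
// Hypothetical index name; replace with the index that holds the settings above.
$index = 'animals';

// Build one _analyze request per custom analyzer defined in the settings.
$requests = [];
foreach (['shingle_analyzer', 'ngram_analyzer', 'stemmer_analyzer'] as $analyzer) {
    $requests[$analyzer] = [
        'index' => $index,
        'body'  => [
            'analyzer' => $analyzer,
            'text'     => 'cats',
        ],
    ];
}

// Assuming $client is an elasticsearch-php client, each request could be sent as:
//   $response = $client->indices()->analyze($requests['stemmer_analyzer']);
//   $tokens   = array_column($response['tokens'], 'token');
print_r($requests);
```

Running the same requests with 'cat' as the text and comparing the token lists should show where the two terms start to diverge.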
Now, my question is: even though the last approach works, is it a good idea to use an analyzer at query time? Is there a better way?
Also, correct me if I am wrong, but does the stemmer_analyzer used in the multi_match query work because the query 'cats' is reduced to 'cat' and then the field-specific analyzers are run? Is that how it's working?
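One way to check which terms the query really searches for in each field is the validate API with explain enabled. A rough sketch follows, where the index name 'animals' and the sample term are placeholders and $client is again assumed to be an elasticsearch-php client.

```php
<?php
// Hypothetical values; the index name and sample term are placeholders.
$index     = 'animals';
$queryTerm = 'cats';

// Same multi_match query as above, wrapped for the validate API.
// 'explain' => true asks Elasticsearch to describe the query it would run.
$params = [
    'index'   => $index,
    'explain' => true,
    'body'    => [
        'query' => [
            'bool' => [
                'must' => [
                    'multi_match' => [
                        'query'    => $queryTerm,
                        'type'     => 'most_fields',
                        'fields'   => [
                            'animal.name.shinglefield',
                            'animal.name.ngramfield',
                        ],
                        'analyzer' => 'stemmer_analyzer',
                    ],
                ],
            ],
        ],
    ],
];

// Assuming $client is an elasticsearch-php client:
//   $response = $client->indices()->validateQuery($params);
//   print_r($response['explanations']);
```

The explanation string in the response should list the per-field terms, which would show whether only 'cat' is looked up or whether further field-level analysis is applied to it.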