I'm trying to implement a soft prefix search across several fields, where the distance between tokens also matters, so I decided to use edge n-grams with a bool query. But since the indexed tokens are edge n-grams, the slop is calculated the same way: in n-grams instead of words.
Initial conditions:
- Index settings
PUT http://localhost:9200/test
{
  "mappings": {
    "properties": {
      "someField": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      },
      "anotherField": {
        "type": "text"
      }
    }
  },
  "settings": {
    "number_of_shards": "1",
    "number_of_replicas": "1",
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  }
}
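For reference, the grams produced at index time can be inspected with the _analyze API (an illustrative request, not part of my setup):

POST http://localhost:9200/test/_analyze
{
  "analyzer": "autocomplete",
  "text": "three"
}

Each edge n-gram (th, thr, thre, three) comes back at its own position, so positions in the index count grams rather than words.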
- Sample document
POST http://localhost:9200/test/_create/1
{
  "someField": "one two three four five six seven eight nine ten eleven",
  "anotherField": "one two three four five six seven eight nine ten eleven"
}
- Search request
POST http://localhost:9200/test/_search?typed_keys=true
{
  "highlight": {
    "fields": {
      "someField": {},
      "anotherField": {}
    }
  },
  "query": {
    "bool": {
      "must": {
        "dis_max": {
          "tie_breaker": 0.9,
          "queries": [
            {
              "match_phrase": {
                "someField": {
                  "query": "thre elev",
                  "slop": 24
                }
              }
            },
            {
              "match_phrase": {
                "anotherField": {
                  "query": "thre elev",
                  "slop": 24
                }
              }
            }
          ]
        }
      },
      "filter": [
        // my custom filters...
      ]
    }
  }
}
My expectations:
- While searching for "thre elev" I should find the given document (that's OK).
- The matches should exist in both someField and anotherField (not OK: the match is available in someField only, because of its search_analyzer setting).
- There are 7 words between three and eleven, but the edge_ngram tokenization affects the positions, so the real slop is higher and unpredictable (that's also not OK).
Please note that I use a slop of 24: a request with a lesser slop returns no hits. I understand that, due to the tokenizer settings, the distance between these words is counted in n-grams, not in words.
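The search side can be checked the same way (again an illustrative request, not part of my setup):

POST http://localhost:9200/test/_analyze
{
  "analyzer": "autocomplete_search",
  "text": "thre elev"
}

The lowercase tokenizer keeps "thre" and "elev" as whole tokens at adjacent positions, while at index time every gram occupies a position of its own, hence the inflated, input-dependent slop.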
I suspect that this way of searching (using dis_max over match_phrase queries) is the wrong approach for my type of search, but I don't have the expertise to find a proper solution. Can anything be done about this? P.S. I'd also like to add fuzziness to the query, but match_phrase doesn't support it...