What is the right approach for an edge_ngram search query to get a correct word slop count?

I'm trying to implement a soft prefix search across n fields, where the distance between tokens also matters, so I've decided to use edge_ngrams with a bool query. But since the tokens are edge_ngrams, the slop is calculated the same way: in ngrams instead of words.

Initial conditions:

  • Index settings PUT http://localhost:9200/test
{
  "mappings": {
    "properties": {
      "someField": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      },
      "anotherField": {
        "type": "text"
      }
    }
  },
  "settings": {
    "number_of_shards": "1",
    "number_of_replicas": "1",
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  }
}
  • Sample document POST http://localhost:9200/test/_create/1
{
  "someField": "one two three four five six seven eight nine ten eleven",
  "anotherField": "one two three four five six seven eight nine ten eleven"
}
  • Search request POST http://localhost:9200/test/_search?typed_keys=true
{
  "highlight": {
    "fields": {
      "someField": {},
      "anotherField": {}
    }
  },
  "query": {
    "bool": {
      "must": {
        "dis_max": {
          "tie_breaker": 0.9,
          "queries": [
            {
              "match_phrase": {
                "someField": {
                  "query": "thre elev",
                  "slop": 24
                }
              }
            },
            {
              "match_phrase": {
                "anotherField": {
                  "query": "thre elev",
                  "slop": 24
                }
              }
            }
          ]
        }
      },
      "filter": [
        //  my custom filters...
      ]
    }
  }
}

My expectations:

  1. While searching for "thre elev" I should find the given document (that's OK).
  2. The matches should exist on both the someField and anotherField fields (currently the match happens on someField only, because anotherField does not use the autocomplete analyzers; a possible mapping fix is sketched after this list). That's not OK.
  3. There are 7 words between "three" and "eleven", but the edge_ngram tokenization inflates the distance, so the real slop is higher and unpredictable (that's also not OK).
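If I understand the analysis chain correctly, expectation 2 could be addressed by giving anotherField the same analyzer pair as someField. A minimal, untested sketch of the mapping change (note that changing an index-time analyzer requires recreating and reindexing the index): PUT http://localhost:9200/test

{
  "mappings": {
    "properties": {
      "anotherField": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}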

Please note that I use a slop of 24. That's because a request with a lower slop returns no hits. I understand that, due to the tokenizer settings, the distance between these words is counted in ngrams rather than words.
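The positions can be inspected with the _analyze API. For instance, POST http://localhost:9200/test/_analyze

{
  "analyzer": "autocomplete",
  "text": "three"
}

returns a separate token for each ngram (th, thr, thre, three), each with its own position, so a single word already occupies several positions and the phrase slop is consumed by ngrams rather than words.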

I suspect that this way of searching (using a dis_max of match_phrase queries) is the wrong approach for my kind of search, but I don't have the expertise to find a proper solution.

Can anything be done about this? P.S. I also want to add fuzziness to the query, but match_phrase does not support it... :frowning:
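One direction that might at least allow fuzziness (an assumption on my part, not something I have verified) is a match_bool_prefix query against the plain anotherField, which treats the last term as a prefix and accepts a fuzziness parameter for the preceding terms: POST http://localhost:9200/test/_search

{
  "query": {
    "match_bool_prefix": {
      "anotherField": {
        "query": "thre elev",
        "fuzziness": "AUTO"
      }
    }
  }
}

As far as I can tell, though, match_bool_prefix does not enforce word order or distance, so it would drop the slop requirement that matters to me.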
