Scoring autocomplete (edgeNGram) results


(Andy Lin) #1

I have a name field where I am applying the following filter and analyzer:

"filter": {
    "autocomplete_filter": {
        "min_gram": "1",
        "type": "edgeNGram",
        "max_gram": "50"
    },
    ...
}
...
"analyzer": {
    "autocomplete": {
        "type": "custom",
        "filter": [
            "lowercase",
            "autocomplete_filter"
        ],
        "tokenizer": "standard"
    },
    ...
}

Now, I have a name field with an index analyzer of autocomplete. When I store the name field with a value of, say "abcdef", the autocomplete analyzer will store them as tokens of "a", "ab", "abc"... "abcdef".
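As a rough illustration of what the filter above produces, here is a small Python sketch that emulates edge n-gram expansion for a single token. It is not Elasticsearch code: `edge_ngrams` is a hypothetical helper, and it assumes the standard tokenizer and lowercase filter have already reduced the value to one lowercase token.

```python
def edge_ngrams(token, min_gram=1, max_gram=50):
    """Return the prefixes of `token` that are min_gram..max_gram characters long."""
    token = token.lower()  # mirrors the "lowercase" filter in the analyzer chain
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


# Every stored value that starts with "abc" ends up containing the token "abc".
for value in ("abc", "abcd", "abcde"):
    print(value, "->", edge_ngrams(value))
```

Since each of the three values yields the token "abc", a search for "abc" matches all of them through that same token.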

If I have documents with values "abc", "abcd", "abcde", they can all be found.

However, I want to score results so that a search for "abc" ranks the document whose source value is "abc" above "abcd" and "abcde". As it stands, all other things being equal, all three results get the same score.

Is there a way to structure my index analyzer or search analyzer so that I keep the benefit of autocomplete but can still influence the ranking of the search results?


(Nik Everett) #2

There are lots of ways! First, I should point out that you should look at the completion suggester for autocomplete. I don't know much about it other than that it was written for autocomplete. I use edgeNGram for it like you do.

OK, with that out of the way, there are a couple of things you could do:

  1. Do a bool query where the exact match and the ngram are both in should clauses. Boost the exact match.
  2. Add a function_score query that multiplies the score by some value derived from the length, something like 1/length or 1/(length + 1). The longer matches would get sorted to the end. Mostly.
  3. Use some external popularity metric.
  4. Switch from sort: relevance to sort: some_custom_value.

We use 1 and 3. You can see that by going to en.wikipedia.org and typing "a" into the search box on the upper right. "A", the exact match, is first. "Australia" is second because it has the most incoming links. It's nowhere near perfect but it gets the job done.


(Andy Lin) #3

I was thinking of trying approach 2, but it seemed to introduce a lot of extra overhead (enabling dynamic scripting, etc.).

1 seems like a much better approach, I'll try that.

Thanks!


(Nik Everett) #4

If you are on a reasonably modern version of Elasticsearch you can use the Lucene expression language: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting.html

It's fully sandboxed and enabled by default.
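For approach 2, a function_score body using an expression script might be sketched as below. Expressions only read numeric-style fields, so this assumes a hypothetical numeric field `name_length` holding the name's character count, populated at index time. The exact script syntax (`source` versus older keys) varies between Elasticsearch versions, so treat the shape as an assumption to check against your version's docs.

```python
import json


def build_length_penalized_query(text):
    """Wrap the n-gram match in a function_score that scales the score by 1/length."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {"name": text}},
                "script_score": {
                    "script": {
                        "lang": "expression",
                        # "name_length" is an assumed numeric field, not in the original mapping.
                        "source": "1.0 / doc['name_length'].value",
                    }
                },
                # Multiply the original query score by the script's result.
                "boost_mode": "multiply",
            }
        }
    }


def length_factor(value):
    """Local stand-in for the script: the multiplier a name of this length would get."""
    return 1.0 / len(value)


print(json.dumps(build_length_penalized_query("abc"), indent=2))
for value in ("abc", "abcd", "abcde"):
    print(value, round(length_factor(value), 3))
```

The multiplier shrinks as the stored name gets longer, so with equal base scores "abc" would sort ahead of "abcd" and "abcde".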


(Andy Lin) #5

Thanks, will take a look!

