Ngram and score in query


(weibin.wu) #1

Hi Elasticsearch.

When we use an ngram tokenizer, each token comes with a start_offset:
{
"token": "vi",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "iv",
"start_offset": 1,
"end_offset": 3,
"type": "word",
"position": 1
}

Can a token with a larger "start_offset" be given a lower score than one with a smaller "start_offset"?

For example, I have two documents:
{"text" : "vivo"}
{"text" : "ivov"}
Both are indexed with an ngram tokenizer (min_gram: 2, max_gram: 2).
When I search for "iv", can I expect {"text" : "ivov"} to score higher than {"text" : "vivo"} because its "start_offset" is smaller?
For now, I see that they get the same score.
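
To make the example concrete, here is a small Python sketch that imitates what a 2-gram tokenizer emits for the two documents. It is only an illustration, not Elasticsearch's implementation; the helper names `bigrams` and `first_offset` are made up for this post.

```python
def bigrams(text, n=2):
    """Return (token, start_offset, end_offset) for every n-gram of text."""
    return [(text[i:i + n], i, i + n) for i in range(len(text) - n + 1)]


def first_offset(text, query):
    """Return the start_offset of the first n-gram equal to query, or None."""
    for token, start, _end in bigrams(text, len(query)):
        if token == query:
            return start
    return None


for doc in ("vivo", "ivov"):
    # Both documents contain the token "iv" exactly once,
    # but at different start offsets.
    print(doc, bigrams(doc), first_offset(doc, "iv"))
```

Each document produces three tokens and contains "iv" once, so a score based only on term matches has nothing to tell them apart; the offset is the only difference.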


(Zachary Tong) #2

You're correct that ngrams are only scored on how well they match, not on their position. I don't think there is a way to weight the score based on a token's offset.

What's the use-case here? You want to weight matches at the start of the word higher than at the end? You could probably accomplish that manually using span queries but it'd be a huge pain. If you can describe the motivation I might be able to help work out an alternative method :slight_smile:
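
As a rough sketch of the span-query idea, the Python snippet below builds a request body that keeps the normal `match` query and adds an optional `span_first` clause, so documents where the gram sits at the very first position get extra score. It assumes the field is named `text`, that the ngram tokenizer increments positions per token (as in the output quoted above), and that the gram is already exactly one indexed token, since `span_term` is not analyzed. The function name `prefix_boosted_query` and the boost value are hypothetical, and this has not been tested against a cluster.

```python
import json


def prefix_boosted_query(field, gram, boost=2.0):
    """Build a search body that favors documents whose first token is `gram`."""
    return {
        "query": {
            "bool": {
                # Every hit must still contain the gram somewhere.
                "must": [{"match": {field: gram}}],
                # Optional clause: only matches when the gram ends at
                # position <= 1, i.e. it is the first token of the field.
                "should": [
                    {
                        "span_first": {
                            "match": {"span_term": {field: gram}},
                            "end": 1,
                            "boost": boost,
                        }
                    }
                ],
            }
        }
    }


print(json.dumps(prefix_boosted_query("text", "vi"), indent=2))
```

Note that positions are counted from the start of the field, not the start of each word, so this only helps for single-word fields or for the first word of a longer one. Handling every word start would need more span clauses, which is where the pain comes in.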


(weibin.wu) #3

Thanks Polyfractal,

My use case is this.
Two words: [vivo] [ivid].
Analyzer: ngram, min_gram: 2, max_gram: 2
Search: "match": {"text": "vi"}
vivo doc_id: 1
ivid doc_id: 2
The search is supposed to return "vivo" first, because someone searching for "vi" is more likely looking for "vivo" than for "ivid".
However, the ngram match gives both the same score, and because of the doc_id, ivid is the first document returned.

Is there any way to solve this?

