When we use an ngram tokenizer, we get tokens with a start_offset, for example:
{
"token": "vi",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "iv",
"start_offset": 1,
"end_offset": 3,
"type": "word",
"position": 1
}
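(For reference, token output like the above can be reproduced with the _analyze API; this is just a sketch assuming the input text is "vivo" and a 2-gram tokenizer:)

POST _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 2,
    "max_gram": 2
  },
  "text": "vivo"
}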
Can we make a match with a bigger "start_offset" score lower than one with a smaller "start_offset"?
For example, say I have two documents:
{"text" : "vivo"}
{"text" : "ivov"}
Both are indexed with an ngram tokenizer (min_gram: 2, max_gram: 2).
When I search for "iv", can I expect {"text" : "ivov"} to get a higher score than {"text" : "vivo"}, because its "start_offset" is smaller?
Right now I see they get the same score in this case.
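For reference, the setup I'm describing is roughly this (the index name "my_index" and analyzer name "bigram_analyzer" are just placeholders):

PUT my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "bigram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2
        }
      },
      "analyzer": {
        "bigram_analyzer": {
          "type": "custom",
          "tokenizer": "bigram_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": { "type": "text", "analyzer": "bigram_analyzer" }
    }
  }
}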
You're correct: ngrams are only scored on how well they match, not on their position. I don't think there is a way to weight the score based on their offset.
What's the use case here? You want to weight matches at the start of the word higher than at the end? You could probably accomplish that manually using span queries, but it'd be a huge pain. If you can describe the motivation, I might be able to help work out an alternative method.
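For what it's worth, a rough sketch of that span approach might look like the following, boosting documents where the query bigram appears as the very first token (the index and field names are assumed from your example, and the boost value is arbitrary and untested):

GET my_index/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { "text": "iv" }
      },
      "should": {
        "span_first": {
          "match": {
            "span_term": { "text": "iv" }
          },
          "end": 1,
          "boost": 2.0
        }
      }
    }
  }
}

Note this boosts by token position rather than character offset, which amounts to the same thing here since the 2-grams are emitted in order. For queries that produce multiple grams you'd need a span clause per gram, which is where it becomes a huge pain.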
I have a use case like this.
Two words: [vivo] [ivid].
Analyzer: ngram, min_gram: 2, max_gram: 2
search: "match":{"text": "vi"}
vivo doc_id: 1
ivid doc_id: 2
It should return "vivo" first, because someone searching "vi" is more likely looking for "vivo" than for "ivid".
However, the ngram query gives both documents the same score, and because of doc_id ordering, "ivid" is the first document returned.