My goal is to rank my documents based on the position of my keyword in each document. For example, when I search for "Thor" and there are two documents, "Batman Thor" and "Thor Marvel", then "Thor Marvel" should rank higher since "Thor" is its first word.
Then I found the page "Advanced text scoring in scripts" in the documentation and thought I should use the _index variable. I wrestled with _index['FIELD'].get('TERM', _POSITIONS) in a Painless script, but it doesn't work: Elasticsearch (I'm using 5.2.0) doesn't recognize _index as a valid variable at all. It is insane, because I CAN get the positions and offsets of those words with a GET _termvectors request just fine. I tried the doc variable, but I couldn't figure out a way to get positions and offsets through doc either. Does that mean I need to run one query to get all the documents containing "Thor" and then, for each document, another _termvectors request to get the positions for ranking? Why can't I just access the position data when searching if it is already there?!
I wrote about (shameless plug) a method for doing this in my book.
The method involves injecting tokens at boundary positions and using them in a phrase query. In your field, text, you place a special token such as SENTINEL_BEGIN as the first token before ingesting into Elasticsearch:
SENTINEL_BEGIN Thor
Then you can add a boost query for the phrase "SENTINEL_BEGIN Thor". Because SENTINEL_BEGIN is a token that only occurs at the beginning of this specific field, this query only matches in the narrow case of Thor as the first word. Something like this query, where the first clause stands in for your normal search (shown here as a plain match on text) and the phrase clause is boosted by a bunch, as this is a rare but very important event:
    POST /test/foo/_search
    {
      "query": {
        "bool": {
          "should": [
            {"match": {
              "text": "Thor"
            }},
            {"match_phrase": {
              "text": {
                "query": "SENTINEL_BEGIN Thor",
                "boost": 10
              }
            }}
          ]
        }
      }
    }
This will be far simpler than getting deep into scripts, positions, and other possible solutions.
This may seem hacky, but really the whole point of a search engine is to structure data against your own notions of relevance. You're data modeling to make things findable based on your criteria, and not out of "pure" notions of data representation.
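The ingestion side of the sentinel approach can be sketched in a few lines of Python. This is only a minimal illustration under assumptions: the helper names with_sentinel and build_query are hypothetical, the field is assumed to be called text, and the sentinel is assumed never to occur in real content. It only builds the document and query bodies; sending them to Elasticsearch is left to whatever client you use.

```python
import json

# Assumed marker token; it must never appear in real field content.
SENTINEL = "SENTINEL_BEGIN"


def with_sentinel(text):
    """Prepend the sentinel so it becomes the first token of the field."""
    return f"{SENTINEL} {text}"


def build_query(field, keyword, boost=10):
    """Build a bool query: a normal match plus a boosted start-of-field phrase."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Stand-in for the normal search.
                    {"match": {field: keyword}},
                    # Matches only when the keyword directly follows the sentinel.
                    {
                        "match_phrase": {
                            field: {
                                "query": f"{SENTINEL} {keyword}",
                                "boost": boost,
                            }
                        }
                    },
                ]
            }
        }
    }


# Hypothetical documents, transformed before indexing.
docs = [{"text": with_sentinel(t)} for t in ["Batman Thor", "Thor Marvel"]]
body = build_query("text", "Thor")
print(json.dumps(body, indent=2))
```

Since the sentinel ends up in the indexed value, you would probably want to add it to a copy of the field used only for searching, or strip it again before displaying results.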
Thank you for the reply. But I would still like to use a script to get the positions stored in the term vectors. Since it is documented, it should be doable.