A question about custom scoring

Yuchen_He · February 26, 2017, 4:14am

My goal is to rank my documents based on the position at which my keyword occurs in them. For example, when I search for "Thor" and there are two documents, "Batman Thor" and "Thor Marvel", then "Thor Marvel" should rank higher since "Thor" is its first word.

Then I saw a page in the documentation, "Advanced text scoring in scripts", and thought I should use the _index variable here. I have been wrestling with _index['FIELD'].get('TERM', _POSITIONS) in a Painless script, but it won't work. Elasticsearch (I'm using 5.2.0) does not recognize _index as a valid variable at all. It is insane, because I CAN get the position and offset of those words with a GET _termvectors request just fine. I tried the doc variable, but I couldn't figure out a way to get the position and offset through doc either.

Does that mean I need to do one query to get all the documents containing "Thor" and then, for each document, another request to _termvectors to get the position for the ranking? Why can't I just access the position data when searching if it is already there?!

I found this post on Stack Overflow, which describes pretty much the same problem, though it seems to target an older version of Elasticsearch: http://stackoverflow.com/questions/27538766/scoring-by-term-position-in-elasticsearch (and of course, the script from that post doesn't work for me).

I've been dealing with this for days. I really hope someone can give me some help on this issue.

softwaredoug · February 26, 2017, 7:27pm

I wrote about (shameless plug) a method for doing this in my book.

The method involves injecting tokens at boundary positions and using them in a phrase query. Say you have a field, text. Before ingesting into Elasticsearch, you place a special token such as SENTINEL_BEGIN as the first token in that field:

SENTINEL_BEGIN Thor

Then you can add a boost query for the phrase "SENTINEL_BEGIN Thor" and as SENTINEL_BEGIN is a token only occurring at the beginning of this specific field, this query will only match in the narrow case of Thor as the first word. Something like this query:

POST /test/foo/_search
{
    "query": {
        "bool": {
            "should": [
                {"match": {
                    /*your normal search*/
                }},
                {"match_phrase": {
                    "text": {
                       "query": "SENTINEL_BEGIN Thor",
                       "boost": 10
                       /*boosted by a bunch as this is a rare, but very important event*/
                    }
                }}
            ]
        }
    }
}

This will be far simpler than getting deep into scripts, positions, and other possible solutions.
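As a rough illustration of the two steps above (mark the field before ingesting, then add the boosted phrase clause), here is a minimal Python sketch. It only builds plain dictionaries and does not talk to Elasticsearch. The helper names add_sentinel and build_query are hypothetical, and using a plain match on the same text field as the "normal search" is an assumption, since the query above leaves that part open.

```python
# Hypothetical marker token; it must never occur in real document text.
SENTINEL = "SENTINEL_BEGIN"


def add_sentinel(doc, field="text"):
    """Return a copy of doc with the sentinel prepended to the given field."""
    marked = dict(doc)
    marked[field] = f"{SENTINEL} {doc[field]}"
    return marked


def build_query(term, field="text", boost=10):
    """Build a search body: a normal match plus a boosted first-word phrase."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Assumed "normal search": a simple match on the same field.
                    {"match": {field: term}},
                    # Only matches when the term directly follows the sentinel,
                    # i.e. when it is the first real word of the field.
                    {
                        "match_phrase": {
                            field: {
                                "query": f"{SENTINEL} {term}",
                                "boost": boost,
                            }
                        }
                    },
                ]
            }
        }
    }


# Documents as they would be sent for indexing.
docs = [add_sentinel({"text": t}) for t in ("Batman Thor", "Thor Marvel")]
# Request body for searching "Thor".
query = build_query("Thor")
```

With these inputs, only "SENTINEL_BEGIN Thor Marvel" contains the phrase "SENTINEL_BEGIN Thor", so only that document would pick up the boosted clause.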

This may seem hacky, but really the whole point of a search engine is to structure data against your own notions of relevance. You're data modeling to make things findable based on your criteria, and not out of "pure" notions of data representation.

Yuchen_He · February 27, 2017, 2:43am

Thank you for the reply. But I still want to use their scripting to get the positions stored in the term vectors (_termvectors). Since they have the documentation, it should be doable.

system · March 27, 2017, 2:43am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
