BM25 normalizing per field

tp-c-c · October 8, 2020, 4:29pm

Hi,

I have a question about the BM25 ranking algorithm, specifically about how it normalizes by field length. I understand the general descriptions of the algorithm, where the normalization uses the document length, as in the standard BM25 formula.
In that formula the term's contribution is scaled down by the ratio of the document length to the average document length, which makes sense.
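
For reference, the textbook form of the formula I have in mind is below. The symbols are the usual ones rather than anything specific to Elasticsearch: $f(q_i, D)$ is the frequency of term $q_i$ in document $D$, $|D|$ is the document length, $\mathrm{avgdl}$ is the average document length, and $k_1$ and $b$ are tuning parameters.

$$
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
$$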

It is my understanding, however, that Elasticsearch does this per field. While it makes sense to determine the importance of the specific fields that were hit, this makes hits in average-length short fields just as important as hits in average-length long fields. Say we have an index where the average title field has length 5 and the average summary field has length 2000. A hit in an 8-word title field of document1 would be ranked as less important than a hit in a 1000-word summary field of document2, because the title is longer than its field average while the summary is shorter than its field average. As a result, document2 would be ranked above document1. This seems less than ideal, as the hit in the summary had a more than 100 times higher chance to match.
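
To make the example concrete, here is a minimal sketch of the per-field calculation as I understand it. It uses the textbook BM25 term weight with the default parameters k1 = 1.2 and b = 0.75, assumes a single matching term (tf = 1) in each field, and leaves out IDF on the assumption that it is the same for both fields. The function name `bm25_term_weight` is my own, and Lucene's actual implementation differs in details such as how field lengths are encoded, so treat the numbers as illustrative only.

```python
def bm25_term_weight(tf, field_len, avg_field_len, k1=1.2, b=0.75):
    """Textbook BM25 term weight without the IDF factor (hypothetical helper)."""
    # Length normalization: greater than 1 for fields longer than average,
    # less than 1 for fields shorter than average.
    norm = 1 - b + b * field_len / avg_field_len
    return tf * (k1 + 1) / (tf + k1 * norm)


# One hit in an 8-word title, where the average title length is 5.
title_hit = bm25_term_weight(tf=1, field_len=8, avg_field_len=5)

# One hit in a 1000-word summary, where the average summary length is 2000.
summary_hit = bm25_term_weight(tf=1, field_len=1000, avg_field_len=2000)

print(f"title hit:   {title_hit:.3f}")
print(f"summary hit: {summary_hit:.3f}")
```

Under these assumptions the summary hit gets the higher weight (roughly 1.26 versus 0.80), even though the summary is 125 times longer than the title.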

Is my interpretation correct? If so, is there any way to avoid this behavior? I would like to avoid giving hardcoded boosts to specific fields. It would be ideal if the importance of fields simply scaled inversely with length, as in tf-idf. I don't want to throw away BM25 if I can help it, though, as it has other advantages over tf-idf.

