Hi,
I have a question about the BM25 ranking algorithm, specifically about the normalization by field length. I understand the general descriptions of the algorithm, where the normalization uses the document length, as in the standard formula.
In this formula the term-frequency part of the score is normalized by the document length divided by the average document length, which makes sense.
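For reference, this is the usual textbook form of BM25 as I understand it. The symbols are my own labels: f(q_i, D) is the frequency of query term q_i in document D, |D| is the document length, avgdl is the average document length, and k_1 and b are the free parameters.

$$
\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}
$$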
It is my understanding, however, that Elasticsearch does this per field. While it makes sense to determine the importance of the specific fields that were hit, this makes hits in average-length short fields just as important as hits in average-length long fields. Say we have an index where the average title field is 5 words long and the average summary field is 2000 words long. A hit in an 8-word title field of document1 would be ranked as less important than a hit in a 1000-word summary field of document2, because the title is longer than its field average while the summary is shorter than its field average. As a result, document2 would be ranked above document1. This seems less than ideal, as the summary had more than 100 times as many words that could have matched.
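To make the example concrete, here is a rough sketch of the per-field term-frequency component as I understand it. It assumes the textbook BM25 form with k1 = 1.2 and b = 0.75 (the Elasticsearch defaults), a term frequency of 1 in both fields, and the same IDF for both fields. The function name `bm25_tf` is my own, not an Elasticsearch API, and the real scorer may differ in details.

```python
def bm25_tf(freq, field_len, avg_field_len, k1=1.2, b=0.75):
    """Textbook BM25 term-frequency component, normalized per field.

    Illustrative helper only: IDF and any query-time boosts are left out.
    """
    # Length normalization: > 1 for fields longer than average, < 1 for shorter.
    length_norm = 1 - b + b * (field_len / avg_field_len)
    return freq * (k1 + 1) / (freq + k1 * length_norm)


# document1: one hit in an 8-word title, average title length 5
title_score = bm25_tf(freq=1, field_len=8, avg_field_len=5)

# document2: one hit in a 1000-word summary, average summary length 2000
summary_score = bm25_tf(freq=1, field_len=1000, avg_field_len=2000)

print(f"title hit:   {title_score:.4f}")
print(f"summary hit: {summary_score:.4f}")
print("summary ranks higher:", summary_score > title_score)
```

Under these assumptions the summary hit gets the larger value, which is the behavior I am describing: each field is only compared against its own average length, never against the other field.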
Is my interpretation correct? If so, is there any way to avoid this behavior? I would like to avoid doing this by giving hardcoded boosts to specific fields. It would be ideal if the importance of fields would just scale inversely with length, like in tf-idf. I don't want to throw away BM25 if I can help it though, as it has other advantages over tf-idf.