Why is the field length approximate when calculating the BM25 score?

I used the explain API to see the details of the BM25 calculation. In the tf calculation details, the dl (field length) value is not correct. The following JSON is what I got from the explain API:

{
    "value": 0.4435187,
    "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
    "details": [
        {
            "value": 1.0,
            "description": "freq, occurrences of term within document",
            "details": []
        },
        {
            "value": 1.2,
            "description": "k1, term saturation parameter",
            "details": []
        },
        {
            "value": 0.75,
            "description": "b, length normalization parameter",
            "details": []
        },
        {
            "value": 128.0,
            "description": "dl, length of field (approximate)",
            "details": []
        },
        {
            "value": 120.666664,
            "description": "avgdl, average length of field",
            "details": []
        }
    ]
}
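To make the formula in the description concrete, here is a minimal Python sketch that plugs the values from the explain output into it. The function name bm25_tf is only for illustration and is not part of any Elasticsearch or Lucene API.

```python
def bm25_tf(freq, k1, b, dl, avgdl):
    """tf part of BM25, as described in the explain output:
    freq / (freq + k1 * (1 - b + b * dl / avgdl))
    """
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


# Values copied from the explain output above.
tf = bm25_tf(freq=1.0, k1=1.2, b=0.75, dl=128.0, avgdl=120.666664)
print(round(tf, 4))  # 0.4435
```

The result agrees with the reported "value": 0.4435187, so the score was computed with dl = 128.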

The output includes "description": "dl, length of field (approximate)". I want to know why dl is approximate.

Field length is the number of terms in the given field of the given document.

For example:
If doc1 has a field field1 with the value "foo foo bar", its field length is 3.
If doc2 has a field field1 with the value "foo foo foo bar bar", its field length is 5.
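The two examples can be reproduced with a small Python sketch. It assumes plain whitespace tokenization, which is a simplification: the real count depends on the analyzer configured for the field. The helper field_length is hypothetical.

```python
def field_length(text):
    # Hypothetical helper: whitespace splitting stands in for the
    # field's analyzer, which may produce a different number of terms.
    return len(text.split())


print(field_length("foo foo bar"))          # 3
print(field_length("foo foo foo bar bar"))  # 5
```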

Is your question why it is approximate?

Yes. I found that the field length in the explain API response is not correct:

{
    "value": 128.0,
    "description": "dl, length of field (approximate)",
    "details": []
}

The real value is 131.
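A likely explanation, as far as I know: Lucene does not store the exact field length for scoring. It keeps a one-byte "norm" per field per document, and recent versions of BM25Similarity encode the length with SmallFloat.intToByte4 and decode it with SmallFloat.byte4ToInt, which is lossy for larger values. The sketch below is a Python port of that encoding written from memory, so treat it as an assumption and not as the authoritative implementation. The function names are mine.

```python
# Sketch of Lucene's SmallFloat.intToByte4 / byte4ToInt, ported from memory.
# All names here are illustrative; check the Lucene source for the real code.

def long_to_int4(i):
    num_bits = i.bit_length()
    if num_bits < 4:
        return i  # small values are stored exactly
    shift = num_bits - 4
    encoded = (i >> shift) & 0x07  # 3 bits after the implicit leading 1
    return encoded | ((shift + 1) << 3)  # shift + 1, since 0 marks small values


def int4_to_long(i):
    bits = i & 0x07
    shift = (i >> 3) - 1
    if shift == -1:
        return bits  # small value, stored exactly
    return (bits | 0x08) << shift  # restore the leading 1, drop the low bits


MAX_INT4 = long_to_int4(2**31 - 1)
NUM_FREE_VALUES = 255 - MAX_INT4  # lengths below this are kept exact


def int_to_byte4(length):
    if length < NUM_FREE_VALUES:
        return length
    return NUM_FREE_VALUES + long_to_int4(length - NUM_FREE_VALUES)


def byte4_to_int(b):
    if b < NUM_FREE_VALUES:
        return b
    return NUM_FREE_VALUES + int4_to_long(b - NUM_FREE_VALUES)


def approximate_length(length):
    # Round trip through the one-byte encoding, as scoring would see it.
    return byte4_to_int(int_to_byte4(length))


print(approximate_length(131))  # 128
```

Under this sketch, lengths 128 through 135 all decode to 128, which would explain why a field with 131 terms is reported as dl = 128. Short fields such as the 3-term and 5-term examples above survive the round trip unchanged.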
