Function_score with order


(Doron Tsur) #1

I have documents for which I calculate a function_score with the search query below. The scoring seems to work well, yet some documents get the same score. This causes results to repeat, and some documents never come back. In essence I am sorting by score. I wonder if I can apply some secondary sorting to the results so that the order (i.e. the score) is unique. For example, each document has a unique string field, and I would like to order groups of documents with the same score by it.

{
  "from": "some_number",
  "size": "some_number",
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "should": [
            "some terms ..."
          ],
          "minimum_should_match": 1,
          "must": [
            "some must filters"
          ]
        }
      },
      "functions": [
        {
          "gauss": {
            "createdAt": {
              "scale": "30d",
              "decay": 0.8
            }
          },
          "weight": 0.3
        },
        {
          "script_score": {
            "script": {
              "source": "return _score + some calulation"
            }
          },
          "weight": 0.6
        }
      ],
      "boost_mode": "sum"
    }
  }
}

I've tried several things, like applying a sort and field_value_factor, but that doesn't seem to work. I'd appreciate any other tips.


(Abdon Pijpelink) #2

When you say that applying a sort did not work, can you explain why not? It is a common pattern to sort on a document's _id as the secondary sort, as a tie-breaker, if you wish to get a consistent ordering.

Something like this should work:

GET _search
{
  "query": {
    "function_score": {
      "query": {
        YOUR QUERY
      },
      "functions": [
        YOUR FUNCTIONS
      ]
    }
  },
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    },
    {
      "_id": {
        "order": "desc"
      }
    }
  ]
}

(Doron Tsur) #3

I've tried this type of sorting before, but after reading your recommendation I tried again. It seems that the gauss function is the culprit: when I comment out this scoring function, there are no duplicates in the results. The createdAt values are all at (or around) the same time, so I understand why there are duplicates, but I don't understand why the secondary sort doesn't take care of that.


(Abdon Pijpelink) #4

I'm guessing that what's happening is this: all documents have a slightly different value for createdAt. As a result, they get different scores. Even if two documents differ by just a few milliseconds, their scores are going to be slightly different, and as a result the secondary sort order never comes into play.

Would it be possible to reindex your documents with less precise values for createdAt? For example, if you round createdAt down to the nearest hour, then all documents that were created at about the same time will get the same score. You would then get your desired deterministic sort order by using _id as the secondary sort criterion.
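One way to sketch that rounding at index time is an ingest pipeline with a script processor. This is only an illustration: it assumes createdAt is indexed as an ISO-8601 date string, and the pipeline name and the createdAtHour target field are made-up names.

```json
PUT _ingest/pipeline/round-created-at
{
  "description": "Sketch: copy createdAt, rounded down to the hour, into createdAtHour",
  "processors": [
    {
      "script": {
        "source": "ctx.createdAtHour = ZonedDateTime.parse(ctx.createdAt).truncatedTo(ChronoUnit.HOURS).toString()"
      }
    }
  ]
}
```

Pointing the gauss function at createdAtHour instead of createdAt would then give documents created within the same hour identical scores, letting the _id tie-breaker kick in.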


(Doron Tsur) #5

Perhaps I'm misunderstanding something. Doesn't scale do that somehow?


(Abdon Pijpelink) #6

No, scale determines how fast the score decays toward zero the further you get from the origin (by default, the current datetime). See the diagram in the decay functions section of our documentation.
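For context, the decay-function docs define gauss as exp(-(max(0, |value - origin| - offset))^2 / (2σ^2)), with σ chosen so that the score is exactly decay at a distance of scale from the origin. A small Python sketch of that formula, with field values as epoch milliseconds and the numbers from the query above (scale=30d, decay=0.8):

```python
import math

def gauss_decay(value_ms, origin_ms, scale_ms, decay, offset_ms=0):
    """Gauss decay curve as defined in the Elasticsearch decay-function docs."""
    x = max(0.0, abs(value_ms - origin_ms) - offset_ms)
    # sigma is chosen so that the score is exactly `decay` at distance `scale`
    sigma2 = -scale_ms ** 2 / (2.0 * math.log(decay))
    return math.exp(-x ** 2 / (2.0 * sigma2))

DAY = 86_400_000  # epoch milliseconds per day

print(gauss_decay(0, 0, 30 * DAY, 0.8))                   # at the origin → 1.0
print(round(gauss_decay(30 * DAY, 0, 30 * DAY, 0.8), 3))  # exactly `scale` away → 0.8
```

Since every millisecond of difference in createdAt moves a document along this curve, two documents created a few milliseconds apart get slightly different scores.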


(Doron Tsur) #7

Rounding down to the hour can work, but it will eventually change the order. Why not compare to epoch time somehow (ascending order)? That seems to make more sense stability-wise.


(Abdon Pijpelink) #8

Maybe I'm misunderstanding your issue. Internally, Elasticsearch stores dates as epoch milliseconds. My assumption was that what you are seeing is caused by documents not having exactly the same value for createdAt, so those documents all get slightly different scores.
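If reindexing is not an option, the rounding could also happen at query time. A rough sketch that replaces the gauss entry in the functions array with a script_score computing the same decay shape (0.8 at 30 days) on the hour-rounded epoch value — the params.now value is a placeholder the client would fill with the current epoch milliseconds, and it assumes createdAt is a date field:

```json
{
  "script_score": {
    "script": {
      "source": "long ms = doc['createdAt'].value.toInstant().toEpochMilli(); long hour = ms - ms % 3600000L; double days = (params.now - hour) / 86400000.0; return Math.pow(0.8, Math.pow(days / 30.0, 2));",
      "params": { "now": 1700000000000 }
    }
  },
  "weight": 0.3
}
```

Documents created within the same hour would then score identically, so the _id tie-breaker decides their order.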


(system) closed #9

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.