JLH score calculation


(‬‏) #1

I'm trying to manually calculate the JLH score for a significant_terms aggregation, but I don't get the same score as in the response.

Steps:

query:

{
  "query": {
    "terms": {
      "${myFieldName}": [
        "hurricane"
      ]
    }
  },
  "aggregations": {
    "significant_storm_types": {
      "significant_terms": {
        "field": "${myFieldName}",
        "size": 100,
        "jlh": {}
      }
    }
  }
}

response:
...

"hits": {
    "total": 106,

...

"aggregations": {
    "significant_storm_types": {
      "doc_count": 106,
      "buckets": [
        {
          "key": "hurricane",
          "doc_count": 105,
          "score": 1407.7837073557366,
          "bg_count": 106
        }

The total number of documents in this type:

"hits": {
    "total": 50956,

Now we have everything for the calculation (from the documentation):
jlh score = (foregroundPercent - backgroundPercent) * (foregroundPercent / backgroundPercent)
Putting the numbers in:
jlh score = (105/106 - 106/50956) * ((105/106) / (106/50956)) = 470.699067
This does not equal the score of the term in the bucket (1407.7837073557366).
Where does the issue seem to be?

*Also, in the response, the total hits (106) differs from the doc_count of the term's bucket (105). This was surprising since I'm working with only one shard.

Thanks


(Mark Harwood) #2

Are you sure the bg size of 50956 is correct?

The default background is the number of docs in the index - not those of a type. You can use a background_filter if you want to finesse what you use as a background but it will be slower to run.

If I substitute a bg size of 152188, I get a jlh score of 1407.783707, which looks close to what you are seeing.
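
The arithmetic can be checked with a minimal Python sketch of the formula quoted from the documentation. The function and parameter names are illustrative only and are not part of any Elasticsearch API; the inputs are the counts from the response above.

```python
def jlh_score(fg_count, fg_size, bg_count, bg_size):
    """JLH as quoted in the thread: (fg% - bg%) * (fg% / bg%)."""
    fg_pct = fg_count / fg_size   # share of foreground docs containing the term
    bg_pct = bg_count / bg_size   # share of background docs containing the term
    return (fg_pct - bg_pct) * (fg_pct / bg_pct)

# Background = docs of the type only (the number used in the question)
print(jlh_score(105, 106, 106, 50956))

# Background = all docs in the index (the substituted bg size)
print(jlh_score(105, 106, 106, 152188))
```

The first call gives roughly 470.70 and the second roughly 1407.78, so the only thing that changes between the two results is the background size.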

That is odd. Is this data that you can share?


(‬‏) #3

Bg size was indeed the key. Looking at the /_stats endpoint of the index and counting all of its docs (count plus deleted), I got the same result as in the response.
query:

GET ${my_index}/_stats

response:

"total": {
      "docs": {
        "count": 112943,
        "deleted": 39245
      }
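
For reference, a small Python sketch of how these two numbers add up to the background size. The dictionary just mirrors the excerpt of the _stats response shown here, and the variable names are my own.

```python
# Excerpt of the _stats response, copied from above
stats_total = {"docs": {"count": 112943, "deleted": 39245}}

# Background size that reproduces the score: live docs plus deleted docs
bg_size = stats_total["docs"]["count"] + stats_total["docs"]["deleted"]
print(bg_size)  # 152188
```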

So I'm wondering why other types in the index should participate in the calculation at all.


(Mark Harwood) #5

Significant terms uses some of the pre-computed stats that Lucene stores for rapid access. Some of these stats do not reflect constructs that Elasticsearch layers on top, such as multiple types (which is part of why they are going away) and nested docs.
Multiple types is an imperfect abstraction: fields, for example, could be shared across types, and while you might want bgSize to represent the number of docs of a type, the bgCount for term X in field Y could then be bigger than bgSize, which would be nonsensical. This is one of the reasons why bgSize is all the docs in the Lucene index and not just those of a type.

