JLH score calculation

Adam4 · January 17, 2018, 6:25am

I'm trying to manually calculate the JLH score for a significant_terms aggregation, but I don't get the same score that Elasticsearch returns.

Steps:

query:

{
  "query": {
    "terms": {
      "${myFieldName}": [
        "hurricane"
      ]
    }
  },
  "aggregations": {
    "significant_storm_types": {
      "significant_terms": {
        "field": "${myFieldName}",
        "size": 100,
        "jlh": {}
      }
    }
  }
}

response:
...

"hits": {
    "total": 106,

...

"aggregations": {
    "significant_storm_types": {
      "doc_count": 106,
      "buckets": [
        {
          "key": "hurricane",
          "doc_count": 105,
          "score": 1407.7837073557366,
          "bg_count": 106
        }

The total number of documents in this type:

"hits": {
    "total": 50956,

Now we have everything for the calculation (from the documentation):
jlh score = (foregroundPercent - backgroundPercent) * (foregroundPercent / backgroundPercent)
Putting the numbers in:
jlh score = (105/106 - 106/50956) * ((105/106) / (106/50956)) = 470.699067
This does not equal the score of the term in the bucket (1407.7837073557366).
What seems to be the issue?

Also, in the response, the total hits (106) differs from the doc_count of the term's bucket (105). This was surprising since I'm working with only one shard.

Thanks

Mark_Harwood · January 17, 2018, 9:42am

Are you sure the bg size of 50956 is correct?

The default background is the number of docs in the whole index, not just those of a type. You can use a background_filter if you want to narrow what is used as the background, but it will be slower to run.
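
As a rough illustration of the background_filter option mentioned above, here is a minimal sketch that builds the request body in Python. The field name `storm_type` and the type name `my_type` are hypothetical placeholders (the thread only shows `${myFieldName}`), and filtering on `_type` with a term query is an assumption about how one might restrict the background to a single type on an Elasticsearch version that still has multiple types.

```python
import json

FIELD = "storm_type"   # hypothetical field name; the thread uses ${myFieldName}
DOC_TYPE = "my_type"   # hypothetical mapping type name

body = {
    "query": {"terms": {FIELD: ["hurricane"]}},
    "aggregations": {
        "significant_storm_types": {
            "significant_terms": {
                "field": FIELD,
                "size": 100,
                "jlh": {},
                # Assumption: limit the background set to docs of one type.
                "background_filter": {"term": {"_type": DOC_TYPE}},
            }
        }
    },
}

# Print the JSON that would be sent as the search request body.
print(json.dumps(body, indent=2))
```

With a filter like this, the background counts are computed against the filtered set rather than read from the index-wide stats, which is why it is expected to be slower.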

If I substitute a bg size of 152188, I get a jlh score of 1407.783707, which looks close to what you are seeing.
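
The two results can be reproduced with a short sketch. It only applies the formula quoted from the documentation earlier in the thread; `jlh_score` is a hypothetical helper name, not Elasticsearch's implementation, and it skips any edge-case handling (for example, a zero background count).

```python
def jlh_score(fg_count, fg_size, bg_count, bg_size):
    """(fgPercent - bgPercent) * (fgPercent / bgPercent), as quoted in the thread."""
    fg_percent = fg_count / fg_size
    bg_percent = bg_count / bg_size
    return (fg_percent - bg_percent) * (fg_percent / bg_percent)

# Counts taken from the aggregation response in this thread.
FG_COUNT, FG_SIZE, BG_COUNT = 105, 106, 106

score_type_bg = jlh_score(FG_COUNT, FG_SIZE, BG_COUNT, 50956)    # docs in the type
score_index_bg = jlh_score(FG_COUNT, FG_SIZE, BG_COUNT, 152188)  # substituted bg size

print(f"bg size 50956:  {score_type_bg:.6f}")   # about 470.70
print(f"bg size 152188: {score_index_bg:.6f}")  # about 1407.78
```

Only the background size changes between the two calls, which is enough to move the score from the hand-calculated value to the one in the bucket.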

The mismatch between the total hits (106) and the bucket's doc_count (105) is odd. Is this data that you can share?

Adam4 · January 21, 2018, 7:56pm

Bg size was indeed the key. Looking at the /_stats endpoint of the index and adding up all of its docs (count plus deleted, 112943 + 39245 = 152188), I got the same result as in the response.
query:

GET ${my_index}/_stats

response:

"total": {
      "docs": {
        "count": 112943,
        "deleted": 39245
      }
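
A small sketch of the check described above: it sums `count` and `deleted` from the stats excerpt and plugs the result into the formula from the documentation. The dict below is just the excerpt retyped by hand; that count plus deleted equals the background size is what was observed for this index, not a documented guarantee.

```python
# Excerpt of the _stats response above, retyped as a Python dict.
stats = {"total": {"docs": {"count": 112943, "deleted": 39245}}}

docs = stats["total"]["docs"]
bg_size = docs["count"] + docs["deleted"]  # live docs plus docs marked deleted

# Counts from the significant_terms bucket for "hurricane".
fg_percent = 105 / 106
bg_percent = 106 / bg_size
score = (fg_percent - bg_percent) * (fg_percent / bg_percent)

print(bg_size)          # 152188
print(round(score, 4))  # close to the bucket score of 1407.7837
```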

So I'm wondering why other types in the index should even participate in the calculation.

Mark_Harwood · January 22, 2018, 9:39am

Significant terms uses some of the pre-computed stats that Lucene stores for rapid access. Some of these stats do not reflect constructs that Elasticsearch layers on top, such as multiple types (which is part of why they are going away) and nested docs.
Multiple types are an imperfect abstraction. Fields, for example, can be shared across types, so while you might want bgSize to represent the number of docs of a type, the bgCount for term X in field Y could then be bigger than bgSize, which would be nonsensical. This is one of the reasons why bgSize is all the docs in the Lucene index and not just those of a type.
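
To make the bgCount versus bgSize point concrete, here is a toy sketch with entirely hypothetical numbers (two types, `type_a` and `type_b`, sharing one field). It assumes the term count comes from an index-wide stat, as described above, and shows what happens if it is divided by a single type's doc count.

```python
# Hypothetical toy numbers, not taken from the thread.
docs_by_type = {"type_a": 100, "type_b": 900}      # docs per type
term_docs_by_type = {"type_a": 40, "type_b": 80}   # docs containing term X in shared field Y

# An index-wide term stat cannot tell the types apart, so it sees all 120 docs.
bg_count = sum(term_docs_by_type.values())

# Mixing an index-wide bgCount with a per-type bgSize gives a "percentage" above 1.
mixed_percent = bg_count / docs_by_type["type_a"]

# Using the whole index for bgSize keeps the ratio within [0, 1].
index_percent = bg_count / sum(docs_by_type.values())

print(mixed_percent)  # 1.2
print(index_percent)  # 0.12
```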

