significant_terms: weak performance and inconsistent results


#1

Hi.

After upgrading Elasticsearch from 2.x to 6.1.2 (and creating a new index from scratch), I am seeing significant_terms behavior I don't understand. Pretty much the only thing that has changed is the type of productNodeIds, which went from long to integer, but that shouldn't make any difference, should it?

The query & mappings

{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "applicationDate": {
              "gte": "2015-01-01"
            }
          }
        },
        {
          "terms": {
            "owners.id": ["/owner/foo", "/owner/bar"]
          }
        }
      ]
    }
  },
  "aggregations": {
    "significantProducts": {
      "significant_terms": {
        "field": "productNodeIds",
        "size": 3,
        "min_doc_count": 10,
        "background_filter": {
          "bool": {
            "filter": [
              {
                "range": {
                  "applicationDate": {
                    "gte": "2008-01-01"
                  }
                }
              },
              {
                "terms": {
                  "owners.id": ["/owner/foo", "/owner/bar"]
                }
              }
            ]
          }
        }
      }
    }
  }
}

Having mappings:

"applicationDate": {
  "type": "date",
  "format": "date"
},
"productNodeIds": {
  "type": "integer"
},
"owners": {
  "properties": {
    "id": {
      "type": "keyword"
    }
  }
}

And the response:

{
  "took": 91839,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 256,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "significantProducts": {
      "doc_count": 256,
      "bg_count": 1314,
      "buckets": [
        {
          "key": 62419,
          "doc_count": 22,
          "score": 9.6182861328125,
          "bg_count": 0
        },
        {
          "key": 87339,
          "doc_count": 22,
          "score": 9.6182861328125,
          "bg_count": 0
        },
        {
          "key": 10188,
          "doc_count": 22,
          "score": 9.6182861328125,
          "bg_count": 0
        }
      ]
    }
  }
}

Questions

  1. How can the query be so slow (92 seconds) given there are only 256 foreground hits and 1314 background hits? The query itself is instant, and if I run e.g. a terms aggregation instead of significant_terms, that is pretty much instant as well.
  2. The aggregation buckets make no sense. The background_filter is a superset of the foreground, so how can bg_count be zero? It shouldn't be possible for it to be smaller than doc_count.

What on earth is happening here?

Thank you! 🙂


(Mark Harwood) #2

This is down to a change in Lucene-land.
Numeric fields were optimized for range queries rather than direct look-ups. Previously we could directly look up the document frequency (DF) of productId:2342, but this feature was removed on the assumption that no one is interested in the DF of numerics, which usually represent quantities like time or money. However, numerics are also used for IDs, as in your case, and features like significant_terms do need the DF, but are now forced to derive it by running a query for each ID and counting the matching documents. Sucks, huh?
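To illustrate the extra work: for each candidate term, the doc-frequency lookup on a numeric field now effectively amounts to a count query like the one below (an illustrative request, not the literal internal code; the index name `my-index` is a placeholder):

```json
GET my-index/_count
{
  "query": {
    "term": { "productNodeIds": 62419 }
  }
}
```

With hundreds of candidate terms, running one such query per term adds up, which is consistent with the 92-second response time you saw.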

The solution is to index numeric IDs as keyword; the look-ups can then hit the DF values that Lucene stores for keyword terms.
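Assuming your index is named `products` and uses the default `doc` mapping type (both placeholders here), the fix could look something like this: create a new index with productNodeIds mapped as keyword, then reindex into it. Note that keyword values are strings, so the IDs will come back as e.g. "62419" in aggregation buckets.

```json
PUT products-v2
{
  "mappings": {
    "doc": {
      "properties": {
        "productNodeIds": { "type": "keyword" }
      }
    }
  }
}

POST _reindex
{
  "source": { "index": "products" },
  "dest": { "index": "products-v2" }
}
```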


#3

Sucks it indeed does. 😢

But thanks for the quick response. This was driving me nuts...


(system) #4

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.