significant_terms: weak performance and inconsistent results


#1

Hi.

After upgrading Elasticsearch from 2.x to 6.1.2 (and creating a new index from scratch), I am seeing significant_terms behavior I don't understand. Pretty much the only thing that has changed is the type of productNodeIds, which went from long to integer, but that shouldn't make any difference, should it?

The query & mappings

{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "applicationDate": {
              "gte": "2015-01-01"
            }
          }
        },
        {
          "terms": {
            "owners.id": ["/owner/foo", "/owner/bar"]
          }
        }
      ]
    }
  },
  "aggregations": {
    "significantProducts": {
      "significant_terms": {
        "field": "productNodeIds",
        "size": 3,
        "min_doc_count": 10,
        "background_filter": {
          "bool": {
            "filter": [
              {
                "range": {
                  "applicationDate": {
                    "gte": "2008-01-01"
                  }
                }
              },
              {
                "terms": {
                  "owners.id": ["/owner/foo", "/owner/bar"]
                }
              }
            ]
          }
        }
      }
    }
  }
}

Having mappings:

"applicationDate": {
  "type": "date",
  "format": "date"
},
"productNodeIds": {
  "type": "integer"
},
"owners": {
  "properties": {
    "id": {
      "type": "keyword"
    }
  }
}

And the response:

{
  "took": 91839,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 256,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "significantProducts": {
      "doc_count": 256,
      "bg_count": 1314,
      "buckets": [
        {
          "key": 62419,
          "doc_count": 22,
          "score": 9.6182861328125,
          "bg_count": 0
        },
        {
          "key": 87339,
          "doc_count": 22,
          "score": 9.6182861328125,
          "bg_count": 0
        },
        {
          "key": 10188,
          "doc_count": 22,
          "score": 9.6182861328125,
          "bg_count": 0
        }
      ]
    }
  }
}

Questions

  1. How can the query be so slow (92 seconds) given there are only 256 foreground hits and 1314 background hits? The query itself is instant, and if I run e.g. a terms aggregation instead of significant_terms, that is pretty much instant as well.
  2. The aggregation buckets make no sense. The background_filter is a superset of the foreground, so how can bg_count be zero? It shouldn't be possible for it to be smaller than doc_count.

What on earth is happening here?

Thank you! 🙂


(Mark Harwood) #2

This is down to a change in Lucene-land.
Numeric fields were optimized for range queries rather than direct look-ups. Previously we could directly look up the document frequency (DF) of productId:2342, but this feature was removed on the assumption that no one is interested in the DF of numerics, which usually represent quantities like time or money. However, numerics are also used for IDs, as in your case, and features like significant_terms do need the DF, but are now forced to derive it by running a query for each ID and counting the matching documents. Sucks, huh?
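To illustrate the extra work: for each candidate term, the doc-frequency lookup on a numeric field now effectively amounts to a count query like the one below (an illustrative request, not the literal internal code; the index name `my-index` is a placeholder):

```json
GET my-index/_count
{
  "query": {
    "term": { "productNodeIds": 62419 }
  }
}
```

With hundreds of candidate terms, running one such query per term adds up, which is consistent with the 92-second response time you saw.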

The solution is to index numeric IDs as keyword; the look-ups can then hit the DF values that Lucene stores for keyword terms.
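Assuming your index is named `products` and uses the default `doc` mapping type (both placeholders here), the fix could look something like this: create a new index with productNodeIds mapped as keyword, then reindex into it. Note that keyword values are strings, so the IDs will come back as e.g. "62419" in aggregation buckets.

```json
PUT products-v2
{
  "mappings": {
    "doc": {
      "properties": {
        "productNodeIds": { "type": "keyword" }
      }
    }
  }
}

POST _reindex
{
  "source": { "index": "products" },
  "dest": { "index": "products-v2" }
}
```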


#3

Sucks it indeed does. 😢

But thanks for the quick response. This was driving me nuts...


(system) #4

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.