Performance degradation of Significant Terms Aggregation after upgrade (v2.4 -> v5.4)

We tried to migrate from 2.4 to 5.4, but we noticed quite a significant performance degradation.

Performance decreases significantly, especially for the Significant Terms Aggregation.

We suspect the change to collect_mode is involved, but are there other factors to consider?

Sample Query

{
  "query": {
    "query_string": {
      "query": "some_ids:259352",
      "default_operator": "AND"
    }
  },
  "size": 10,
  "aggs": {
    "org_cat": {
      "significant_terms": {
        "field": "org_cat",
        "shard_size": 300,
        "min_doc_count": 10,
        "gnd": {
          "background_is_superset": false
        },
        "size": 100
      }
    },
    "keyword": {
      "significant_terms": {
        "field": "keyword",
        "shard_size": 300,
        "min_doc_count": 10,
        "gnd": {
          "background_is_superset": false
        },
        "size": 100
      }
    },
    "domain": {
      "significant_terms": {
        "field": "domain",
        "shard_size": 300,
        "min_doc_count": 10,
        "gnd": {
          "background_is_superset": false
        },
        "size": 100
      }
    },
    "ua_name": {
      "significant_terms": {
        "field": "ua_name",
        "shard_size": 300,
        "min_doc_count": 10,
        "gnd": {
          "background_is_superset": false
        },
        "size": 100
      }
    },
    "ip": {
      "significant_terms": {
        "field": "ip_addr",
        "shard_size": 300,
        "min_doc_count": 10,
        "gnd": {
          "background_is_superset": false
        },
        "size": 100
      }
    },
    "org_name": {
      "significant_terms": {
        "field": "org_name",
        "shard_size": 300,
        "min_doc_count": 10,
        "gnd": {
          "background_is_superset": false
        },
        "size": 100
      }
    },
    "customer_ids": {
      "significant_terms": {
        "field": "customer_ids",
        "shard_size": 300,
        "min_doc_count": 10,
        "gnd": {
          "background_is_superset": false
        },
        "size": 100
      }
    },
    "pref_code": {
      "significant_terms": {
        "field": "pref_code",
        "shard_size": 300,
        "min_doc_count": 10,
        "gnd": {
          "background_is_superset": false
        },
        "size": 100
      }
    },
    "org_emp_code": {
      "significant_terms": {
        "field": "org_emp_code",
        "shard_size": 300,
        "min_doc_count": 10,
        "gnd": {
          "background_is_superset": false
        },
        "size": 100
      }
    },
    "ua_os": {
      "significant_terms": {
        "field": "ua_os",
        "shard_size": 300,
        "min_doc_count": 10,
        "gnd": {
          "background_is_superset": false
        },
        "size": 100
      }
    },
    "org_gross_code": {
      "significant_terms": {
        "field": "org_gross_code",
        "shard_size": 300,
        "min_doc_count": 10,
        "gnd": {
          "background_is_superset": false
        },
        "size": 100
      }
    },
    "some_ids": {
      "significant_terms": {
        "field": "segment_ids",
        "shard_size": 300,
        "min_doc_count": 10,
        "gnd": {
          "background_is_superset": false
        },
        "size": 100
      }
    }
  }
}

2.4 Response - 16s

{
    "took": 16564,
    "timed_out": false,
    "_shards": {
        "total": 480,
        "successful": 480,
        "failed": 0
    },
    "hits": {
        "total": 2965312,
        "max_score": 5.930258,
        "hits": []
    },
    "aggregations": {
        "ua_name": {},
        "org_name.raw": {},
        "segment_ids": {},
        "org_gross_code": {},
        "domain": {},
        ....
    }
}

5.4 Response - 1.5m

{
    "took": 91375,
    "timed_out": false,
    "_shards": {
        "total": 480,
        "successful": 480,
        "failed": 0
    },
    "hits": {
        "total": 2948700,
        "max_score": 1,
        "hits": []
    },
    "aggregations": {
        "ua_name": {},
        "org_name.raw": {},
        "segment_ids": {},
        "org_gross_code": {},
        "domain": {},
        ....
    }
}

Other info

Try narrowing down the response times for the individual fields to find the culprit.
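
For example (a sketch that simply reuses the query above with a single aggregation), you could run one significant_terms agg at a time and compare the took values per field:

{
  "query": {
    "query_string": {
      "query": "some_ids:259352",
      "default_operator": "AND"
    }
  },
  "size": 0,
  "aggs": {
    "ip": {
      "significant_terms": {
        "field": "ip_addr",
        "shard_size": 300,
        "min_doc_count": 10,
        "gnd": { "background_is_superset": false },
        "size": 100
      }
    }
  }
}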

My expectation is that the fields of type ip are the problem. In more recent versions of Lucene, the IP field does not hold a count that we can directly look up for the background doc frequency (DF). The work-around used internally to find the frequency for IP values is to effectively run a query that examines the postings list for ip X and counts the number of docs, similar to how we look up DFs when you apply a custom background_filter. This is obviously slower.

The workaround for you would be to also index the ip fields as type keyword, which would retain the ability to do fast DF lookups, e.g.

      "remote_host_ip" : {
        "type" : "ip",
        "fields" : {
          "asKeyword" : {
            "type" : "keyword"
          }
        }
      },

... then run the significant_terms agg on the remote_host_ip.asKeyword field.
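
A sketch of the rewritten aggregation, assuming the multi-field mapping above (an ip field named remote_host_ip with an asKeyword sub-field):

{
  "aggs": {
    "ip": {
      "significant_terms": {
        "field": "remote_host_ip.asKeyword",
        "shard_size": 300,
        "min_doc_count": 10,
        "gnd": { "background_is_superset": false },
        "size": 100
      }
    }
  }
}

Note that adding the sub-field to the mapping only affects newly indexed documents, so existing data would need to be reindexed before the keyword sub-field can be aggregated on.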

Thanks Mark!

Your advice was effective.
Does this also apply to numeric fields (integer, long, ...)?

Also, I find that a large shard_size causes a crash (due to heavy GC) on 5.4, but this does not happen on 2.4.5.
Is this expected behavior?

  • shard_size < 1000: 5.4 is faster
  • 1000 ≤ shard_size < 20000: 2.4 is faster
  • shard_size > 20000: 5.4 crashes (heavy GC)

Query

{
    "query": {
        "query_string": {
            "query": "some_ids:10000",
            "default_operator": "AND"
        }
    },
    "aggs": {
        "keywords.raw": {
            "significant_terms": {
                "field": "keywords.raw",
                "shard_size": 100000,
                "min_doc_count": 10,
                "gnd": {
                    "background_is_superset": false
                },
                "size": 100000
            }
        }
    },
    "size": 0
}

Unfortunately, yes. The related code change to the frequency lookups was in "Make significant terms work on fields that are indexed with points" by jpountz (Pull Request #18031, elastic/elasticsearch on GitHub).
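
For numeric fields, the same multi-field workaround as for the ip field could be applied (a hypothetical mapping sketch; the field name is invented for illustration):

      "some_numeric_id" : {
        "type" : "long",
        "fields" : {
          "asKeyword" : {
            "type" : "keyword"
          }
        }
      },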

By treating the numbers/ips as 'keyword' fields, the internal agg working state required to hold large sets of match candidates may also be less efficient in terms of RAM than in previous versions.

Is that caused only by treating the numbers/ips as 'keyword' fields?

I think the Significant Terms Aggregation on 'keyword' fields (not_analyzed strings in 2.4) may also be less efficient in terms of RAM than in previous versions.
Isn't that right?

One other thing to look at: the various settings for 'execution_hint' have a part to play in how interim state is held.
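
For example (a sketch only, not a tuned recommendation for this data set), the hint can be set per aggregation, e.g. with values such as map or global_ordinals:

{
  "aggs": {
    "keywords.raw": {
      "significant_terms": {
        "field": "keywords.raw",
        "execution_hint": "map",
        "shard_size": 1000,
        "min_doc_count": 10,
        "size": 100
      }
    }
  }
}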

Thanks Mark.
I will check it!

Let's go for a drink someday :)

Always good to hear what people are using significant terms to find :)
