How to tune terms aggregation performance

kkd927 · June 10, 2019, 2:36am

Hi there

We have a problem with our ES terms aggregation query: it takes 10-12s to execute.

Here is our cluster information:

We have 3 client nodes, 3 master nodes, 5 data nodes, and 1 ingest node.
Each node has 20 cores (40 vCores) and 64GB of memory; we assigned 31GB to the heap.
The index has 5 shards and 1 replica.
The index size is around 340GB (primary: 165GB) and the document count is around 934.4m.
The ES version is 6.5.4.

The index holds the tag information for each document.

{
  "mapping": {
    "_doc": {
      "_field_names": {
        "enabled": false
      },
      "properties": {
        "blogId": {
          "type": "keyword"
        },
        "tag": {
          "type": "keyword",
          "boost": 30,
          "eager_global_ordinals": true,
          "copy_to": [
            "tagNgram"
          ]
        },
        "tagNgram": {
          "type": "text",
          "analyzer": "ngram_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

The data looks like this:

{
  "blogId": "00001",
  "tag": "APPLE"
},
{
  "blogId": "00001",
  "tag": "BANANA"
},
{
  "blogId": "00001",
  "tag": "ORANGE"
},
{
  "blogId": "00002",
  "tag": "APPLE"
},
{
  "blogId": "00003",
  "tag": "PEACH"
},
{
  "blogId": "00003",
  "tag": "BANANA"
}

here is my query

GET /tag_search_index/_doc/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "match": { "tagNgram": "A" }
      },
      "must_not": {
        "term": { "tag": "A" }
      }
    }
  },
  "aggs": {
    "most_popular": {
      "terms": {
        "field": "tag",
        "size": 10
      }
    },
    "count":{
      "cardinality": {
        "field": "tag"
      }
    }
  }
}

response is

{
  "took" : 12380,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 64946917,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "count" : {
      "value" : 7202919
    },
    "most_popular" : {
      "doc_count_error_upper_bound" : 46346,
      "sum_other_doc_count" : 61546148,
      "buckets" : [
         // ...
      ]
    }
  }
}

Searching with more than two characters is fast, but searching with a single letter takes about 10 seconds.

Is there any way to make it faster?

thanks in advance.

Mark_Harwood · June 10, 2019, 6:43am

You could try setting the “collect mode” on the terms aggregation to breadth-first.
I’m not sure if that’s automatically picked for this particular request, but enabling it means we compute the top 10 tags first before computing their child cardinality aggs (as opposed to calculating all cardinalities and then pruning to the top 10).
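
A minimal sketch of where that setting goes, assuming a terms aggregation that does have a child aggregation. The sub-aggregation name `unique_blogs` and the choice of `blogId` as its field are hypothetical, picked only for illustration. The request body is built as a Python dict and printed as JSON so it can be pasted into a search call.

```python
import json

# Terms aggregation with a child cardinality aggregation.
# "collect_mode": "breadth_first" asks Elasticsearch to pick the top buckets
# first and only then compute sub-aggregations for the surviving buckets.
# "unique_blogs" and its use of "blogId" are illustrative assumptions.
request_body = {
    "size": 0,
    "aggs": {
        "most_popular": {
            "terms": {
                "field": "tag",
                "size": 10,
                "collect_mode": "breadth_first",
            },
            "aggs": {
                "unique_blogs": {"cardinality": {"field": "blogId"}}
            },
        }
    },
}

print(json.dumps(request_body, indent=2))
```

If the terms aggregation has no child aggregation, there is nothing to defer, so the setting should make no difference.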

kkd927 · June 10, 2019, 7:04am

In the query I'm calling, the cardinality agg is not a child of the terms agg.

"aggs": {
    "most_popular": {
      "terms": {
        "field": "tag",
        "size": 10
      }
    },
    "count":{
      "cardinality": {
        "field": "tag"
      }
    }
  }

I want to count all tags that contain the character A, and separately get the top 10 of those tags.

As I understand it, collect mode only applies to aggregations in a parent-child relationship. Is that right?

(Please understand that I am not good at English.)

Mark_Harwood · June 10, 2019, 8:13am

Ah. My bad.

Do blogs have multiple tags? If so, your counts and top tens could consist of things that co-occur with A* tags rather than just being A* tags.

kkd927 · June 10, 2019, 9:39am

Yes, each blog has multiple tags.

00001 blog : APPLE, BANANA, ORANGE
00002 blog : APPLE
00003 blog : PEACH, BANANA
...

What does "co-occur with A* tags" mean?

I want to get the following results:

The count of all tags that contain the character A.
The top N tags that contain the character A.

So this is the query that I want.

GET /tag_search_index/_doc/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "match": { "tagNgram": "A" }
      },
      "must_not": {
        "term": { "tag": "A" }
      }
    }
  },
  "aggs": {
    "most_popular": {
      "terms": {
        "field": "tag",
        "size": 10
      }
    },
    "count":{
      "cardinality": {
        "field": "tag"
      }
    }
  }
}

result is

"aggregations" : {
    "count" : {
      // counts of all tags that contain A characters.
      "value" : 7202919
    },
    "most_popular" : {
      "doc_count_error_upper_bound" : 46346,
      "sum_other_doc_count" : 61546148,
      "buckets" : [
          // Top N tags that contain A characters.
         {
           "key": "APPLE",
           "doc_count": "15667"
         },
         {
           "key": "BANANA",
           "doc_count": "11491"
         },
         // ...
      ]
    }
}

Mark_Harwood · June 10, 2019, 4:14pm

Queries serve only to filter documents - not the values that appear in those documents.

The aggregations work on all the values in the filtered documents. You can use an ‘include’ clause with a regular expression inside the ‘terms’ aggregation to consider only tags that start with A.
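
A small sketch of the difference, simulated in plain Python rather than run against a cluster. The documents below are hypothetical and carry several tags each (unlike the one-tag-per-document sample earlier in the thread). The regular expression `A.*` is an assumption standing in for "starts with A"; `.*A.*` would be the analogue for "contains A". `re.fullmatch` is used because an `include` pattern is matched against the whole term.

```python
import json
import re
from collections import Counter

# Hypothetical multi-valued documents (an assumption for illustration).
docs = [
    {"blogId": "00001", "tag": ["APPLE", "BANANA", "ORANGE"]},
    {"blogId": "00002", "tag": ["APPLE"]},
    {"blogId": "00003", "tag": ["PEACH", "BANANA"]},
]


def starts_with_a(tag):
    """Return True when the whole tag matches the pattern A.*"""
    return re.fullmatch(r"A.*", tag) is not None


# Step 1: the query keeps whole documents that have at least one matching tag.
matched_docs = [d for d in docs if any(starts_with_a(t) for t in d["tag"])]

# Step 2: a plain terms aggregation counts every tag value in those documents,
# including tags that merely co-occur with a matching tag.
without_include = Counter(t for d in matched_docs for t in d["tag"])

# Step 3: an "include" pattern restricts the buckets to matching tag values.
with_include = Counter(
    t for d in matched_docs for t in d["tag"] if starts_with_a(t)
)

print(dict(without_include))
print(dict(with_include))

# The corresponding terms aggregation, using the same pattern as "include".
aggs = {
    "most_popular": {
        "terms": {"field": "tag", "size": 10, "include": "A.*"}
    }
}
print(json.dumps(aggs, indent=2))
```

In this toy data the unfiltered counts include BANANA and ORANGE even though neither starts with A, while the `include` version keeps only APPLE.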

system · July 8, 2019, 4:21pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
