How to tune terms aggregation performance

Hi there

We have a problem with our ES terms aggregation query: it takes 10-12s to execute.

Here is our cluster information:

  1. We have 3 client nodes, 3 master nodes, 5 data nodes, and 1 ingest node.
  2. Each node has 20 cores (40 vCores) and 64GB of memory; we assigned 31GB to the heap.
  3. The index has 5 shards and 1 replica.
  4. The index size is around 340GB (primary: 165GB) and the document count is around 934.4 million.
  5. The ES version is 6.5.4.

The index holds the tag information for documents.

{
  "mapping": {
    "_doc": {
      "_field_names": {
        "enabled": false
      },
      "properties": {
        "blogId": {
          "type": "keyword"
        },
        "tag": {
          "type": "keyword",
          "boost": 30,
          "eager_global_ordinals": true,
          "copy_to": [
            "tagNgram"
          ]
        },
        "tagNgram": {
          "type": "text",
          "analyzer": "ngram_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

The data looks like this:

{
  "blogId": "00001",
  "tag": "APPLE"
},
{
  "blogId": "00001",
  "tag": "BANANA"
},
{
  "blogId": "00001",
  "tag": "ORANGE"
},
{
  "blogId": "00002",
  "tag": "APPLE"
},
{
  "blogId": "00003",
  "tag": "PEACH"
},
{
  "blogId": "00003",
  "tag": "BANANA"
}

Here is my query:

GET /tag_search_index/_doc/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "match": { "tagNgram": "A" }
      },
      "must_not": {
        "term": { "tag": "A" }
      }
    }
  },
  "aggs": {
    "most_popular": {
      "terms": {
        "field": "tag",
        "size": 10
      }
    },
    "count":{
      "cardinality": {
        "field": "tag"
      }
    }
  }
}

The response is:

{
  "took" : 12380,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 64946917,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "count" : {
      "value" : 7202919
    },
    "most_popular" : {
      "doc_count_error_upper_bound" : 46346,
      "sum_other_doc_count" : 61546148,
      "buckets" : [
         // ...
      ]
    }
  }
}

Searches with more than two characters are faster, but a search with a single letter takes about 10 seconds.

Is there any way to make it faster?

Thanks in advance.

You could try setting the “collect mode” on the terms aggregation to breadth-first.
I’m not sure if that’s automatically picked for this particular request, but enabling it means we compute the top 10 tags first before computing their child cardinality aggs (as opposed to calculating all cardinalities and then pruning to the top 10).
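For reference, a minimal sketch of what that setting looks like on a terms aggregation that does have a child aggregation. The "blog_count" sub-aggregation is hypothetical and only there to give the terms aggregation a child; it is not part of the query posted above.

```json
# Sketch only: "blog_count" is a hypothetical child aggregation, not from the original query.
GET /tag_search_index/_doc/_search
{
  "size": 0,
  "aggs": {
    "most_popular": {
      "terms": {
        "field": "tag",
        "size": 10,
        "collect_mode": "breadth_first"
      },
      "aggs": {
        "blog_count": {
          "cardinality": { "field": "blogId" }
        }
      }
    }
  }
}
```

With "breadth_first", the child aggregation should only be computed for the buckets that survive into the top 10, which matters mostly when the field has many unique values.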

In the query I'm calling, the cardinality agg is not a child of the terms agg.

"aggs": {
  "most_popular": {
    "terms": {
      "field": "tag",
      "size": 10
    }
  },
  "count": {
    "cardinality": {
      "field": "tag"
    }
  }
}

I want to count all the tags that contain the character A, and separately get the top 10 tags.

My understanding is that collect mode only applies to aggregations in a parent-child relationship. Am I right?

(Please understand that I am not good at English.)

Ah. My bad.

Do blogs have multiple tags? If so, then your counts and top tens could consist of things that co-occur with A* tags rather than just being A* tags.

Yes, each blog has multiple tags.

00001 blog : APPLE, BANANA, ORANGE
00002 blog : APPLE
00003 blog : PEACH, BANANA
...

What does "co-occur with A* tags" mean?

I want to get the following results:

  1. The count of all tags that contain the character A.
  2. The top N tags that contain the character A.

So this is the query that I want.

GET /tag_search_index/_doc/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "match": { "tagNgram": "A" }
      },
      "must_not": {
        "term": { "tag": "A" }
      }
    }
  },
  "aggs": {
    "most_popular": {
      "terms": {
        "field": "tag",
        "size": 10
      }
    },
    "count":{
      "cardinality": {
        "field": "tag"
      }
    }
  }
}

The result is:

"aggregations" : {
  "count" : {
    // counts of all tags that contain A characters.
    "value" : 7202919
  },
  "most_popular" : {
    "doc_count_error_upper_bound" : 46346,
    "sum_other_doc_count" : 61546148,
    "buckets" : [
      // Top N tags that contain A characters.
      {
        "key": "APPLE",
        "doc_count": 15667
      },
      {
        "key": "BANANA",
        "doc_count": 11491
      },
      // ...
    ]
  }
}

Queries serve only to filter documents - not the values that appear in those documents.

The aggregations work on all the values in the filtered documents. You can use an ‘include’ clause with a regular expression inside the ‘terms’ aggregation to consider only tags that start with A.
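A minimal sketch of that suggestion, reusing the query from earlier in the thread. The regular expressions here are assumptions chosen for illustration: "A.*" keeps only tags that start with A, while ".*A.*" would keep tags that contain A anywhere.

```json
# Sketch only: the "include" pattern is an assumption; use ".*A.*" for "contains A".
GET /tag_search_index/_doc/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "match": { "tagNgram": "A" }
      },
      "must_not": {
        "term": { "tag": "A" }
      }
    }
  },
  "aggs": {
    "most_popular": {
      "terms": {
        "field": "tag",
        "size": 10,
        "include": "A.*"
      }
    }
  }
}
```

The "include" pattern is matched against the whole tag value, so it filters the buckets rather than the documents. As far as I know, the cardinality aggregation has no equivalent "include" option, so the "count" aggregation from the original query would still count every tag value in the matching documents.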
