Terms Aggregation on billions of documents

bevans88 · October 10, 2017, 8:55am

Hi,

I'm wondering if anyone has experience running terms aggregations with a sum sub-aggregation on a very large index (~100 billion documents). The documents themselves are relatively small (~20 fields), but some of these fields can have high cardinality (~40 million distinct values).

The requirement is a query response time of around 5 seconds. The worst-case query for the aggregation is a match_all.

As far as I'm aware, reaching this scale would require a very large cluster, and even then it might not achieve the required performance.
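
To put rough numbers on that concern, here is a back-of-envelope sketch in Python. Only the 100 billion document count comes from the post. The documents-per-shard and shards-per-node figures are illustrative assumptions, not measurements or recommendations, and `estimate_cluster` is a hypothetical helper.

```python
def estimate_cluster(total_docs, docs_per_shard, shards_per_node):
    """Return (primary_shards, data_nodes) under the given assumptions."""
    # Integer ceiling division avoids floating-point rounding on large counts.
    shards = -(-total_docs // docs_per_shard)
    nodes = -(-shards // shards_per_node)
    return shards, nodes


TOTAL_DOCS = 100_000_000_000          # from the post (~100 billion documents)
ASSUMED_DOCS_PER_SHARD = 200_000_000  # assumption: pick your own target
ASSUMED_SHARDS_PER_NODE = 10          # assumption: depends on hardware

shards, nodes = estimate_cluster(
    TOTAL_DOCS, ASSUMED_DOCS_PER_SHARD, ASSUMED_SHARDS_PER_NODE
)
print(f"primary shards: {shards}, data nodes: {nodes}")
```

Under these assumptions a match_all aggregation has to visit every one of those shards, so the per-shard aggregation time and the final merge both have to fit inside the 5 second budget.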

Does anyone have any experience/thoughts on this?

Thanks,

Brent

bevans88 · October 11, 2017, 8:02am

To give some additional information, the mapping for the documents is currently:

{
  "_all" : { "enabled": false },
  "properties" : {
    "field1" : {
      "type" : "text",
      "fielddata": true,
      "fields" : {
        "keyword" : {
          "type" : "keyword",
          "eager_global_ordinals": true
        }
      }
    },
    "field2" : {
      "type" : "text",
      "fielddata": true,
      "fields" : {
        "keyword" : {
          "type" : "keyword",
          "eager_global_ordinals": true
        }
      }
    },
    "field3" : {
      "type" : "text",
      "fielddata": true,
      "fields" : {
        "keyword" : {
          "type" : "keyword",
          "eager_global_ordinals": true
        }
      }
    },
    "field4" : {
      "type" : "text",
      "fielddata": true,
      "fields" : {
        "keyword" : {
          "type" : "keyword",
          "eager_global_ordinals": true
        }
      }
    },
    "field5" : {
      "type" : "text",
      "fielddata": true,
      "fields" : {
        "keyword" : {
          "type" : "keyword",
          "eager_global_ordinals": true
        }
      }
    },
    "field6" : {
      "type" : "date",
      "format" : "basic_date_time_no_millis"
    },
    "field7" : {
      "type" : "date",
      "format" : "basic_date_time_no_millis"
    },
    "field8" : {
      "type" : "date",
      "format" : "basic_date_time_no_millis"
    },
    "field9" : {
      "type": "text"
    },
    "field10" : {
      "type": "text"
    },
    "field11" : {
      "type" : "text",
      "fielddata": true,
      "fields" : {
        "keyword" : {
          "type" : "keyword",
          "eager_global_ordinals": true
        }
      }
    },
    "field12" : {
      "type" : "long"
    }
  }
}

And an example aggregation would be:

{
  "size" : 0,
  "query" : {
    "match_all" : { }
  },
  "_source" : false,
  "aggregations" : {
    "by_terms" : {
      "terms" : {
        "field" : "field11.keyword",
        "size" : 10000,
        "shard_size" : 1000000,
        "min_doc_count" : 1,
        "shard_min_doc_count" : 0,
        "show_term_doc_count_error" : false,
        "order" : [
          {
            "_count" : "desc"
          },
          {
            "_term" : "asc"
          }
        ]
      },
      "aggregations" : {
        "totalSize" : {
          "sum" : {
            "field" : "size"
          }
        }
      }
    }
  }
}
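
One setting in this request that is worth quantifying is shard_size. Each shard can return up to shard_size buckets to the coordinating node, which merges them before keeping the top size entries. The Python sketch below only does that arithmetic. The shard count of 500 is an illustrative assumption, and `max_buckets_to_merge` is a hypothetical helper rather than anything from Elasticsearch.

```python
def max_buckets_to_merge(num_shards, shard_size, field_cardinality):
    """Upper bound on buckets the coordinating node may receive."""
    # A shard cannot return more buckets than there are distinct terms,
    # so the field's cardinality caps the per-shard contribution.
    per_shard = min(shard_size, field_cardinality)
    return num_shards * per_shard


SHARD_SIZE = 1_000_000           # from the example aggregation
FIELD_CARDINALITY = 40_000_000   # high-cardinality figure from the post
ASSUMED_NUM_SHARDS = 500         # assumption: not stated in the post

worst_case = max_buckets_to_merge(
    ASSUMED_NUM_SHARDS, SHARD_SIZE, FIELD_CARDINALITY
)
print(f"buckets to merge (upper bound): {worst_case:,}")
```

Each of those buckets also carries its sum sub-aggregation value, so the merge cost grows with both the number of shards and shard_size, independently of the 10000 buckets that are finally returned.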

system · November 8, 2017, 8:02am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
