I have an index that can grow to millions of documents over time, and I need to run several aggregations on their fields. In particular, I need to count the number of unique values of a field, and that count can itself grow past a million as documents accumulate. I know about the "cardinality" aggregation, but it doesn't give the exact count, only an approximation. So I tried scripted_metric instead.
But that gives me the right value only up to about 1 million documents. Once the document count grows beyond that, it throws an error.
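For reference, my scripted_metric looked roughly like this (index and field names are illustrative). Collecting every distinct value into an in-memory set on each shard is presumably what blows up once the totals get large:

```json
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "exact_distinct": {
      "scripted_metric": {
        "init_script": "state.seen = new HashSet()",
        "map_script": "state.seen.add(doc['my_field'].value)",
        "combine_script": "return state.seen",
        "reduce_script": "Set all = new HashSet(); for (s in states) { if (s != null) { all.addAll(s); } } return all.size();"
      }
    }
  }
}
```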
So what is the right way to run an aggregation like cardinality or sum over millions of documents and get the exact result, not an approximation?
As a side note, I also tried splitting my index by time, but scripted_metric still errors once the total across all indices exceeds 1 million documents.
This is what we call the Fast-Accurate-Big trilemma.
You can only have two of the three.
Elasticsearch aggregations are optimised for Fast and Big, so they make a single pass over distributed data using approximation algorithms like the one behind the cardinality agg. Accuracy is the trade-off.
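The cardinality agg does expose a `precision_threshold` setting that trades memory for accuracy, but it is capped at 40,000, so counts above that are always approximate. A sketch (field name illustrative):

```json
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "distinct_count": {
      "cardinality": {
        "field": "my_field",
        "precision_threshold": 40000
      }
    }
  }
}
```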
Alternatively, you can make multiple calls to page through a composite aggregation and count the buckets in your client. That is BA - working with Big data and Accurate results, but not especially Fast.
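A sketch of that approach (field name illustrative): page using the `after` key and tally the buckets client-side until the response no longer returns an `after_key`:

```json
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "unique_vals": {
      "composite": {
        "size": 10000,
        "sources": [
          { "val": { "terms": { "field": "my_field" } } }
        ]
      }
    }
  }
}
```

Each response includes `aggregations.unique_vals.after_key`; pass it back as `"after": { ... }` in the next request. The total number of buckets across all pages is your exact distinct count.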
I don’t understand. The cardinality agg doesn’t allow sub aggregations but composite does.
I suggested making multiple calls with the composite agg and its paging facilities, so you can count the unique values in your client, as opposed to making a single request with the cardinality aggregation.
What is the business problem that requires 100% accuracy? Sometimes it takes so long to calculate accurate values that the result is out of date (and therefore inaccurate) by the time you get it.
In this query (sketched at the end of this post), the "distinct_values" aggregation should return the number of unique values of field3, and that number can itself run into the millions. A composite aggregation cannot be used under a terms aggregation - when I tried it, I got this error:
"reason": "[composite] aggregation cannot be used with a parent aggregation of type: [TermsAggregatorFactory]",
I need a suggestion for how to get an accurate unique-value count for field3 inside a sub-aggregation when there may be millions of documents.
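The query is shaped roughly like this (the parent grouping field is illustrative; field3 is real):

```json
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "by_group": {
      "terms": { "field": "group_field" },
      "aggs": {
        "distinct_values": {
          "cardinality": { "field": "field3" }
        }
      }
    }
  }
}
```

Swapping the cardinality sub-agg for a composite agg is what produces the error above.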
Ah, got it. Composite can't be used as a sub-agg, rather than being unable to have sub-aggs of its own.
That is a problem. The physics challenge is trying to do this computation efficiently and accurately if you have multiple distributed shards/indexes.
Could you live with a policy where the cardinality agg is used to identify candidates using mostly-accurate measures, and you then drill down on selected items of interest to get exact numbers - using a query scoped to the subject of interest and a composite agg at the root to do the calculation I suggested, or even a scripted agg if the cardinality is low?
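The drill-down step might look like this (names illustrative): the query narrows the data to one item of interest, which lets the composite agg sit at the root and be paged as before:

```json
POST my-index/_search
{
  "size": 0,
  "query": {
    "term": { "group_field": "item_of_interest" }
  },
  "aggs": {
    "unique_field3": {
      "composite": {
        "size": 10000,
        "sources": [
          { "val": { "terms": { "field": "field3" } } }
        ]
      }
    }
  }
}
```

The bucket count across all pages is then the exact distinct count for that one item.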
Again, what is the business problem you hope to solve?