Elasticsearch aggregation on a million or more documents

I have an index which can grow to millions of documents over time. I need to run several aggregations over the fields of this huge number of documents. In particular, I need to count the number of unique values of a field, and that number can grow with the document count to more than a million. I know about the "cardinality" aggregation, but it doesn't give the exact count, only an approximation. So I then tried scripted_metric.

{
  "aggs": {
    "distinct_count": {
      "scripted_metric": {
        "params": {
          "fieldName": "some_field_name"
        },
        "init_script": "state.list = []",
        "map_script": "if (doc[params.fieldName].size() > 0) { state.list.add(doc[params.fieldName].value); }",
        "combine_script": "return state.list;",
        "reduce_script": "Map uniqueValueMap = new HashMap(); int count = 0; for (shardList in states) { if (shardList != null) { for (key in shardList) { if (!uniqueValueMap.containsKey(key)) { count += 1; uniqueValueMap.put(key, key); } } } } return count;"
      }
    }
  }
}

But this gives the correct value only up to about 1 million documents; when the number of documents grows larger, it returns an error.
So, what is the right way to run an aggregation like cardinality or sum on fields across more than a million documents and get an exact result rather than an approximation?
As a side note, I also tried splitting my index by time, but scripted_metric still fails once the total across all indices exceeds about 1 million documents.
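
For reference, the cardinality attempt looked roughly like the sketch below (the field name is a placeholder). Even with precision_threshold raised to its maximum of 40000, the count is still an approximation, which is why it doesn't work for me:

{
  "size": 0,
  "aggs": {
    "distinct_count": {
      "cardinality": {
        "field": "some_field_name",
        "precision_threshold": 40000
      }
    }
  }
}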

This is what we call the Fast-Accurate-Big trilemma.
You can only have two of the three.

Elasticsearch aggregations are optimised for Fast and Big, so they use a single pass over distributed data with approximation algorithms like the one in the cardinality agg. Accuracy is the trade-off.

Alternatively, you can make multiple calls to page through a composite aggregation and count the number of results in your client. This will be B+A - working with Big data and Accurate results, but not especially Fast.
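
A rough sketch of what I mean (field and agg names are placeholders) - run a composite agg at the root and page through it:

{
  "size": 0,
  "aggs": {
    "distinct_values": {
      "composite": {
        "size": 10000,
        "sources": [
          { "value": { "terms": { "field": "some_field_name" } } }
        ]
      }
    }
  }
}

Each response includes an after_key; pass it back as "after" inside the composite body on the next call and keep counting the buckets in your client until a page comes back empty. The total bucket count across all pages is the exact number of distinct values.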


But the composite aggregation doesn't work in my case, where I need sub-aggregations.

{
  "aggs": {
    "by_field1": {
      "terms": {
        "field": "field1",
        "size": 100
      },
      "aggs": {
        "by_field2": {
          "terms": {
            "field": "field2",
            "size": 25
          },
          "aggs": {
            "distinct_values": {   # this distinct value can be more than million
              "composite": {
                "size": 65535,
                "sources": [
                  {
                    "product": {
                      "terms": {
                        "field": "field3"
                      }
                    }
                  }
                ]
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}

In the "distinct_values" sub aggregation I need to return number of unique values which can be more than million.

I don’t understand. The cardinality agg doesn’t allow sub aggregations but composite does.

I suggested making multiple calls with the composite agg and its paging facilities so you could count the unique values in your client, as opposed to making a single request with the cardinality aggregation.

What is your business problem that requires 100% accuracy? Sometimes it takes so long to calculate accurate values that the result is out of date (and therefore inaccurate) by the time you get it.


Maybe I didn't make my requirement clear.

This is my query.

{
  "aggs": {
    "by_field1": {
      "terms": {
        "field": "field1",
        "size": 100
      },
      "aggs": {
        "by_field2": {
          "terms": {
            "field": "field2",
            "size": 25
          },
          "aggs": {
            "distinct_values": {   # the number of distinct values here can be more than a million
              "composite": {       # previously I used cardinality and scripted_metric aggregations here
                "size": 65535,
                "sources": [
                  {
                    "product": {
                      "terms": {
                        "field": "field3"
                      }
                    }
                  }
                ]
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}

In this query, the "distinct_values" scope should return unique number of values of field3 . Number of unique values can be millions. composite aggregation cannot be used under terms aggregation. While I tried to use composite aggregation, I got this error:

"reason": "[composite] aggregation cannot be used with a parent aggregation of type: [TermsAggregatorFactory]",

I need a suggestion on how I can get an accurate unique value count for field3 in a sub-aggregation when there may be millions of documents.

Ah, got it. The issue is that composite can't be used as a sub-agg, not that it can't have sub-aggs of its own.

That is a problem. The physics challenge is doing this computation efficiently and accurately when you have multiple distributed shards/indices.

Could you live with a policy where the cardinality agg is used to identify items of interest with mostly-accurate measures, and you then drill down on selected items to get the exact numbers - using a query for the subject of interest and a composite agg at the root to do the calculation I suggested, or even a scripted agg if the cardinality is low? There's a sketch of that drill-down below.
Again, what is the business problem you hope to solve?
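
A minimal sketch of that drill-down, assuming you have picked a field1/field2 pair of interest from the approximate results (the values are placeholders): filter to that pair, then run the composite agg at the root and page through it as described earlier, counting buckets in your client.

{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "field1": "some_field1_value" } },
        { "term": { "field2": "some_field2_value" } }
      ]
    }
  },
  "aggs": {
    "distinct_field3": {
      "composite": {
        "size": 10000,
        "sources": [
          { "product": { "terms": { "field": "field3" } } }
        ]
      }
    }
  }
}

The number of buckets returned across all pages is the exact distinct count of field3 for that field1/field2 pair.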

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.