Hi
I have a requirement where I need to aggregate over multiple fields, which can result in millions of buckets.
We have millions of records, and I need to get the number of records for each unique combination of 3 columns: FirstName, MiddleName, and LastName.
In other words, the frequency of each unique combination of FirstName, MiddleName, and LastName.
When I try to use a terms aggregation over these 3 fields, I get a too_many_buckets_exception, since the default bucket limit is 10k. I increased it to 100k and the query worked, but I don't think that's the right approach performance-wise. It works for the current sample of data, but the bucket count may grow into the millions.
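For reference, this is roughly how I raised the limit (a sketch, assuming the cluster-wide `search.max_buckets` setting is what governs this; the exact value shown is just the 100k I mentioned):

```
PUT /_cluster/settings
{
  "persistent": {
    "search.max_buckets": 100000
  }
}
```

This makes the exception go away, but it only postpones the problem rather than solving it.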
Can you please suggest a better way to achieve this?
The mappings and search query are cited below for reference.
"mappings" : {
"_meta" : {
"created_by" : "ml-file-data-visualizer"
},
"properties" : {
"FirstName" : {
"type" : "keyword"
},
"MiddleName" : {
"type" : "keyword"
},
"LastName" : {
"type" : "keyword"
}
}
}
Search Query -
GET /names/_search
{
  "size": 0,
  "aggs": {
    "First_Level": {
      "terms": {
        "field": "FirstName",
        "size": 1000,
        "min_doc_count": 1
      },
      "aggs": {
        "Second_Level": {
          "terms": {
            "field": "MiddleName",
            "size": 1000,
            "min_doc_count": 1
          },
          "aggs": {
            "Third_Level": {
              "terms": {
                "field": "LastName",
                "size": 1000,
                "min_doc_count": 1
              }
            }
          }
        }
      }
    }
  }
}
Thanks.