Aggregation on multiple fields with millions of buckets

Hi

I have a requirement where I need to aggregate over multiple fields, which can result in millions of buckets.

We have data with millions of records, and I need to get the number of records for each unique combination of 3 fields - FirstName, MiddleName, LastName.

Or you can say the frequency for each unique combination of FirstName, MiddleName and LastName.

When I try to use a terms aggregation over these 3 fields, I get a too_many_buckets_exception, as the default bucket limit is 10,000. I increased it to 100,000 and it worked, but I don't think that's the right way performance-wise. It worked for the current sample of data, but the bucket count may grow to millions.

Can you please suggest a way to achieve this?

The mappings and search query are cited below for reference.

"mappings" : {
  "_meta" : {
    "created_by" : "ml-file-data-visualizer"
  },
  "properties" : {
    "FirstName" : {
      "type" : "keyword"
    },
    "MiddleName" : {
      "type" : "keyword"
    },
    "LastName" : {
      "type" : "keyword"
    }
  }
}

Search Query -

GET /names/_search
{
  "size": 0,
  "aggs": {
    "First_Level": {
      "terms": {
        "field": "FirstName",
        "size": 1000,
        "min_doc_count": 1
      },
      "aggs": {
        "Second_Level": {
          "terms": {
            "field": "MiddleName",
            "size": "1000",
            "min_doc_count": 1
          },
          "aggs": {
            "Third_Level": {
              "terms": {
                "field": "LastName",
                "size": "1000",
                "min_doc_count": 1
              }
            }
          }
        }
      }
    }
  }
}

Thanks.

What is the lifecycle of a document? Can they be updated or deleted?

Is this something you need to calculate frequently?

Hi Christian

This index is just created once, for the purpose of calculating the frequency based on multiple fields.
No updates/deletes will be performed on this index. And once we are able to get the desired output, this index will be permanently dropped.

Thanks

Look into Transforms. It uses composite aggregations under the covers but you don't run into bucket size problems.
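
For reference, a composite aggregation can also be run directly to page through all combinations without hitting the bucket limit. A minimal sketch against the mappings in the question (the source key names first/middle/last are made up):

```
GET /names/_search
{
  "size": 0,
  "aggs": {
    "name_combinations": {
      "composite": {
        "size": 1000,
        "sources": [
          { "first":  { "terms": { "field": "FirstName" } } },
          { "middle": { "terms": { "field": "MiddleName" } } },
          { "last":   { "terms": { "field": "LastName" } } }
        ]
      }
    }
  }
}
```

Each response includes an `after_key`; pass it back as `"after"` inside the `composite` block of the next request to fetch the next page of 1000 buckets until the response comes back empty.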

1 Like

Thanks for the update, but we can't use Transforms in production as it's still in beta.

Can you please suggest a way to add a new field to an index based on an existing field?

I have an index with 10 million names.
I want to add a new field which is a substring of the existing name field.

It's also fine if I create a new index for this.

Thanks

Just FYI - Transforms is GA in v7.7, which should be out very soon.
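
Once on 7.7, a continuous or batch transform can pivot on the three fields directly. A sketch, assuming the index name from the question and a made-up transform id and destination index:

```
PUT _transform/name_counts
{
  "source": { "index": "names" },
  "dest":   { "index": "name_counts" },
  "pivot": {
    "group_by": {
      "first":  { "terms": { "field": "FirstName" } },
      "middle": { "terms": { "field": "MiddleName" } },
      "last":   { "terms": { "field": "LastName" } }
    },
    "aggregations": {
      "count": { "value_count": { "field": "FirstName" } }
    }
  }
}
```

After creating it, start it with `POST _transform/name_counts/_start`; the destination index will then hold one document per unique name combination with its count.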

An alternative approach is to reindex the original index into a new index and use a Painless script to create a new field from existing fields. For example:

POST _reindex
{
  "source": {
    "index": "names"
  },
  "dest": {
    "index": "full_names"
    
  },
  "script": {
    "source": "ctx._source.FullName= ctx._source.FirstName + '  ' +  ctx._source.LastName",
    "lang": "painless"
  }
}
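
For the substring case mentioned above, the same reindex pattern applies. A sketch (the field name ShortName, the destination index, and the 3-character length are made up for illustration):

```
POST _reindex
{
  "source": { "index": "names" },
  "dest":   { "index": "short_names" },
  "script": {
    "source": "ctx._source.ShortName = ctx._source.FirstName.substring(0, (int) Math.min(3, ctx._source.FirstName.length()))",
    "lang": "painless"
  }
}
```

The `Math.min` guard avoids an out-of-bounds error when a name is shorter than the requested substring length.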

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.