I would like to run a significant terms aggregation across a lot of data. I currently have about 1TB of data in roughly 5 million top-level documents, each of which has a few hundred nested documents. The field I want to run the significant terms aggregation on lives in the nested documents, and the dataset will continue to grow.
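To make the shape concrete, the mapping looks roughly like the sketch below. `events` and `events.category` are placeholder names, not my real fields; I've also assumed the target field is a `keyword` inside the nested documents, which is what significant terms normally needs.

```json
{
  "mappings": {
    "properties": {
      "events": {
        "type": "nested",
        "properties": {
          "category": { "type": "keyword" }
        }
      }
    }
  }
}
```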
At the moment the significant terms aggregation is timing out against the AWS Elasticsearch Service hard limit of 60 seconds, although I can see the underlying task continuing to run for about 90 seconds. This happens even when I use partitions. I don't mind the aggregation taking a long time (as long as my cluster stays available for other queries), since it will only run a few times per day and is not user-facing, but it would provide a lot of value. I am also open to moving off AWS Elasticsearch Service, or increasing the RAM on my machines, if that is what it takes.
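For reference, each partitioned request is shaped roughly like the sketch below, again with placeholder names and an arbitrary partition count, and assuming the partition syntax from the terms aggregation's `include` clause applies to significant terms the same way. I issue one request per partition, varying `partition` from 0 to `num_partitions - 1`:

```json
{
  "size": 0,
  "aggs": {
    "by_event": {
      "nested": { "path": "events" },
      "aggs": {
        "significant_categories": {
          "significant_terms": {
            "field": "events.category",
            "include": { "partition": 0, "num_partitions": 20 }
          }
        }
      }
    }
  }
}
```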
Mostly I am wondering if this is just a stupid thing to attempt. If it's not stupid, what do I likely need to do to make it work?