Sub aggregations on aggregations with 'limited' results (e.g. terms)

Ollie · July 22, 2014, 10:57am

Hi,

I have a question about sub aggregations. We're using a number of terms
aggregations on some high cardinality fields, returning the top 50 results
(as set using size) in each case. We also have a cardinality
sub-aggregation on each of the terms aggregations to get the number of
unique users (a separate field) for each term returned.

We are wondering if the cardinality aggregation is executed for every
possible term found by the terms aggregation, or only the top 50 terms? We
are seeing very high memory usage and getting out of memory errors when
running this, and it's not clear from the documentation what's going on
under the hood. A cardinality aggregation on every single possible term
would go some way towards explaining things.

Is there a more efficient way of running these queries?

Thanks in advance,

Ollie

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/02eb67d1-73a1-4a7e-8e76-d4c48525360e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Mark_Harwood_2 · July 22, 2014, 12:51pm

Hi Ollie,

In the next release there is a new option to cater for this scenario. We
introduce a "collect_mode" that allows a new "breadth_first" setting on
terms aggregations which explores all of the buckets at that level and then
prunes to the top N (in your case, 50) before flowing down matches to the
child aggregations. The default mode of operation is the current
"depth_first" approach which can produce many buckets in rare cases like
the one you have encountered.
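
For anyone reading along, here is a rough sketch of what such a request body might look like, built as a plain Python dict. The "collect_mode" and "breadth_first" names come from the description above; the field names ("tag", "user_id"), the aggregation names and the helper function are made up for illustration, and the exact body may differ between Elasticsearch versions.

```python
from typing import Any, Dict


def terms_with_unique_users(
    field: str,
    user_field: str,
    size: int = 50,
    collect_mode: str = "breadth_first",
) -> Dict[str, Any]:
    """Build a search body: top-N terms with a cardinality sub-aggregation.

    With collect_mode="breadth_first" the terms level is pruned to the
    top `size` buckets before documents flow down to the child aggregation.
    """
    return {
        "size": 0,  # only the aggregations are wanted, not the hits
        "aggs": {
            "top_terms": {  # hypothetical aggregation name
                "terms": {
                    "field": field,
                    "size": size,
                    "collect_mode": collect_mode,
                },
                "aggs": {
                    "unique_users": {  # hypothetical aggregation name
                        "cardinality": {"field": user_field}
                    }
                },
            }
        },
    }


# Hypothetical field names, for illustration only.
body = terms_with_unique_users("tag", "user_id")
```

The resulting dict can be serialised with json.dumps and sent as the body of a search request by whichever client you already use.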

In the interim, you can do this as two requests from your client. The first
gets the top 50 top-level terms; the second is a query filtered by those
terms that fetches the child aggs.
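
A minimal sketch of that two-request workaround, under a few assumptions: `search` is a hypothetical callable that takes a request body and returns the parsed JSON response (wrap your own client in it), the field is single-valued, and the `bool`/`filter` query syntax is the one used by recent Elasticsearch versions (1.x releases used a different filtered-query form). The aggregation names are also made up.

```python
from typing import Any, Callable, Dict, List

# Hypothetical transport: takes a request body, returns the parsed response.
Search = Callable[[Dict[str, Any]], Dict[str, Any]]


def top_terms_body(field: str, size: int = 50) -> Dict[str, Any]:
    """First request: only the top-N terms, no sub-aggregations."""
    return {
        "size": 0,
        "aggs": {"top_terms": {"terms": {"field": field, "size": size}}},
    }


def unique_users_body(
    field: str, user_field: str, terms: List[Any]
) -> Dict[str, Any]:
    """Second request: restrict documents to the chosen terms, then count users."""
    return {
        "size": 0,
        # Assumes a single-valued field, so no other terms can appear.
        "query": {"bool": {"filter": [{"terms": {field: terms}}]}},
        "aggs": {
            "top_terms": {
                "terms": {"field": field, "size": len(terms)},
                "aggs": {
                    "unique_users": {"cardinality": {"field": user_field}}
                },
            }
        },
    }


def unique_users_for_top_terms(
    search: Search, field: str, user_field: str, size: int = 50
) -> Dict[Any, int]:
    """Run both requests and map each top term to its unique-user count."""
    first = search(top_terms_body(field, size))
    terms = [b["key"] for b in first["aggregations"]["top_terms"]["buckets"]]
    if not terms:
        return {}
    second = search(unique_users_body(field, user_field, terms))
    buckets = second["aggregations"]["top_terms"]["buckets"]
    return {b["key"]: b["unique_users"]["value"] for b in buckets}
```

The point of the split is that the cardinality aggregation in the second request only ever sees documents for the terms already selected, so it never has to build counters for every term in the field.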

Cheers
Mark

Ollie · July 22, 2014, 2:13pm

Hi Mark,

Thanks for the response. That makes sense, and I've now found the relevant
bit in the documentation –

for anyone else who's reading this discussion.

Any idea when 1.3 is likely to hit?

Ollie

Mark_Harwood_2 · July 22, 2014, 3:03pm

Thanks for the response. That makes sense, and I've now found the relevant
bit in the documentation –
Elasticsearch Platform — Find real-time answers at scale | Elastic
for anyone else who's reading this discussion.

Well found, sir.

Any idea when 1.3 is likely to hit?

I knew that would be your next question.
Forgive me if I'm deliberately vague, but it is definitely "sooner rather
than later".

