Sub aggregations on aggregations with 'limited' results (e.g. terms)


(Ollie) #1

Hi,

I have a question about sub-aggregations. We're running a number of terms
aggregations on some high-cardinality fields, returning the top 50 results
(as set by the size parameter) in each case. Each terms aggregation also has
a cardinality sub-aggregation to get the number of unique users (a separate
field) for each term returned.

We are wondering if the cardinality aggregation is executed for every
possible term found by the terms aggregation, or only the top 50 terms? We
are seeing very high memory usage and getting out-of-memory errors when
running this, and it's not clear from the documentation what's going on
under the hood. A cardinality aggregation on every single possible term
would go some way towards explaining things.

Is there a more efficient way of running these queries?

Thanks in advance,

Ollie

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/02eb67d1-73a1-4a7e-8e76-d4c48525360e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Mark Harwood-2) #2

Hi Ollie,

In the next release there is a new option to cater for this scenario. We
introduce a "collect_mode" that allows a new "breadth_first" setting on
terms aggregations, which explores all of the buckets at that level and then
prunes to the top N (in your case, 50) before flowing matches down to the
child aggregations. The default mode of operation is the current
"depth_first" approach, which runs the child aggregations for every
candidate term before pruning and so can produce a very large number of
buckets in rare cases like the one you have encountered.
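
As a rough sketch of what such a request body might look like, here it is built as a Python dict. The field names "tag" and "user_id" and the aggregation names "top_terms" and "unique_users" are placeholders and not taken from this thread; check the exact syntax against the terms aggregation documentation for your version.

```python
import json


def terms_with_unique_users(term_field, user_field, size=50):
    """Build a search body: top `size` terms, each with a unique-user count.

    All field and aggregation names here are illustrative placeholders.
    """
    return {
        "size": 0,  # only the aggregation results are wanted, not the hits
        "aggs": {
            "top_terms": {
                "terms": {
                    "field": term_field,
                    "size": size,
                    # prune to the top `size` buckets before the child
                    # aggregation is run on them
                    "collect_mode": "breadth_first",
                },
                "aggs": {
                    "unique_users": {"cardinality": {"field": user_field}}
                },
            }
        },
    }


if __name__ == "__main__":
    # "tag" and "user_id" are hypothetical field names
    print(json.dumps(terms_with_unique_users("tag", "user_id"), indent=2))
```

The only difference from a default request is the "collect_mode" key inside the "terms" block; leaving it out gives the "depth_first" behaviour described above.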

In the interim, you can do this as two requests from your client. The first
gets the top 50 top-level terms; the second is a query filtered by those
terms that computes the child aggs.
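
A minimal sketch of the two-request approach, building only the request bodies and parsing the first response (no HTTP client is shown). The helper names, the field names "tag" and "user_id", and the choice of a "terms" query to restrict the second request are assumptions for illustration. It also assumes the term field is single-valued, so that the second request returns only the selected terms.

```python
def top_terms_request(term_field, size=50):
    """First request: top-level terms only, no child aggregations."""
    return {
        "size": 0,  # aggregation results only
        "aggs": {"top_terms": {"terms": {"field": term_field, "size": size}}},
    }


def extract_terms(response):
    """Pull the bucket keys out of the first response."""
    buckets = response["aggregations"]["top_terms"]["buckets"]
    return [bucket["key"] for bucket in buckets]


def unique_users_request(term_field, user_field, terms):
    """Second request: restrict to the chosen terms, then count unique users."""
    if not terms:
        raise ValueError("no terms to filter on")
    return {
        "size": 0,
        # only documents carrying one of the selected terms are aggregated
        "query": {"terms": {term_field: terms}},
        "aggs": {
            "top_terms": {
                "terms": {"field": term_field, "size": len(terms)},
                "aggs": {
                    "unique_users": {"cardinality": {"field": user_field}}
                },
            }
        },
    }


if __name__ == "__main__":
    # A hand-written stand-in for the first response, not real output.
    fake_response = {
        "aggregations": {
            "top_terms": {
                "buckets": [
                    {"key": "alpha", "doc_count": 12},
                    {"key": "beta", "doc_count": 7},
                ]
            }
        }
    }
    first = top_terms_request("tag")
    selected = extract_terms(fake_response)
    second = unique_users_request("tag", "user_id", selected)
    print(first)
    print(second)
```

Because the second request only ever sees documents matching the selected terms, the cardinality sub-aggregation is built for at most that many buckets.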

Cheers
Mark

On Tuesday, July 22, 2014 11:57:21 AM UTC+1, Ollie wrote:

> We are wondering if the cardinality aggregation is executed for every
> possible term found by the terms aggregation, or only the top 50 terms?


(Ollie) #3

Hi Mark,

Thanks for the response. That makes sense, and I've now found the relevant
bit in the documentation:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_collect_mode
for anyone else who's reading this discussion.

Any idea when 1.3 is likely to hit?

Ollie

On Tuesday, July 22, 2014 1:51:59 PM UTC+1, Mark Harwood wrote:

> In the next release there is a new option to cater for this scenario.


(Mark Harwood-2) #4

> Thanks for the response. That makes sense, and I've now found the relevant
> bit in the documentation:
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_collect_mode
> for anyone else who's reading this discussion.

Well found, sir.

> Any idea when 1.3 is likely to hit?

I knew that would be your next question :)
Forgive me if I'm deliberately vague but it is definitely "sooner rather
than later".


