Large result sets and paging for aggregations

Based on the reference docs I couldn't figure out what happens when an
aggregation result set is very large. Does it get cut off? What is the
upper bound? Does ES crash?

I see closed issues indicating that pagination for aggregations will not
be supported (https://github.com/elasticsearch/elasticsearch/issues/4915),
BUT does that mean we can still get the entire result set, without missing
anything, in the response?

Is the best way to do this via a scroll (non-scan) query, which will
return the entire aggregation result set in the very first response, no
matter how huge?

Thanks!

  • Pulkit

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/57cb8dd9-aa54-4777-b9ae-8ee0495ecba3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Sharing a response I received from Igor Motov:

"Scroll works only to page results. Paging aggs doesn't make sense, since
aggs are executed on the entire result set; therefore, if it managed to fit
into memory, you should just get it. Paging would mean that you throw
away a lot of results that were already calculated. The only way to "page"
is by limiting the results that you are running aggs on: for example, if
your data is sorted by date, you could build the histogram for the results
one date range at a time."


Well, in my use case I have tens of thousands of records for each
member. I want to do a simple terms agg on member ID. If the count of
member IDs stays the same throughout, good enough; but if the number of
members keeps increasing, then day by day ES has to keep more and more data
in memory to calculate the aggs. That does not sound very promising. What
we do is implement routing to put member-specific data into a particular
shard. Why can't aggs be based on per-shard calculations, so that I am
safe from loading tons of data into memory?
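For context, index-time routing like this could be sketched as below (the
index, type, ID, and field values are hypothetical, not from the thread):

```json
PUT /activity/data/98765?routing=member-123
{
  "member_id": "member-123",
  "datatype": "XYZ",
  "response_timestamp": "2015-01-15"
}
```

Queries can then pass the same ?routing=member-123 value so that only the
shard holding that member's records is searched.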

Any thoughts?


Why can't aggs be based on shard based calculations

They are. The "shard_size" setting determines how many member
summaries are returned from each shard - we don't stream each
member's thousands of related records back to a centralized point to
compute a final result. The final step is to summarise the summaries from
each shard.
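As a sketch of where that setting lives (the agg name and the sizes here
are made up):

```json
{
  "size": 0,
  "aggs": {
    "memberAggs": {
      "terms": {
        "field": "member_id",
        "size": 100,
        "shard_size": 500
      }
    }
  }
}
```

Each shard computes and returns only its top 500 member buckets; the
coordinating node merges these per-shard summaries and keeps the top 100.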

if the number of members keep on increasing, day by day ES has to keep
more and more data into memory to calculate the aggs

This is a different point from the one above (shard-level computation vs
memory costs). If your analysis involves summarising the behaviours of
large numbers of people over time, you may well find the cost of doing
this in a single query too high when the number of people is extremely
large. There is a cost to any computation, and in that scenario you have
deferred all these member-summarising costs to the very last moment. A
better strategy for large-scale analysis of behaviours over time is a
"pay-as-you-go" model, where you update a per-member summary document at
regular intervals with batches of their related records. This shifts the
bulk of the computation cost from your single query to many smaller costs
paid when writing data. You can then perform efficient aggs or scan/scroll
operations on member documents with pre-summarised attributes, e.g.
totalSpend, rather than deriving these properties on the fly from records
that share a member ID.
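A minimal sketch of that pay-as-you-go update, using the scripted
partial-update API of the 1.x era (the index/type/ID and the totalSpend
field are illustrative assumptions; dynamic scripting may need to be
enabled, or the script stored on disk, depending on version):

```json
POST /members/member/member-123/_update
{
  "script": "ctx._source.totalSpend += spend",
  "params": { "spend": 42.5 }
}
```

Each batch of a member's new records becomes one small update like this,
instead of one huge aggregation at query time.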


Thanks Mark. Your "pay-as-you-go" suggestion seems amazing. But
considering the dynamics of the application, these kinds of queries are hit
more for qualitative analysis. There are hundreds of such queries (I am not
exaggerating) hit daily by our analytics team. Keeping counts for all those
qualitative checks daily and maintaining them as documents is a headache in
itself; additions/updates/removals of these documents would cause us huge
maintenance overhead. Hence I was thinking of pagination on aggregations,
which would definitely help us keep ES memory pressure down.

By the way, are there any other strategies suggested by ES for these kinds
of scenarios?

Thanks


these kind of queries are hit more for qualitative analysis.

Do you have any example queries? The "pay as you go" summarisation need not
be about just maintaining quantities. In the demo here [1] I derive
"profile" names for people, categorizing them as "newbies", "fanboys" or
"haters" based on a history of their reviewing behaviours in a marketplace.

By the way, are there any other strategies suggested by ES for these kind
of scenarios?

Igor hit on one, which is to use some criterion, e.g. date, to limit the
volume of what you analyze in any one query request.

[1] http://www.elasticsearch.org/videos/entity-centric-indexing-london-meetup-sep-2014/
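For example, that date-limiting idea might look like the following,
aggregating one month's slice per request (field and agg names follow the
query Piyush posts later in the thread):

```json
{
  "size": 0,
  "query": {
    "range": {
      "response_timestamp": { "gte": "2015-01-01", "lt": "2015-02-01" }
    }
  },
  "aggs": {
    "memberIDAggs": {
      "terms": { "field": "member_id", "size": 0 }
    }
  }
}
```

Stepping the range forward one month per request bounds how many documents
any single aggregation has to process.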


Hi Mark,

Before getting into queries, here is a little info about the project:

1.) A community where members keep increasing, decreasing, and changing,
maintained in a separate type.
2.) Approximately 3K to 4K documents per user inserted into ES per month,
in a separate type keyed by member ID.
3.) The mapping is flat; there are no nested or array types.

Requirement:

Here is a sample requirement:

1.) A report, per member ID, of document counts for the last three months.
2.) The query used to get the data:

{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "datatype": "XYZ" } },
            {
              "range": {
                "response_timestamp": {
                  "from": "2014-11-01",
                  "to": "2015-01-31"
                }
              }
            }
          ]
        }
      }
    }
  },
  "aggs": {
    "memberIDAggs": {
      "terms": {
        "field": "member_id",
        "size": 0
      },
      "aggs": {
        "dateHistAggs": {
          "date_histogram": {
            "field": "response_timestamp",
            "interval": "month"
          }
        }
      }
    }
  },
  "size": 0
}

The current member count is approximately 1K and will grow to 5K over the
next 10 months. That means roughly 5K members * 4K documents * 3 months of
documents feeding this aggregation - I'd guess a major hit on the system.
And this is only a two-level aggregation; the next requirement from our
analysts is to split each month's data into three different categories.

What is the optimum solution to this problem?

Regards
Piyush


5k doesn't sound too scary.

Think of the aggs tree like a "Bean Machine" [1] - one of those wooden
boards with pins arranged on it like a Christmas tree: you drop balls in at
the top of the board and they rattle down a choice of paths to the bottom.
In the case of aggs, your buckets are the pins and documents are the balls.

The memory requirement for processing the agg tree is typically a function
of the number of pins, not the number of balls you drop into the tree, as
the balls just fall out of the bottom.
So in your case it is 5K members multiplied by 12 months each = 60K unique
buckets, each of which maintains a counter of how many docs pass
through that point. You could pass millions or billions of docs through,
and the working-memory requirement for the query would be the same.
There is, however, a fixed overhead for all queries which is a function of
the number of docs: the Field Data cache required to hold the
dates/member IDs in RAM. If this becomes a problem, you may want to
look at the on-disk alternative structure, "DocValues".
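For reference, doc values are enabled per field in the mapping; a rough
1.x-era sketch (the type name and exact settings here are assumptions, not
from the thread):

```json
{
  "mappings": {
    "data": {
      "properties": {
        "member_id": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        },
        "response_timestamp": {
          "type": "date",
          "doc_values": true
        }
      }
    }
  }
}
```

Note that doc values apply only to not_analyzed string fields, and enabling
them on an existing field generally means reindexing.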

Hope that helps.

[1] http://en.wikipedia.org/wiki/Bean_machine


Aah! This seems to be the best explanation of how aggregation works.
Thanks a ton, Mark. :) A few other questions:

1.) Should I assume that as my document count increases, the aggregation
time will also increase? Reasoning: if bucket creation happens at the
individual shard level, then document counting happens concurrently on each
shard, which should reduce execution time significantly. But at the shard
level, as my document count increases (for docs matching the query's
criteria), and assuming this process is linear-time, the execution time
would increase.

2.) How does this analogy extend to sub-aggregations? My observation is
that adding child aggregations increases execution time along with memory
utilization. What happens in the case of sub-aggregations?

3.) I didn't get your last statement:
"There is however a fixed overhead for all queries which is a
function of number of docs and that is the Field Data cache required to
hold the dates/member IDs in RAM - if this becomes a problem then you may
want to look at on-disk alternative structure in the form of "DocValues"."

4.) Off topic, but probably best to ask here since we are talking about
it. :) DocValues was introduced in 1.0.0, and most of our mapping was
defined in ES 0.9 - can I change the mapping of existing fields now? I can
take this to another thread, but I would love to hear about points 1-3.
You have made this thread very interesting for me.

Thanks
Piyush

On Wednesday, 11 February 2015 15:12:37 UTC+5:30, Mark Harwood wrote:

5k doesn't sound too scary.

Think of the aggs tree like a "Bean Machine" [1] - one of those wooden
boards with pins arranged on it like a christmas tree and you drop balls at
the top of the board and they rattle down a choice of path to the bottom.
In the case of aggs, your buckets are the pins and documents are the balls

The memory requirement for processing the agg tree is typically the number
of pins, not the number of balls you drop into the tree as these just fall
out of the bottom of the tree.
So in your case it is 5k members multiplied by 12 months each = 60k unique
buckets, each of which will maintain a counter of how many docs pass
through that point. So you could pass millions or billions of docs through
and the working memory requirement for the query would be the same.
There is however a fixed overhead for all queries which is a function
of number of docs and that is the Field Data cache required to hold the
dates/member IDs in RAM - if this becomes a problem then you may want to
look at on-disk alternative structure in the form of "DocValues".

Hope that helps.

[1] http://en.wikipedia.org/wiki/Bean_machine

On Wednesday, February 11, 2015 at 7:04:04 AM UTC, piyush goyal wrote:

Hi Mark,

Before getting into queries, here is a little bit info about the project:

1.) A community where members keep on increasing, decreasing and
changing. Maintained in a different type.
2.) Approximately 3K to 4K documents of data of each user inserted into
ES per month in a different type maintained by member ID.
3.) Mapping is flat, there are no nested and array type of data.

Requirement:

Here is a sample requirement:

1.) Getting a report against each member ID against the count of data for
last three month.
2.) Query used to get the data is:

{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "datatype": "XYZ" } },
            {
              "range": {
                "response_timestamp": {
                  "from": "2014-11-01",
                  "to": "2015-01-31"
                }
              }
            }
          ]
        }
      }
    }
  },
  "aggs": {
    "memberIDAggs": {
      "terms": { "field": "member_id", "size": 0 },
      "aggs": {
        "dateHistAggs": {
          "date_histogram": {
            "field": "response_timestamp",
            "interval": "month"
          }
        }
      }
    }
  },
  "size": 0
}

The current member count is approximately 1K and will increase to 5K over
the next 10 months. That means roughly 5K members * 4K documents * 3 months
of data feeding this aggregation - I expect a major hit on the system. And
this is only two levels of aggregation; the next requirement from our
analysts is to break each month's data into three different categories.
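For reference, that third level would just be another nested agg. A sketch of the aggs section, where "category" is a hypothetical stand-in for whatever field distinguishes the three categories:

```python
# Request-body sketch: terms on member_id -> monthly date_histogram -> terms
# on a hypothetical "category" field. Note the bucket count grows
# multiplicatively: members x months x categories.
aggs = {
    "memberIDAggs": {
        "terms": {"field": "member_id", "size": 0},
        "aggs": {
            "dateHistAggs": {
                "date_histogram": {"field": "response_timestamp", "interval": "month"},
                "aggs": {
                    "categoryAggs": {"terms": {"field": "category"}},
                },
            }
        },
    }
}
```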

What is the optimum solution to this problem?

Regards
Piyush

On Tuesday, 10 February 2015 16:15:22 UTC+5:30, Mark Harwood wrote:

these kinds of queries are hit more for qualitative analysis.

Do you have any example queries? The "pay as you go" summarisation need
not be about just maintaining quantities. In the demo here [1] I derive
"profile" names for people, categorizing them as "newbies", "fanboys" or
"haters" based on a history of their reviewing behaviours in a marketplace.

By the way, are there any other strategies suggested by ES for these
kinds of scenarios?

Igor hit on one which is to use some criteria eg. date to limit the
volume of what you analyze in any one query request.

[1]
http://www.elasticsearch.org/videos/entity-centric-indexing-london-meetup-sep-2014/
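Sketched concretely (field names follow the query earlier in this thread; the date windows are illustrative), Igor's suggestion amounts to issuing the same agg once per window:

```python
# "Page" an aggregation by limiting what you aggregate over: one request per
# date window, each returning complete buckets for that window only.
def windowed_agg_request(start, end):
    return {
        "query": {"constant_score": {"filter": {"range": {
            "response_timestamp": {"from": start, "to": end}}}}},
        "aggs": {"memberIDAggs": {"terms": {"field": "member_id", "size": 0}}},
        "size": 0,
    }

windows = [("2014-11-01", "2014-11-30"), ("2014-12-01", "2014-12-31")]
requests = [windowed_agg_request(s, e) for s, e in windows]
```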

On Tuesday, February 10, 2015 at 10:05:24 AM UTC, piyush goyal wrote:

Thanks Mark. Your "pay-as-you-go" suggestion sounds great. But considering
the dynamics of the application, these kinds of queries are hit more for
qualitative analysis. There are hundreds of such queries (I am not
exaggerating) hit daily by our analytics team. Keeping counts for all those
qualitative checks daily and maintaining them as documents is a headache in
itself; additions/updates/removals of these documents would cause us huge
maintenance overheads. Hence I was thinking of pagination on aggregations,
which would definitely help us keep ES memory pressure down.

By the way, are there any other strategies suggested by ES for these
kinds of scenarios?

Thanks

On Tuesday, 10 February 2015 15:20:40 UTC+5:30, Mark Harwood wrote:

Why can't aggs be based on shard-based calculations

They are. The "shard_size" setting will determine how many member
summaries will be returned from each shard - we won't stream each
member's thousands of related records back to a centralized point to
compute a final result. The final step is to summarise the summaries from
each shard.

if the number of members keeps increasing, day by day ES has to keep
more and more data in memory to calculate the aggs

This is a different point to the one above (shard-level computation vs
memory costs). If your analysis involves summarising the behaviours of
large numbers of people over time then you may well find the cost of doing
this in a single query too high when the numbers of people are extremely
large. There is a cost to any computation and in that scenario you have
deferred all these member-summarising costs to the very last moment. A
better strategy for large-scale analysis of behaviours over time is to use
a "pay-as-you-go" model where you update a per-member summary document at
regular intervals with batches of their related records. This shifts the
bulk of the computation cost from your single query to many smaller costs
when writing data. You can then perform efficient aggs or scan/scroll
operations on member documents with pre-summarised attributes e.g.
totalSpend rather than deriving these properties on-the-fly from records
with a shared member ID.
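A toy sketch of that write-time folding - the totalSpend field and the shape of the raw records are assumptions for illustration:

```python
def fold_batch(summary, records):
    """Fold a batch of a member's raw records into their running summary doc."""
    summary["doc_count"] = summary.get("doc_count", 0) + len(records)
    summary["totalSpend"] = summary.get("totalSpend", 0) + sum(r["spend"] for r in records)
    return summary

# Applied at regular intervals as batches arrive; queries then aggregate (or
# scan/scroll) over the small set of summary docs instead of every raw record.
summary = fold_batch({"member_id": "m-42"}, [{"spend": 10}, {"spend": 5}])
print(summary["totalSpend"])  # 15
```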

On Tuesday, February 10, 2015 at 7:03:17 AM UTC, piyush goyal wrote:

Well, my use case is that I have tens of thousands of records for each
member. I want to do a simple terms agg on member ID. If my count of member
IDs stayed the same throughout, good enough; but as the number of members
keeps increasing, day by day ES has to keep more and more data in memory to
calculate the aggs. That does not sound very promising. What we do is use
routing to put member-specific data into a particular shard. Why can't aggs
be based on shard-based calculations, so that I am safe from loading tons
of data into memory?

Any thoughts?

On Sunday, 9 November 2014 22:58:12 UTC+5:30, pulkitsinghal wrote:

Sharing a response I received from Igor Motov:

"scroll works only to page results. paging aggs doesn't make sense

since aggs are executed on the entire result set. therefore if it managed
to fit into the memory you should just get it. paging will mean that you
throw away a lot of results that were already calculated. the only way to
"page" is by limiting the results that you are running aggs on. for example
if your data is sorted by date and you want to build histogram for the
results one date range at a time."

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/252f95d5-007c-4591-8939-a4baa133f344%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

1.) Would it be correct to assume that as my document count increases, the
time for the aggregation calculation increases as well?

Yes - documents are processed serially through the tree, but it happens
really quickly, and each shard does this at the same time to produce its
summary, which is the key to scaling if things start slowing.

2.) How does this analogy relate to sub-aggregations?

Each row of pins in the bean machine represents another decision point for
the direction of travel - this is the equivalent of a sub-aggregation. The
difference is that in the bean machine each layer offers only a choice of 2
directions - the ball goes left or right around a pin. In your first layer
of aggregation there are not 2 but 5k choices of direction - one bucket for
each member. Each member bucket is further broken down by 12 months.

My observation is that as you increase the number of child aggregations,
the execution time increases along with memory utilization. What happens in
the case of sub-aggregations?

Hopefully the previous comment has made that clear. Each sub-agg represents
an additional decision point that has to be negotiated by each document -
consider direction based on member ID, then the next direction based on
month.

3.) I didn't get your last statement:
"There is however a fixed overhead for all queries which is a
function of number of docs and that is the Field Data cache required to
hold the dates/member IDs in RAM - if this becomes a problem then you may
want to look at on-disk alternative structure in the form of "DocValues"."

We are fast at routing docs through the agg tree because we can look up
things like memberID for each matching doc really quickly and then
determine the appropriate direction of travel. We rely on these look-up
stores being pre-warmed (i.e. held in field data arrays in the JVM, or
cached by the file system when using the DocValues disk-based equivalents)
to make our queries fast.
They are a fixed cost shared by all queries that make use of them.

4.) Off topic, but I guess it's best to ask here since we are talking
about it :) - DocValues: since it was introduced in 1.0.0 and most of our
mapping was defined in ES 0.9, can I change the mapping of existing fields
now? Maybe I should take this to another thread, but I would love to hear
your thoughts on points 1-3. You made this thread very interesting for me.

I recommend you shift that off to another topic.

Thanks
Piyush
