Hi Mark,
Before getting into the queries, here is a little background about the project:
1.) A community whose members keep increasing, decreasing and changing. Members are maintained in a separate type.
2.) Approximately 3K to 4K documents per member are inserted into ES each month, in a separate type keyed by member ID.
3.) The mapping is flat; there are no nested or array fields.
Requirement:
Here is a sample requirement:
1.) Get a report of the document count for each member ID over the last three months.
2.) The query used to get the data is:
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "datatype": "XYZ"
              }
            },
            {
              "range": {
                "response_timestamp": {
                  "from": "2014-11-01",
                  "to": "2015-01-31"
                }
              }
            }
          ]
        }
      }
    }
  },
  "aggs": {
    "memberIDAggs": {
      "terms": {
        "field": "member_id",
        "size": 0
      },
      "aggs": {
        "dateHistAggs": {
          "date_histogram": {
            "field": "response_timestamp",
            "interval": "month"
          }
        }
      }
    }
  },
  "size": 0
}
Now, the current member count is approximately 1K and will grow to about 5K over the next 10 months. That means roughly 5K members * 4K documents * 3 months = 60 million documents feeding this aggregation, which I expect to be a major hit on the system. And this is only a two-level aggregation; the next requirement from our analysts is to break the per-month data down into three different categories.
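For reference, a minimal sketch of that extra per-month breakdown, assuming a hypothetical "category" field that does not appear in the mapping above, would simply add a third aggregation level:

{
  "size": 0,
  "aggs": {
    "memberIDAggs": {
      "terms": {
        "field": "member_id",
        "size": 0
      },
      "aggs": {
        "dateHistAggs": {
          "date_histogram": {
            "field": "response_timestamp",
            "interval": "month"
          },
          "aggs": {
            "categoryAggs": {
              "terms": {
                "field": "category"
              }
            }
          }
        }
      }
    }
  }
}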
What is the optimum solution to this problem?
Regards
Piyush
On Tuesday, 10 February 2015 16:15:22 UTC+5:30, Mark Harwood wrote:
these kinds of queries are used more for qualitative analysis.
Do you have any example queries? The "pay as you go" summarisation need
not be about just maintaining quantities. In the demo here [1] I derive
"profile" names for people, categorizing them as "newbies", "fanboys" or
"haters" based on a history of their reviewing behaviours in a marketplace.
By the way, are there any other strategies suggested by ES for these kinds of scenarios?
Igor hit on one, which is to use some criterion, e.g. date, to limit the volume of what you analyze in any one query request.
[1]
Elasticsearch Platform — Find real-time answers at scale | Elastic
On Tuesday, February 10, 2015 at 10:05:24 AM UTC, piyush goyal wrote:
Thanks Mark. Your "pay-as-you-go" suggestion sounds great. But given the dynamics of the application, these kinds of queries are used more for qualitative analysis. There are hundreds of such queries (I am not exaggerating) being hit daily by our analytics team. Keeping counts for all those qualitative checks every day and maintaining them as documents is a headache in itself, and the additions, updates and removals of those documents would cause us a huge maintenance overhead. Hence I was thinking of pagination on aggregations, which would definitely help keep ES memory pressure under control.
By the way, are there any other strategies suggested by ES for these kinds of scenarios?
Thanks
On Tuesday, 10 February 2015 15:20:40 UTC+5:30, Mark Harwood wrote:
Why can't aggs be based on shard based calculations
They are. The "shard_size" setting will determine how many member
summaries will be returned from each shard - we won't stream each
member's thousands of related records back to a centralized point to
compute a final result. The final step is to summarise the summaries from
each shard.
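For illustration only (the size and shard_size values below are arbitrary, not taken from this thread), an explicit shard_size on the terms aggregation looks like:

{
  "size": 0,
  "aggs": {
    "memberIDAggs": {
      "terms": {
        "field": "member_id",
        "size": 100,
        "shard_size": 500
      }
    }
  }
}

Each shard returns at most its top 500 member buckets, and the coordinating node merges those into the final top 100.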
if the number of members keeps increasing, day by day ES has to keep more and more data in memory to calculate the aggs
This is a different point to the one above (shard-level computation vs
memory costs). If your analysis involves summarising the behaviours of
large numbers of people over time then you may well find the cost of doing
this in a single query too high when the numbers of people are extremely
large. There is a cost to any computation and in that scenario you have
deferred all these member-summarising costs to the very last moment. A
better strategy for large-scale analysis of behaviours over time is to use
a "pay-as-you-go" model where you update a per-member summary document at
regular intervals with batches of their related records. This shifts the
bulk of the computation cost from your single query to many smaller costs
when writing data. You can then perform efficient aggs or scan/scroll
operations on member documents with pre-summarised attributes e.g.
totalSpend rather than deriving these properties on-the-fly from records
with a shared member ID.
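A minimal sketch of that model, assuming an illustrative member_summaries index and made-up fields such as profile and totalSpend (none of these names come from the actual project): a batch job periodically upserts one summary document per member, e.g.

PUT /member_summaries/summary/12345
{
  "member_id": "12345",
  "profile": "fanboy",
  "totalSpend": 1850.75,
  "docsLastThreeMonths": 11200,
  "lastUpdated": "2015-01-31"
}

and the reporting queries then aggregate over those pre-summarised fields instead of the raw per-member records:

GET /member_summaries/_search
{
  "size": 0,
  "aggs": {
    "profileAggs": {
      "terms": { "field": "profile" },
      "aggs": {
        "spend": { "sum": { "field": "totalSpend" } }
      }
    }
  }
}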
On Tuesday, February 10, 2015 at 7:03:17 AM UTC, piyush goyal wrote:
Well, in my use case I have tens of thousands of records for each member, and I want to do a simple terms agg on member ID. If the count of member IDs stayed the same throughout, that would be good enough; but if the number of members keeps increasing, day by day ES has to keep more and more data in memory to calculate the aggs. That does not sound very promising. What we do is use routing to put member-specific data onto a particular shard, so why can't aggs be based on shard based calculations, so that I am safe from loading tons of data into memory?
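As a rough illustration of that routing (the index, type, IDs and field values here are placeholders), each document is indexed with routing set to the member ID so that all of a member's data lands on the same shard:

PUT /member_data/XYZ/98765?routing=12345
{
  "member_id": "12345",
  "datatype": "XYZ",
  "response_timestamp": "2014-12-15"
}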
Any thoughts?
On Sunday, 9 November 2014 22:58:12 UTC+5:30, pulkitsinghal wrote:
Sharing a response I received from Igor Motov:
"scroll works only to page results. paging aggs doesn't make sense
since aggs are executed on the entire result set. therefore if it managed
to fit into the memory you should just get it. paging will mean that you
throw away a lot of results that were already calculated. the only way to
"page" is by limiting the results that you are running aggs on. for example
if your data is sorted by date and you want to build histogram for the
results one date range at a time."
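To illustrate that last point (the dates are placeholders), the aggregation from earlier in this thread could be run one month at a time rather than over the whole three-month window, e.g. for November only:

{
  "size": 0,
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "datatype": "XYZ" } },
            {
              "range": {
                "response_timestamp": {
                  "from": "2014-11-01",
                  "to": "2014-11-30"
                }
              }
            }
          ]
        }
      }
    }
  },
  "aggs": {
    "memberIDAggs": {
      "terms": { "field": "member_id", "size": 0 }
    }
  }
}

and then repeated with the range moved forward to December and January.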