Joining aggregation results to the original documents

lyubimtsev · August 26, 2021, 8:28am

Hi
I need help to create a query that creates metrics aggregations, merge created metrics to original documents and then filters documents using these metrics.
Let's consider following dataset with managers' sales:

	manager	month	day	sale
0	Tom	1	1	385
1	Tom	1	2	225
2	Tom	1	3	327
3	Tom	1	4	118
4	Tom	1	5	307
5	Tom	1	6	400
6	Tom	1	7	259
7	Tom	1	8	255
8	Tom	1	9	264
9	Tom	2	1	355
10	Tom	2	2	329
11	Tom	2	3	320
12	Tom	2	4	256
13	Tom	2	5	330
14	Tom	2	6	369
15	Tom	2	7	227
16	Tom	2	8	373
17	Tom	2	9	246
18	John	1	1	304
19	John	1	2	238
20	John	1	3	313
21	John	1	4	227
22	John	1	5	302
23	John	1	6	319
24	John	1	7	209
25	John	1	8	250
26	John	1	9	311
27	John	2	1	322
28	John	2	2	188
29	John	2	3	304
30	John	2	4	349
31	John	2	5	273
32	John	2	6	487
33	John	2	7	307
34	John	2	8	229
35	John	2	9	375

I want to filter docs where daily sale greater than monthly avg by certain manager.
If I had used Python and Pandas I would made following:

df['avg_by_month'] = df.groupby(['manager', 'month'])['sell'].transform('mean')

New data:

	manager	month	day	sale	avg_by_month
0	Tom	1	1	385	282.222
1	Tom	1	2	225	282.222
2	Tom	1	3	327	282.222
3	Tom	1	4	118	282.222
4	Tom	1	5	307	282.222
5	Tom	1	6	400	282.222
6	Tom	1	7	259	282.222
7	Tom	1	8	255	282.222
8	Tom	1	9	264	282.222
9	Tom	2	1	355	311.667
10	Tom	2	2	329	311.667
11	Tom	2	3	320	311.667
12	Tom	2	4	256	311.667
13	Tom	2	5	330	311.667
14	Tom	2	6	369	311.667
15	Tom	2	7	227	311.667
16	Tom	2	8	373	311.667
17	Tom	2	9	246	311.667
18	John	1	1	304	274.778
19	John	1	2	238	274.778
20	John	1	3	313	274.778
21	John	1	4	227	274.778
22	John	1	5	302	274.778
23	John	1	6	319	274.778
24	John	1	7	209	274.778
25	John	1	8	250	274.778
26	John	1	9	311	274.778
27	John	2	1	322	314.889
28	John	2	2	188	314.889
29	John	2	3	304	314.889
30	John	2	4	349	314.889
31	John	2	5	273	314.889
32	John	2	6	487	314.889
33	John	2	7	307	314.889
34	John	2	8	229	314.889
35	John	2	9	375	314.889

And now let's filter docs:

df.loc[df.sell > df.avg_by_month]

	manager	month	day	sale	avg_by_month
0	Tom	1	1	385	282.222
2	Tom	1	3	327	282.222
4	Tom	1	5	307	282.222
5	Tom	1	6	400	282.222
9	Tom	2	1	355	311.667
10	Tom	2	2	329	311.667
11	Tom	2	3	320	311.667
13	Tom	2	5	330	311.667
14	Tom	2	6	369	311.667
16	Tom	2	8	373	311.667
18	John	1	1	304	274.778
20	John	1	3	313	274.778
22	John	1	5	302	274.778
23	John	1	6	319	274.778
26	John	1	9	311	274.778
27	John	2	1	322	314.889
30	John	2	4	349	314.889
32	John	2	6	487	314.889
35	John	2	9	375	314.889

How can I create similar query using Query DSL?
Thank you

spinscale · August 26, 2021, 3:37pm

If you are used to python and data frames, have you considered eland? See https://eland.readthedocs.io - this way you have DataFrames backed by Elasticsearch.

If you do not want to go that route, you could have an aggregation to figure out the monthly average, then have a terms agg for each manager and calculate the avg (potentially with time filters or date histograms for a per month view). You could also use a bucket aggregation to filter averages below the total monthly average.

hope this helps as a start.

lyubimtsev · August 26, 2021, 4:13pm

Hi, @spinscale
You didn't understand my problem))
I can retrieve avg sales using composite buckets by several terms (in my example: manager name and month).
But I don't know how to filter docs INSIDE elastic using calculated avg sales without downloading all docs to my workstation.
As a result of my dsl query I want to get docs, which sale field is grater than avg sales. I don't want to get separately ALL the docs to my workstation and avg aggregations and filter docs using my computer. I want to do that inside elastic.

spinscale · August 30, 2021, 10:04am

You need to run two requests then, first retrieving the average, then a second request to filter docs that are above the average. There is no way to properly do this in one request.

lyubimtsev · September 6, 2021, 5:20pm

OK, it's not a problem to make two separate requests.
But I don't know how to make second request properly.
In my example after first requests I will have avg values for particular manager and month. How can I filter documents using the following rules:
I a document has 'manager' field = 'Tom' AND 'month' field = 1 then compare 'sale' field with corresponding avg value, etc.

system · October 4, 2021, 5:21pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
How to get documents in Elasticsearch based on aggregation output values? Elasticsearch	2	1274	June 20, 2019
Could not able to Perform aggregations along with filters Elasticsearch	3	708	July 5, 2017
Is it possible to reference query/filer data in your aggregations Elasticsearch	2	322	July 6, 2017
How to put the aggregation value as query condition Elasticsearch	3	2041	July 6, 2017
Filtering the response into the "Avg Bucket Aggregation" Elasticsearch	3	421	November 3, 2020

Joining aggregation results to the original documents

Related topics