Joining aggregation results to the original documents

Hi
I need help to create a query that creates metrics aggregations, merge created metrics to original documents and then filters documents using these metrics.
Let's consider following dataset with managers' sales:

manager month day sale
0 Tom 1 1 385
1 Tom 1 2 225
2 Tom 1 3 327
3 Tom 1 4 118
4 Tom 1 5 307
5 Tom 1 6 400
6 Tom 1 7 259
7 Tom 1 8 255
8 Tom 1 9 264
9 Tom 2 1 355
10 Tom 2 2 329
11 Tom 2 3 320
12 Tom 2 4 256
13 Tom 2 5 330
14 Tom 2 6 369
15 Tom 2 7 227
16 Tom 2 8 373
17 Tom 2 9 246
18 John 1 1 304
19 John 1 2 238
20 John 1 3 313
21 John 1 4 227
22 John 1 5 302
23 John 1 6 319
24 John 1 7 209
25 John 1 8 250
26 John 1 9 311
27 John 2 1 322
28 John 2 2 188
29 John 2 3 304
30 John 2 4 349
31 John 2 5 273
32 John 2 6 487
33 John 2 7 307
34 John 2 8 229
35 John 2 9 375

I want to filter docs where daily sale greater than monthly avg by certain manager.
If I had used Python and Pandas I would made following:

df['avg_by_month'] = df.groupby(['manager', 'month'])['sell'].transform('mean')

New data:

manager month day sale avg_by_month
0 Tom 1 1 385 282.222
1 Tom 1 2 225 282.222
2 Tom 1 3 327 282.222
3 Tom 1 4 118 282.222
4 Tom 1 5 307 282.222
5 Tom 1 6 400 282.222
6 Tom 1 7 259 282.222
7 Tom 1 8 255 282.222
8 Tom 1 9 264 282.222
9 Tom 2 1 355 311.667
10 Tom 2 2 329 311.667
11 Tom 2 3 320 311.667
12 Tom 2 4 256 311.667
13 Tom 2 5 330 311.667
14 Tom 2 6 369 311.667
15 Tom 2 7 227 311.667
16 Tom 2 8 373 311.667
17 Tom 2 9 246 311.667
18 John 1 1 304 274.778
19 John 1 2 238 274.778
20 John 1 3 313 274.778
21 John 1 4 227 274.778
22 John 1 5 302 274.778
23 John 1 6 319 274.778
24 John 1 7 209 274.778
25 John 1 8 250 274.778
26 John 1 9 311 274.778
27 John 2 1 322 314.889
28 John 2 2 188 314.889
29 John 2 3 304 314.889
30 John 2 4 349 314.889
31 John 2 5 273 314.889
32 John 2 6 487 314.889
33 John 2 7 307 314.889
34 John 2 8 229 314.889
35 John 2 9 375 314.889

And now let's filter docs:

df.loc[df.sell > df.avg_by_month]
manager month day sale avg_by_month
0 Tom 1 1 385 282.222
2 Tom 1 3 327 282.222
4 Tom 1 5 307 282.222
5 Tom 1 6 400 282.222
9 Tom 2 1 355 311.667
10 Tom 2 2 329 311.667
11 Tom 2 3 320 311.667
13 Tom 2 5 330 311.667
14 Tom 2 6 369 311.667
16 Tom 2 8 373 311.667
18 John 1 1 304 274.778
20 John 1 3 313 274.778
22 John 1 5 302 274.778
23 John 1 6 319 274.778
26 John 1 9 311 274.778
27 John 2 1 322 314.889
30 John 2 4 349 314.889
32 John 2 6 487 314.889
35 John 2 9 375 314.889

How can I create similar query using Query DSL?
Thank you

If you are used to python and data frames, have you considered eland? See https://eland.readthedocs.io - this way you have DataFrames backed by Elasticsearch.

If you do not want to go that route, you could have an aggregation to figure out the monthly average, then have a terms agg for each manager and calculate the avg (potentially with time filters or date histograms for a per month view). You could also use a bucket aggregation to filter averages below the total monthly average.

hope this helps as a start.

Hi, @spinscale
You didn't understand my problem))
I can retrieve avg sales using composite buckets by several terms (in my example: manager name and month).
But I don't know how to filter docs INSIDE elastic using calculated avg sales without downloading all docs to my workstation.
As a result of my dsl query I want to get docs, which sale field is grater than avg sales. I don't want to get separately ALL the docs to my workstation and avg aggregations and filter docs using my computer. I want to do that inside elastic.

You need to run two requests then, first retrieving the average, then a second request to filter docs that are above the average. There is no way to properly do this in one request.

OK, it's not a problem to make two separate requests.
But I don't know how to make second request properly.
In my example after first requests I will have avg values for particular manager and month. How can I filter documents using the following rules:
I a document has 'manager' field = 'Tom' AND 'month' field = 1 then compare 'sale' field with corresponding avg value, etc.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.