Hi
I need help to create a query that creates metrics aggregations, merge created metrics to original documents and then filters documents using these metrics.
Let's consider following dataset with managers' sales:
| manager | month | day | sale | |
|---|---|---|---|---|
| 0 | Tom | 1 | 1 | 385 |
| 1 | Tom | 1 | 2 | 225 |
| 2 | Tom | 1 | 3 | 327 |
| 3 | Tom | 1 | 4 | 118 |
| 4 | Tom | 1 | 5 | 307 |
| 5 | Tom | 1 | 6 | 400 |
| 6 | Tom | 1 | 7 | 259 |
| 7 | Tom | 1 | 8 | 255 |
| 8 | Tom | 1 | 9 | 264 |
| 9 | Tom | 2 | 1 | 355 |
| 10 | Tom | 2 | 2 | 329 |
| 11 | Tom | 2 | 3 | 320 |
| 12 | Tom | 2 | 4 | 256 |
| 13 | Tom | 2 | 5 | 330 |
| 14 | Tom | 2 | 6 | 369 |
| 15 | Tom | 2 | 7 | 227 |
| 16 | Tom | 2 | 8 | 373 |
| 17 | Tom | 2 | 9 | 246 |
| 18 | John | 1 | 1 | 304 |
| 19 | John | 1 | 2 | 238 |
| 20 | John | 1 | 3 | 313 |
| 21 | John | 1 | 4 | 227 |
| 22 | John | 1 | 5 | 302 |
| 23 | John | 1 | 6 | 319 |
| 24 | John | 1 | 7 | 209 |
| 25 | John | 1 | 8 | 250 |
| 26 | John | 1 | 9 | 311 |
| 27 | John | 2 | 1 | 322 |
| 28 | John | 2 | 2 | 188 |
| 29 | John | 2 | 3 | 304 |
| 30 | John | 2 | 4 | 349 |
| 31 | John | 2 | 5 | 273 |
| 32 | John | 2 | 6 | 487 |
| 33 | John | 2 | 7 | 307 |
| 34 | John | 2 | 8 | 229 |
| 35 | John | 2 | 9 | 375 |
I want to filter docs where daily sale greater than monthly avg by certain manager.
If I had used Python and Pandas I would made following:
df['avg_by_month'] = df.groupby(['manager', 'month'])['sell'].transform('mean')
New data:
| manager | month | day | sale | avg_by_month | |
|---|---|---|---|---|---|
| 0 | Tom | 1 | 1 | 385 | 282.222 |
| 1 | Tom | 1 | 2 | 225 | 282.222 |
| 2 | Tom | 1 | 3 | 327 | 282.222 |
| 3 | Tom | 1 | 4 | 118 | 282.222 |
| 4 | Tom | 1 | 5 | 307 | 282.222 |
| 5 | Tom | 1 | 6 | 400 | 282.222 |
| 6 | Tom | 1 | 7 | 259 | 282.222 |
| 7 | Tom | 1 | 8 | 255 | 282.222 |
| 8 | Tom | 1 | 9 | 264 | 282.222 |
| 9 | Tom | 2 | 1 | 355 | 311.667 |
| 10 | Tom | 2 | 2 | 329 | 311.667 |
| 11 | Tom | 2 | 3 | 320 | 311.667 |
| 12 | Tom | 2 | 4 | 256 | 311.667 |
| 13 | Tom | 2 | 5 | 330 | 311.667 |
| 14 | Tom | 2 | 6 | 369 | 311.667 |
| 15 | Tom | 2 | 7 | 227 | 311.667 |
| 16 | Tom | 2 | 8 | 373 | 311.667 |
| 17 | Tom | 2 | 9 | 246 | 311.667 |
| 18 | John | 1 | 1 | 304 | 274.778 |
| 19 | John | 1 | 2 | 238 | 274.778 |
| 20 | John | 1 | 3 | 313 | 274.778 |
| 21 | John | 1 | 4 | 227 | 274.778 |
| 22 | John | 1 | 5 | 302 | 274.778 |
| 23 | John | 1 | 6 | 319 | 274.778 |
| 24 | John | 1 | 7 | 209 | 274.778 |
| 25 | John | 1 | 8 | 250 | 274.778 |
| 26 | John | 1 | 9 | 311 | 274.778 |
| 27 | John | 2 | 1 | 322 | 314.889 |
| 28 | John | 2 | 2 | 188 | 314.889 |
| 29 | John | 2 | 3 | 304 | 314.889 |
| 30 | John | 2 | 4 | 349 | 314.889 |
| 31 | John | 2 | 5 | 273 | 314.889 |
| 32 | John | 2 | 6 | 487 | 314.889 |
| 33 | John | 2 | 7 | 307 | 314.889 |
| 34 | John | 2 | 8 | 229 | 314.889 |
| 35 | John | 2 | 9 | 375 | 314.889 |
And now let's filter docs:
df.loc[df.sell > df.avg_by_month]
| manager | month | day | sale | avg_by_month | |
|---|---|---|---|---|---|
| 0 | Tom | 1 | 1 | 385 | 282.222 |
| 2 | Tom | 1 | 3 | 327 | 282.222 |
| 4 | Tom | 1 | 5 | 307 | 282.222 |
| 5 | Tom | 1 | 6 | 400 | 282.222 |
| 9 | Tom | 2 | 1 | 355 | 311.667 |
| 10 | Tom | 2 | 2 | 329 | 311.667 |
| 11 | Tom | 2 | 3 | 320 | 311.667 |
| 13 | Tom | 2 | 5 | 330 | 311.667 |
| 14 | Tom | 2 | 6 | 369 | 311.667 |
| 16 | Tom | 2 | 8 | 373 | 311.667 |
| 18 | John | 1 | 1 | 304 | 274.778 |
| 20 | John | 1 | 3 | 313 | 274.778 |
| 22 | John | 1 | 5 | 302 | 274.778 |
| 23 | John | 1 | 6 | 319 | 274.778 |
| 26 | John | 1 | 9 | 311 | 274.778 |
| 27 | John | 2 | 1 | 322 | 314.889 |
| 30 | John | 2 | 4 | 349 | 314.889 |
| 32 | John | 2 | 6 | 487 | 314.889 |
| 35 | John | 2 | 9 | 375 | 314.889 |
How can I create similar query using Query DSL?
Thank you