I have items that over the time can go from one category to the other.
Each change results in a new document with current state of the item.
I'm tying to get run basic analytics based on the latest state of the item like how many of them are in certain category or sum of values.
Basically I'm trying to collapse and aggregate results. But that's not possible.
Using PPL it's super simple, but does not scale, say few million docs.
search source=test-stats
| sort -timestamp
| dedup 1 id
| stats sum(value) as sum_value, COUNT(id) by category
My test docs:
{"_id": "doc-1-1", "id": "doc-1", "category": "c1", "value": 1010.0, "timestamp": "2023-07-07T00:00:00.000Z"}
{"_id": "doc-1-2", "id": "doc-1", "category": "c1", "value": 1012.0, "timestamp": "2023-07-08T00:00:00.000Z"}
{"_id": "doc-1-3", "id": "doc-1", "category": "c2", "value": 1013.0, "timestamp": "2023-07-09T00:00:00.000Z"}
{"_id": "doc-2-1", "id": "doc-2", "category": "c2", "value": 1021.0, "timestamp": "2023-07-07T00:00:00.000Z"}
{"_id": "doc-2-2", "id": "doc-2", "category": "c3", "value": 1022.0, "timestamp": "2023-07-10T00:00:00.000Z"}
{"_id": "doc-2-3", "id": "doc-2", "category": "c4", "value": 1023.0, "timestamp": "2023-07-11T00:00:00.000Z"}
{"_id": "doc-3-1", "id": "doc-3", "category": "c2", "value": 1031.0, "timestamp": "2023-07-07T00:00:00.000Z"}
{"_id": "doc-3-2", "id": "doc-3", "category": "c2", "value": 1032.0, "timestamp": "2023-07-10T00:00:00.000Z"}
{"_id": "doc-3-3", "id": "doc-3", "category": "c2", "value": 1033.0, "timestamp": "2023-07-11T00:00:00.000Z"}
Expected result would be:
sum_value COUNT(id) category
1023 1 c4
2046 2 c2