Hi all, I'm opening this topic because I have a problem with a large amount of data (~14M documents).
My dataset is composed as follows:
{"h":{"id":"AA001","process":"AK01","update-timestamp":1663665372171}}
{"h":{"id":"AA002","process":"AK01","update-timestamp":1663665372171}}
{"h":{"id":"AA003","process":"AK01","update-timestamp":1663665372171}}
{"h":{"id":"AA004","process":"AK01","update-timestamp":1663665372171}}
{"h":{"id":"AA001","process":"AK01","update-timestamp":1663665372172}}
{"h":{"id":"AA001","process":"AK01","update-timestamp":1663665372173}}
If my pipeline worked correctly, each key (id + process) should have only the most recent update-timestamp.
So I'm trying to count the duplicates: for each id-process pair, how many update-timestamp values are associated with it.
I tried to do it in Kibana with a data table, but the volumes are too high and it errors out.
Could you help me with a Query DSL query?
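For reference, this is the kind of aggregation I have in mind: a composite aggregation over the two key fields, with a cardinality sub-aggregation counting distinct update-timestamp values per key (a bucket with a count above 1 is a duplicate). The index name `my-index` and the field paths (I'm assuming `h.id` and `h.process` are mapped as `keyword`) are placeholders for my actual mapping:

```json
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "by_key": {
      "composite": {
        "size": 1000,
        "sources": [
          { "id": { "terms": { "field": "h.id" } } },
          { "process": { "terms": { "field": "h.process" } } }
        ]
      },
      "aggs": {
        "timestamps": {
          "cardinality": { "field": "h.update-timestamp" }
        }
      }
    }
  }
}
```

Since there are millions of keys, I'd page through the results by passing each response's `after_key` back in the composite `after` parameter, and keep only the buckets where `timestamps.value > 1` on the client side. Is this the right approach, or is there a better way at this volume?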
Thanks in advance!
Salvo