We are using a real-time rollup job to aggregate events into metrics. The scenario:
- events come from Logstash; peak ingestion is around 100k-120k events per minute
- the rollup job is scheduled to run every minute, and the time bucket size is 1m
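For reference, our setup looks roughly like the job below. The index, field, and metric names here are placeholders, not our actual config; the relevant parameter is `delay`, which is the "latency buffer" discussed in this post (a 1-minute cron with 1m buckets and a 1h delay):

```json
PUT _rollup/job/events_1m
{
  "index_pattern": "events-*",
  "rollup_index": "events_rollup",
  "cron": "0 * * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": {
      "field": "@timestamp",
      "fixed_interval": "1m",
      "delay": "1h"
    }
  },
  "metrics": [
    { "field": "duration_ms", "metrics": ["avg", "max", "value_count"] }
  ]
}
```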
We've noticed an anomaly in this scenario:
- If the rollup job runs with no latency buffer, the event count in the aggregated data is 5-7% lower than in the raw events.
- If the rollup job runs with a 1h latency buffer, the aggregated data perfectly matches the raw events.
Our goal is to have aggregated metrics as close to real time as possible, which ideally means keeping the latency buffer as small as possible while making sure that no events are lost during aggregation. This is hard to achieve without a clear understanding of:
- what is going on under the hood of the rollup job (in particular, how it deals with indexing delays and with events added with a past timestamp, etc.)
- which factors influence the ability to aggregate all events (hardware specs, indexing rate, indexing delay, etc.)
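Our working hypothesis (an assumption, not something we have confirmed from the rollup internals) is that once a run has rolled up a bucket, events that become searchable later but fall into that already-processed bucket are never revisited, so any event whose indexing lag exceeds the latency buffer is lost. A small simulation of that model, with a made-up lag distribution (mostly a few seconds, plus a heavy tail of late arrivals), reproduces numbers in the range we observe:

```python
import random

def missed_fraction(lags_s, delay_s, interval_s=60):
    """Fraction of events lost under a simplified rollup model.

    Assumption: a bucket is rolled up once it is older than the
    latency buffer (delay_s), and late events landing in an
    already-rolled-up bucket are never picked up. An event is safe
    only if its indexing lag fits within delay_s plus at most one
    bucket interval of slack.
    """
    missed = sum(1 for lag in lags_s if lag > delay_s + interval_s)
    return missed / len(lags_s)

random.seed(42)
# Hypothetical lag distribution: 95% of events become searchable
# within seconds (mean ~5 s); 5% arrive late, anywhere from one
# minute to one hour behind their event timestamp.
lags = [random.expovariate(1 / 5) for _ in range(9500)] + \
       [random.uniform(60, 3600) for _ in range(500)]

print(missed_fraction(lags, delay_s=0))     # no latency buffer: ~5% lost
print(missed_fraction(lags, delay_s=3600))  # 1h buffer: nothing lost
```

Under this model, the required latency buffer is simply an upper bound on indexing lag, so measuring the real lag distribution (e.g. the gap between the event timestamp and an ingest timestamp stamped at index time) would tell us how small the buffer can safely be. We have not verified that the rollup job actually behaves this way, which is part of the question.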
So this is an open question: how do we determine the required latency buffer and the factors that influence it? What is the rollup job's current ability to deal with indexing delays? Can anyone share experience running near-real-time rollups? And what are the recommendations for reducing the required latency buffer?
Any help and knowledge share is appreciated!