We are using a real-time rollup job to aggregate events into metrics. The scenario:
Events come from Logstash; peak ingestion is around 100k-120k events per minute.
The rollup job is scheduled to run every minute, and the time bucket size is 1m.
We've noticed an anomaly with our scenario:
If the rollup job runs with no latency buffer, the count of events in the aggregated data is 5-7% lower than in the raw events.
If the rollup job runs with a latency buffer of 1h, the aggregated data perfectly matches the raw events.
Our goal is to have aggregated metrics as close to real time as possible, which ideally means finding the smallest latency buffer that still ensures no events are lost in aggregation. This is hard to achieve without a clear understanding of:
what is going on behind the scenes of a rollup job (in particular, how it deals with indexing delays, with events added with a past timestamp, etc.)
the factors that influence the ability to aggregate all events (hardware specs, indexing rate, indexing delay, etc.)
So this is an open question: how do we determine the required latency buffer and the factors that influence it? How well do rollups currently cope with indexing delays? Can anyone share their experience of running real-time rollups? What are the recommendations for reducing the required latency buffer?
The rollup search API supports querying rolled-up data and raw data at the same time, so I do not understand why delaying rollup processing would be a problem. The delay is supposed to be greater than your maximum indexing delay.
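For reference, here is a minimal sketch of where that delay sits in a rollup job definition. As far as I know, the "Latency buffer" setting in the UI corresponds to the delay option of the date_histogram group. The index names, the "bytes" metric field and the 5m value are placeholders of mine, not taken from this thread, and older versions use "interval" rather than "fixed_interval".

```python
import json


def build_rollup_job(delay: str) -> dict:
    """Build a rollup job body with the given date_histogram delay."""
    return {
        "index_pattern": "logstash-*",    # assumed source index pattern
        "rollup_index": "events_rollup",  # hypothetical target index
        "cron": "0 * * * * ?",            # fire at second 0 of every minute
        "page_size": 1000,
        "groups": {
            "date_histogram": {
                "field": "@timestamp",
                "fixed_interval": "1m",   # 1m time buckets, as in the question
                "delay": delay,           # wait this long before rolling up a bucket
            }
        },
        # hypothetical metric field, for illustration only
        "metrics": [{"field": "bytes", "metrics": ["sum", "value_count"]}],
    }


job = build_rollup_job("5m")
print(json.dumps(job, indent=2))
```

The printed JSON is what would be sent as the body when creating the job. Only the delay value changes between the "no buffer" and "1h buffer" cases described above.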
The current rollup implementation has certain limitations, in particular: a single index (i.e. no daily pattern), no enrichment possibilities during or after rolling up, query limitations, and lifecycle management limitations. Therefore we are creating our own aggregations based on the rollup index, with enrichment in the process. That's where the delay nuances come from, as well as the question of what goes on behind the scenes of rollup jobs.
The delay is supposed to be greater than your maximum indexing delay.
That is a good answer to one of our questions (i.e. rollups vs. indexing delays)! By the way, are there any good links on how to calculate indexing delays? But the question of whether rollups are able to aggregate events added with a past timestamp remains open...
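One possible way to estimate the indexing delay, sketched below, is to compare each event's own timestamp with the time it was actually indexed and take the worst case plus some headroom. This assumes every document carries a second timestamp recording when it was indexed (for example, one stamped by an ingest pipeline), which is my assumption and not something confirmed in this thread. The function names and the 1.5 safety factor are purely illustrative.

```python
import math
from datetime import datetime, timedelta


def indexing_delays(events):
    """Return sorted delays in seconds for (event_time, indexed_time) pairs."""
    return sorted((indexed - event).total_seconds() for event, indexed in events)


def suggest_delay_seconds(events, safety_factor=1.5):
    """Suggest a rollup delay: the worst observed indexing delay plus headroom."""
    delays = indexing_delays(events)
    if not delays:
        raise ValueError("no events to measure")
    worst = max(delays[-1], 0.0)  # ignore negative delays caused by clock skew
    return math.ceil(worst * safety_factor)


# Hypothetical sample: three events indexed 2s, 40s and 7s after they occurred.
base = datetime(2020, 1, 1, 12, 0, 0)
sample = [
    (base, base + timedelta(seconds=2)),
    (base, base + timedelta(seconds=40)),
    (base, base + timedelta(seconds=7)),
]
print(suggest_delay_seconds(sample))  # prints 60
```

Measured over a representative window that includes peak load, the result gives a starting point for the latency buffer, which can then be checked by comparing rolled-up counts against raw counts.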