We have a dozen or so servers whose access logs we collect with Logstash running on each one, and the logs are sent to our ES cluster. I've now configured a 1-hour rollup in our ES cluster. Over the years we've had a few occasions where Logstash was down for a day or two before we caught the problem, at which point we started Logstash back up and all the old records were sent to ES.
But what happens with rollups in this scenario? If the rollup job is running every hour, will it detect that older data was added to ES, or will that data simply be ignored?
The latter option: late data is currently ignored after the time period is rolled up.
If you know there is a reasonable chance of late-arriving data, you can set the delay parameter on the rollup job. That prevents the job from rolling up a bucket of time until the delay has passed. E.g. if you set it to 24h, it will wait 24 hours after the bucket is "finished" before rolling it up, to allow late data to arrive.
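To make the delay setting concrete, here is a rough sketch of what a job body with a 24h delay might look like, built as a Python dict and printed as JSON for a PUT to _rollup/job/<job_id>. The index names, the @timestamp and bytes fields, and the metrics are made up for illustration. Older releases use "interval" rather than "fixed_interval" in the date_histogram group, so check the docs for your version.

```python
import json


def build_rollup_job(delay: str = "24h") -> dict:
    """Build a rollup job body with a delay on the date_histogram group.

    All index and field names here are hypothetical placeholders.
    """
    return {
        "index_pattern": "access-logs-*",      # hypothetical source indices
        "rollup_index": "access-logs-rollup",  # hypothetical target index
        "cron": "0 0 * * * ?",                 # run at the top of every hour
        "page_size": 1000,
        "groups": {
            "date_histogram": {
                "field": "@timestamp",         # assumed timestamp field
                "fixed_interval": "1h",        # "interval" on older versions
                "delay": delay,                # hold buckets back for late data
            }
        },
        "metrics": [
            {"field": "bytes", "metrics": ["sum", "avg"]}  # assumed metric field
        ],
    }


job = build_rollup_job("24h")
print(json.dumps(job, indent=2))
```

The printed JSON is only the request body; sending it (curl, Kibana Dev Tools, a client library) is left out on purpose.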
The downside is of course your rollups (and aggregations/visualizations/etc) are always lagging by the delay amount. If you don't mind the delay -- or have live data to fill in the gap -- adding a day or seven to the delay would be the solution.
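As a back-of-the-envelope illustration of that lag, the sketch below models the rule described above (a bucket is only rolled up once the delay has passed after the bucket ends). It is not how the rollup job is implemented internally, and the helper name is made up.

```python
from datetime import datetime, timedelta, timezone


def bucket_is_eligible(bucket_start: datetime, interval: timedelta,
                       delay: timedelta, now: datetime) -> bool:
    """Hypothetical helper: True once the bucket has ended and the delay has passed."""
    bucket_end = bucket_start + interval
    return bucket_end + delay <= now


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
hour = timedelta(hours=1)
day = timedelta(hours=24)

# Bucket 11:00-12:00 on Jan 1 ended exactly 24 hours ago, so it can be rolled up.
old_bucket = bucket_is_eligible(datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc), hour, day, now)

# Bucket 12:00-13:00 on Jan 1 ended only 23 hours ago, so it is still held back.
newer_bucket = bucket_is_eligible(datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc), hour, day, now)

print(old_bucket, newer_bucket)
```

With a 1-hour interval and a 24h delay, the newest rolled-up bucket is always at least a day behind the clock, which is the lag you would need live data to cover.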
It's theoretically possible to go back and update the bucket when new data arrives. We'd probably need the user to invoke some sort of "re-rollup" API to target a specific time period, and then we could go remove all existing docs for that interval and re-rollup. But it sounded sufficiently complicated that we decided to leave that on the wishlist for now.
Yeah, a "re-rollup" API would be amazing, but I can appreciate its complexity. Would it work for a "re-rollup" to create a newer "_version" of the given doc and then remove the doc with the older _version?
For now, I've resigned myself to re-indexing my rollup whenever we're missing data. This happens a couple of times a year.
Actually, if the _rollup_search could point to an alias, this would partially solve the problem for us. If we created a new version of a rollup, we would simply update the alias to point to the newer index.
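For what it's worth, the alias switch itself would be a single atomic call to the _aliases API. Below is a minimal sketch of the request body, with made-up alias and index names. Whether _rollup_search will accept the alias is exactly the open question here, so treat this as the swap half only.

```python
import json


def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST _aliases: move the alias from the old index to the new one.

    Both actions are applied atomically by the _aliases endpoint.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: a rebuilt rollup index replaces the previous one.
swap = build_alias_swap("access-logs-rollup", "access-logs-rollup-v1", "access-logs-rollup-v2")
print(json.dumps(swap, indent=2))
```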