One of the things I love about Logstash is the "sincedb" database, which tracks the current read position and, if interrupted, remembers where it left off. So if Logstash is down for, say, 4 hours, once restarted it catches up and nothing is lost.
I recently discovered the aggregate filter, but it doesn't behave this way. If I stop Logstash for a while, the aggregation simply restarts from the beginning. This should work, shouldn't it?
I'm processing Nginx access.log records (written in JSON) via one pipeline with two outputs. One output sends the raw records to ES, while the 5-minute aggregate is written to file. I also tried writing the aggregation to ES.
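For illustration, one way to split the two outputs inside a single pipeline is to route on a tag set when the aggregate flushes; the tag, file path, and index name below are placeholders, not the actual config:

```
output {
  if "aggregated" in [tags] {
    # 5-minute rollups flushed by the aggregate filter go to a file
    file {
      path => "/var/log/logstash/nginx_5m_aggregates.json"
      codec => "json_lines"
    }
  } else {
    # raw access.log records go straight to Elasticsearch
    elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "nginx-access-%{+YYYY.MM.dd}"
    }
  }
}
```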
Should aggregations be in their own pipeline, separate from the raw processing & output?
{"logstash.version"=>"7.3.2"}
Here's my input configuration.
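A minimal sketch of what such a file input could look like; the log path and sincedb location are placeholders, not the original values:

```
input {
  file {
    # Nginx access log written as one JSON object per line
    path => "/var/log/nginx/access.log"
    codec => "json"
    # the sincedb file records the read offset so a restart resumes where it left off
    sincedb_path => "/var/lib/logstash/sincedb_nginx"
  }
}
```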
Using an aggregate filter does not affect what an input reads. By default an aggregate starts from scratch when logstash restarts. You can use the aggregate_maps_path to persist the contents of the maps across restarts.
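As a sketch, persisting the maps looks like this; the task_id field, the counter, and the 5-minute timeout are illustrative assumptions, and aggregate_maps_path is the relevant option:

```
filter {
  aggregate {
    # hypothetical field holding the 5-minute bucket the event belongs to
    task_id => "%{agg_bucket}"
    code => "map['requests'] ||= 0; map['requests'] += 1"
    push_map_as_event_on_timeout => true
    timeout => 300
    timeout_tags => ["aggregated"]
    # write in-flight maps to disk on shutdown and reload them on startup
    aggregate_maps_path => "/usr/share/logstash/aggregate_maps.db"
  }
}
```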
Agreed, yes, I'm using aggregate_maps_path. Here's my configuration. When I start Logstash it always confirms: [INFO ][logstash.filters.aggregate] Aggregate maps loaded from : /usr/share/logstash/aggregate_maps.db.
Ah, adding timeout_timestamp_field took me one step further. Now when I start Logstash, all the events that were missed while it was down are sent to ES. However, they all get the current @timestamp in ES. I'm trying to figure that out now.
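timeout_timestamp_field makes the filter evaluate timeouts against a field on the event rather than the system clock, so the backlog read after a restart is still bucketed by its own time. A hedged sketch building on the filter above; keeping the log time in the map (the log_timestamp field is an assumption) is what later allows restoring @timestamp:

```
filter {
  aggregate {
    task_id => "%{agg_bucket}"
    code => "
      map['requests'] ||= 0
      map['requests'] += 1
      # remember the original event time so the flushed aggregate can carry it
      map['log_timestamp'] ||= event.get('@timestamp').to_s
    "
    push_map_as_event_on_timeout => true
    timeout => 300
    timeout_tags => ["aggregated"]
    aggregate_maps_path => "/usr/share/logstash/aggregate_maps.db"
    # base timeouts on the event's own timestamp instead of the wall clock
    timeout_timestamp_field => "@timestamp"
  }
}
```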
It's working now! I needed to map the @timestamp field as follows, so that Elasticsearch doesn't use the @timestamp Logstash generates when it creates the event, but instead the one from the log file.
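One common way to do this is a date filter on the flushed aggregate, copying the timestamp carried in the map (log_timestamp is the assumed field name from the sketch above) into @timestamp; this is a sketch of the approach, not the exact snippet:

```
filter {
  if "aggregated" in [tags] {
    date {
      # overwrite the @timestamp assigned when the aggregate was flushed
      # with the time taken from the log file
      match => ["log_timestamp", "ISO8601"]
      target => "@timestamp"
    }
  }
}
```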