I am in the process of building a log analysis environment based on Elastic Stack in a classic setup:
Beats/Syslog -> Logstash -> Elasticsearch
However, there is a requirement for some of the logs to be stored for a long time (multiple years), and I don't want to keep indices in ES for that long, mostly due to performance concerns. I do want to be able to reindex these logs if the need arises in the future. My options so far seem to be:
Use the snapshot API in ES to retire old indices and restore them when needed. This seems quite handy, but it may cause compatibility issues if the restore is done years after the snapshot. Also, a complete index snapshot is not a very good standard format for archiving; I would prefer something as close to the input format as possible, yet still structured (JSON would be nice).
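For reference, the snapshot workflow I have in mind is roughly the sketch below (the repository name, filesystem path, index name and cluster address are just placeholders, and the location would have to be whitelisted under path.repo in elasticsearch.yml):

```python
import requests

ES = "http://localhost:9200"  # placeholder cluster address

# Register a shared filesystem repository once; the location must be
# listed under path.repo in elasticsearch.yml on every node.
requests.put(ES + "/_snapshot/archive_repo", json={
    "type": "fs",
    "settings": {"location": "/mnt/es-archive"},
})

# Snapshot one daily index before Curator deletes it.
requests.put(
    ES + "/_snapshot/archive_repo/logstash-2016.01.01",
    json={"indices": "logstash-2016.01.01"},
    params={"wait_for_completion": "true"},
)

# Restoring later would be a POST to
# /_snapshot/archive_repo/logstash-2016.01.01/_restore
```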
Use some kind of third-party dump tool, like taskrabbit/elasticsearch-dump. This would export the data as JSON objects that could be archived for years, and the restore process seems quite simple as well. The downsides seem to be performance-related, and I don't know how well it scales when hundreds of GB need to be dumped/restored.
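I could probably script the same kind of export myself with the official Python client's scan helper, something like the sketch below (index name, output path and cluster address are made up), but I assume it would run into the same scaling questions as the dump tool:

```python
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])   # placeholder cluster address

# Walk the whole index with the scroll API and write one JSON object
# per line, similar to what the Logstash json_lines output produces.
with open("archive/logstash-2016.01.01.json", "w") as out:
    for hit in helpers.scan(es, index="logstash-2016.01.01",
                            query={"query": {"match_all": {}}}):
        out.write(json.dumps(hit["_source"]) + "\n")
```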
Use an additional Logstash output that saves the logs for archiving as well as sending them to ES. This seems to me like the simplest solution, as I don't have to manage snapshots/restores and can just delete indices with Curator when they reach a certain age. I have made some tests with the standard file output in Logstash, which saves each event as a JSON line. This is great if you want to locate events with grep before importing them into ES. The problem is that this format (the json_lines codec) is not something that can be indexed into ES as is. I have tried the bulk API, but it requires additional header fields. Of course, you could just loop through each line and add the headers, but I feel there must be a less "hacky" solution.
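To make the "loop through each line and add the headers" part concrete, this is roughly what I mean (cluster address, target index and archive path are made up); it works, but it feels like something Logstash or ES should be able to do for me:

```python
import json
import requests

ES = "http://localhost:9200"          # placeholder cluster address
INDEX = "logs-restored-2016.01"       # placeholder target index

# Build the bulk request body by interleaving an action line with every
# archived event, then POST the whole thing to the _bulk endpoint.
lines = []
with open("archive/logs-2016.01.json") as f:
    for line in f:
        lines.append(json.dumps({"index": {"_index": INDEX}}))
        lines.append(line.rstrip("\n"))

body = "\n".join(lines) + "\n"        # bulk body must end with a newline
resp = requests.post(ES + "/_bulk", data=body,
                     headers={"Content-Type": "application/x-ndjson"})
print("errors:", resp.json().get("errors"))
```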
How do you guys handle this type of problem? Any comments or ideas are highly appreciated.