I am in the process of building a log analysis environment based on Elastic Stack in a classic setup:
Beats/Syslog -> Logstash -> Elasticsearch
However, there is a requirement for some of the logs to be stored for a long time (multiple years), and I don't want to keep indices in ES for that long, mostly due to performance concerns. I do want to be able to reindex these logs if the need arises in the future. My options so far seem to be:
1. Use the snapshot API in ES to retire old indices and restore them when needed. This seems quite handy, but it may cause compatibility issues if the restore is done years after the snapshot. Also, a complete index snapshot is not a very good standard format for archiving. I would prefer something that is as close to the input format as possible, yet still structured (JSON would be nice).
2. Use a third-party dump tool, like taskrabbit/elasticsearch-dump. This would export the data as JSON objects that could be archived for years, and the restore process seems quite simple as well. The downsides seem to be performance-related, and I don't know how well it scales when hundreds of GBs are to be dumped/restored.
3. Use an additional Logstash output that saves the logs for archiving as well as sending them to ES. This seems to me to be the simplest solution, as I don't have to manage snapshots/restores and can just delete indices with Curator when they reach a certain age. I have made some tests with the standard file output in Logstash, which saves each event as JSON lines. This is great if you want to locate events with grep before importing them into ES. The problem is that this format (the json_lines codec) is not something that can be indexed into ES as-is. I have tried the bulk API, but it requires an additional header line per document. Of course you could just loop through each line and add the headers, but I feel there must be a less "hacky" solution.
How do you guys handle this type of problem? Any comments or ideas are highly appreciated.
Our sýnesis™ Solutions provide this option. We use a multi-tier ingest architecture, where collected data is first captured and pushed to a message queue (important for environments expecting to ingest high volumes of UDP traffic, e.g. syslog, SNMP traps, NetFlow, etc.). There are then two options:
1. As data is pushed into the queue it can also be written to files. In the output section of our collection pipelines it looks like this:
# Archive raw data if enabled.
if [@metadata][archive_raw] == "true" {
  if [@metadata][archive_raw_period] == "monthly" {
    file {
      path => "${SYNESIS_ARCHIVE_RAW_PATH:/var/log/logstash/archive}/%{type}/%{host}-%{+YYYY.MM}.json.gz"
      codec => json_lines
      gzip => true
    }
  } else {
    file {
      path => "${SYNESIS_ARCHIVE_RAW_PATH:/var/log/logstash/archive}/%{type}/%{host}-%{+YYYY.MM.dd}.json.gz"
      codec => json_lines
      gzip => true
    }
  }
}
[@metadata][archive_raw] and [@metadata][archive_raw_period] could be set in a number of ways, such as environment variables, the translate filter, etc. You might also want to do weekly or even hourly archive files. It depends on your requirements.
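For example, a minimal sketch of setting those fields from environment variables with a mutate filter (the SYNESIS_ARCHIVE_RAW and SYNESIS_ARCHIVE_RAW_PERIOD variable names here are just placeholders):

filter {
  mutate {
    # Hypothetical environment variables; the values after ":" are defaults used when they are unset.
    add_field => {
      "[@metadata][archive_raw]" => "${SYNESIS_ARCHIVE_RAW:true}"
      "[@metadata][archive_raw_period]" => "${SYNESIS_ARCHIVE_RAW_PERIOD:daily}"
    }
  }
}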
2. If Kafka is used for queueing, you can have a similar process listening to the "raw data" topics and writing the data to files in the same manner.
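Roughly, such an archiver pipeline could look like this (assuming a recent kafka input plugin; the broker address, topic name and path are just examples):

input {
  kafka {
    bootstrap_servers => "localhost:9092"
    topics => ["raw-logs"]   # example topic carrying the raw data
    codec => json
  }
}
output {
  file {
    path => "/var/log/logstash/archive/%{type}/%{host}-%{+YYYY.MM.dd}.json.gz"
    codec => json_lines
    gzip => true
  }
}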
I would recommend that as you collect data you immediately assign it a UUID (logstash-filter-uuid). Then use this UUID to set the document_id in the elasticsearch output. This allows you to directly correlate archived logs with live logs. It will also allow you to replay archived logs, such as when adding updated processing logic for the messages, and easily overwrite the live logs with the improved version.
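A minimal sketch of that idea (the field name, hosts and index pattern are just examples):

filter {
  uuid {
    target => "uuid"   # stored as a normal field so it also ends up in the archive files
  }
}
output {
  elasticsearch {
    hosts       => ["localhost:9200"]
    index       => "logstash-%{+YYYY.MM.dd}"
    document_id => "%{uuid}"
  }
}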
As you mention, the file input would be the way to "restore" archived data. The nice thing about having a message queue in the ingest architecture is that you can easily replay the archive data with a simple file input and message queue output (e.g. redis or kafka - depending on which you are using). You are basically injecting it into the data flow as if it were live, without the need to mess around with the rest of your environment. Since the archived data has timestamps and UUIDs it is put in place in Elasticsearch exactly where it should be.
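As a rough sketch, a replay pipeline could be as simple as this (the paths and Redis key are just examples, and it assumes the gzipped archives have been decompressed first):

input {
  file {
    path           => "/var/log/logstash/archive/restore/*.json"
    start_position => "beginning"
    sincedb_path   => "/dev/null"   # don't track position; this is a one-off replay
    codec          => json
  }
}
output {
  redis {
    host      => "localhost"
    data_type => "list"
    key       => "logstash"
  }
}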
I love Kafka, but don't underestimate what can be done with Redis. It may be all you need and is much easier to get running. Even on a single-node deployment we use two instances of Logstash with Redis in the middle, and capture thousands of logs and network flows per second. In this way the concept I describe above works for a single-node install as well as a large multi-node cluster.
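In config terms that split is just a redis output on the collector and a matching redis input on the indexer, e.g. (host and key are examples):

# Collector / shipper pipeline
output {
  redis {
    host      => "127.0.0.1"
    data_type => "list"
    key       => "logstash"
  }
}

# Indexer pipeline
input {
  redis {
    host      => "127.0.0.1"
    data_type => "list"
    key       => "logstash"
  }
}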