You have quite a few objectives, so to keep things simple I'll start by explaining the architecture I run, which covers 90+ servers and about 300GB of logs a day. There are many possible designs, and this one was my first attempt, so there may be both good and bad ideas in here.
Sorting out and making indexes
The first thing to think about is your indexes. The bigger they are, the more brute force you need when searching.
I tend to use the "type" field to sort out my data, but feel free to get creative for your needs.
input {
  file {
    type => "apache-access-log"
    path => ["/var/log/http/access_log"]
  }
}
output {
  elasticsearch {
    host => localhost
    index => "%{type}-%{+YYYY.MM.dd}"
  }
  # stdout { codec => "json" }
}
(Note the type is lowercase here: since it gets interpolated into the index name, and Elasticsearch index names must be lowercase.)
Architecture
Some ideas on design
- Logstash Forwarder -> Queue -> parsing indexer -> Elasticsearch
- Logstash Parser and Indexer -> Elasticsearch
Or, what I have:
- Logstash (parsing) -> queue -> sub-parser & indexer -> Elasticsearch
Explanation of my architecture
I do the bulk of my parsing on every server: merging multi-line events, parsing the data, and tagging the lines with important information like "web_access_logs", "Prod", etc. (90 servers doing the parsing is better than a couple of central ones, IMHO.)
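A minimal sketch of what that shipper-side work can look like in a Logstash filter block (the tag names are the examples from above; the multiline pattern, which merges whitespace-indented continuation lines, is an assumption):

```
filter {
  # Merge continuation lines (e.g. stack traces) into the previous event
  multiline {
    pattern => "^\s"
    what    => "previous"
  }
  # Tag each event so the central indexer can route it without re-parsing
  mutate {
    add_tag => ["web_access_logs", "Prod"]
  }
}
```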
The queue I run is Redis, though I think I will be moving to Apache Kafka; pick whichever one you like the most.
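Shipping into Redis from each server is just another output block. A sketch, assuming a list-based queue (the host name and key are placeholders, not my real ones):

```
output {
  redis {
    host      => "redis.example.com"  # placeholder queue host
    data_type => "list"
    key       => "logstash"
  }
}
```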
The indexer is where I sort out the data: determine which index I want to place each event in, throw away data I don't need, send information to Nagios, and do any remaining generic parsing I might want.
Note: this indexer has to be able to process every log, so you may need several of them, which is why there is a queue.
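The indexer side then reads from the same Redis list and does the routing. A sketch, using the same placeholder host/key; the healthcheck drop rule and the Nagios command file path are illustrative assumptions, not my exact setup:

```
input {
  redis {
    host      => "redis.example.com"  # same placeholder host/key as the shippers
    data_type => "list"
    key       => "logstash"
  }
}

filter {
  # Throw away noise we never want to index (illustrative rule)
  if [message] =~ "healthcheck" {
    drop { }
  }
}

output {
  elasticsearch {
    host  => localhost
    index => "%{type}-%{+YYYY.MM.dd}"  # the "type" set on each server picks the index
  }
  # Forward tagged events to Nagios (command file path is an assumption)
  if "nagios_alert" in [tags] {
    nagios { commandfile => "/var/lib/nagios3/rw/nagios.cmd" }
  }
}
```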
Elasticsearch:
Here I do a couple of things:
- Indexes are daily.
- I create mappings with aliases. An alias of "loadavg" is easier to query than "loadavg-2015.06.21, loadavg-2015.06.22" or "loadavg*", which could match other indexes I might not want.
- I use Curator in my crontab to clean up older data; after all, we only have so much disk space. You might find other approaches, like snapshots or exporting the data: https://github.com/elastic/curator
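One way to get the alias onto every daily index automatically is to put it in an index template, so each new index picks it up at creation time. A sketch against the Elasticsearch REST API (the template name and pattern are examples):

```
curl -XPUT 'localhost:9200/_template/loadavg_template' -d '{
  "template": "loadavg-*",
  "aliases": { "loadavg": {} }
}'
```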
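And the crontab entry for Curator might look something like this (a sketch assuming Curator 3.x command-line flags; check `curator --help` for your version, and the 30-day retention is just an example):

```
# Delete daily indexes older than 30 days, every night at 02:00
0 2 * * * curator --host localhost delete indices --older-than 30 --time-unit days --timestring '%Y.%m.%d'
```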