I've got 121 GB of compressed Apache log files from 2014, and we'd like to get all of it into Elasticsearch.
Based on rough calculations from a few sample days, the import will take approximately 14 days. That's reasonable, but I'd like to think I can do better.
From reading around here, I gather I can tweak thread pools and queues, and I'll be looking into that - playing with settings and seeing how much time I can save on my sample runs.
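For example, this is roughly the kind of knob I mean - the setting names are for the ES 1.x / Logstash 1.x versions I'm on, and the values are just starting guesses, not recommendations:

# elasticsearch.yml
threadpool.bulk.queue_size: 500    # default bulk queue is small; imports hit rejections quickly
index.refresh_interval: 30s        # refresh less often while bulk loading

# and more filter workers on the Logstash side:
/opt/logstash/bin/logstash -w 4 -f fileimport.conf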
But I came to ask you guys about my input methods.
A. Organization
16 application servers. I've got the data as one gz file per day, in a directory per server under /var/log/2014, so:
/var/log/2014/app01
/var/log/2014/app02
etc
B. Method
My test run looks like the following. It occurred to me this could be faster when I noticed that adding the date filter appears to add a bit over a minute to the running time (see the timing note after the config).
zcat /var/log/muo-test/app01/20141201.gz | /opt/logstash/bin/logstash -f fileimport.conf
fileimport.conf
input {
  stdin {
    type => "apache"
  }
}

filter {
  if [type] == "apache" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
    geoip {
      source => "clientip"
      target => "geoip"
      database => "/etc/logstash/GeoLiteCity.dat"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}" ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]
    }
    if [clientip] in ["10.1.88.11", "10.1.88.12", "10.1.88.13", "10.1.88.14", "10.1.88.15", "10.1.88.16", "10.1.42.117", "10.1.42.118", "10.1.42.119", "10.1.88.21", "10.1.88.22", "10.1.88.23", "10.1.88.24", "10.1.88.25", "10.1.88.26", "10.1.42.127", "10.1.42.128", "10.1.42.129"] {
      drop {}
    }
  }
}
output {
  elasticsearch {
    cluster => "elasticsearch.local"
    host => "127.0.0.1"
    protocol => "http"
    index => "logstash-2014"
    index_type => "apache"
  }
}
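(On the date filter cost: a simple way to isolate it is to time the same file through two copies of the config, one with the date block commented out - fileimport-nodate.conf below is just that hypothetical stripped copy:)

time zcat /var/log/muo-test/app01/20141201.gz | /opt/logstash/bin/logstash -f fileimport.conf
time zcat /var/log/muo-test/app01/20141201.gz | /opt/logstash/bin/logstash -f fileimport-nodate.conf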
C. For the actual run, I plan to drive it from bash, using either nohup or GNU parallel to run multiple instances of the input process against each directory - one process for ./app01/201401*, a second for ./app01/201402*, a third for ./app02/201401*, and so on (roughly as sketched below). I'll launch it against a test month and see how many processes I can have running before I exhaust RAM.
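The GNU parallel version of what I have in mind - the -j 4 job count is only a starting guess that I'll raise until RAM becomes the limit:

parallel -j 4 'zcat /var/log/2014/{1}/{2}*.gz | /opt/logstash/bin/logstash -f fileimport.conf' \
    ::: app{01..16} ::: 2014{01..12}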
D. Oh, and I need to figure out a way to parameterize the index value - I think the default method of breaking indices into daily chunks is a good idea.
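If I understand it right, that should just be the standard date math in the index name - since the date filter has already set @timestamp from the log line's own timestamp, old events should land in the correct daily index:

index => "logstash-%{+YYYY.MM.dd}"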
E. Your thoughts?