Shoving old log data into Elasticsearch: I feel the need, the need for speed

I've got 121 GB of compressed Apache log files from 2014. We'd like to get all of it into Elasticsearch.

Based on rough calculations from some sample days, it will take approximately 14 days. That's reasonable, but I'd like to think I can do better.

From reading around here, I see I can tweak thread pools and queues, and I'll be looking into that - playing with settings and seeing how much time I can save on my sample runs.

But I came to ask you guys about my input methods.

A. Organization
16 application servers. I've got the data as one gz file per day, in a directory per server under /var/log/2014, so:

/var/log/2014/app01
/var/log/2014/app02
etc

B. Method

My test run looks like this. It occurred to me this could be faster when I noticed that adding the date filter appears to add a bit over a minute to the running time.

zcat /var/log/muo-test/app01/20141201.gz | /opt/logstash/bin/logstash -f fileimport.conf

fileimport.conf

input {
  stdin {
    type => "apache"
  }
}


filter {
  if [type] == "apache" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
    geoip {
      source => "clientip"
      target => "geoip"
      database => "/etc/logstash/GeoLiteCity.dat"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float"]
    }

    if [clientip] in ["10.1.88.11", "10.1.88.12", "10.1.88.13", "10.1.88.14", "10.1.88.15", "10.1.88.16", "10.1.42.117", "10.1.42.118", "10.1.42.119", "10.1.88.21", "10.1.88.22", "10.1.88.23", "10.1.88.24", "10.1.88.25", "10.1.88.26", "10.1.42.127", "10.1.42.128", "10.1.42.129"]  {
       drop {}
    }
  }
}


output {
    elasticsearch {
        cluster => "elasticsearch.local"
        host => "127.0.0.1"
        protocol => "http"
        index => "logstash-2014"
        index_type => "apache"
    }
}

C. For the actual run I plan to drive it from bash, using either nohup or GNU parallel to run multiple instances of the input process against each directory - one process for ./app01/201401*, a second for ./app01/201402*, a third for ./app02/201401*, and so on. I'll launch it against a test month and see how many processes I can have running before I exhaust RAM. A sketch of the parallel version is below.
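
Something like this is what I have in mind for the GNU parallel version - untested, and the directory layout, file-name glob, -j limit and -w value are all placeholders to tune once I see what RAM usage looks like:

#!/bin/bash
# One logstash process per (server, month) chunk, at most 4 running at a time.
# Relies on bash brace expansion for the server and month lists.
parallel -j 4 --eta \
  'zcat /var/log/2014/{1}/{2}*.gz | /opt/logstash/bin/logstash -w 14 -f fileimport.conf' \
  ::: app{01..16} ::: 2014{01..12}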

D. Oh, and I need to figure out a way to parameterize the index value - I think the default method of breaking indices into daily chunks is a good idea.
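
(If I'm reading the docs right, this may just be a matter of putting the default sprintf-style date pattern back into the output, i.e. index => "logstash-%{+YYYY.MM.dd}" - since the date filter sets @timestamp from the parsed log line, each event should land in the daily index for its original date rather than in one big logstash-2014 index.)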

E. Your thoughts?

Don't play with thread pools; you're just going to be pushing the load elsewhere without solving the problem.

You're also going to be limited by what zcat can provide; you might be better off extracting the files and then using workers (the -w argument) to run across multiple files. Also try altering the flush_size parameter in your output - you may find your cluster works better with a lower or higher value.

Finally, we don't know what your ES cluster looks like (how many nodes, what config, etc.), which would be useful to know. And are you monitoring your ES cluster to see if it's resource constrained?

I'll get the rest of the information up next, but I fooled around with the -w (worker) value against a single file, letting the thing run through different values over dinner and a walk. The results appear encouraging.

I simply mashed this into a bash file, letting each value run three times (-w 8, then -w 10, and so on) and manually averaged the results ...

time zcat /var/log/muo-test/app01/missuniverse.com-access_log-20141201.gz | /opt/logstash/bin/logstash -w $value -f fileimport.conf
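
The wrapper itself was nothing fancy - roughly the following, reconstructed from memory, so treat it as a sketch rather than the exact script:

#!/bin/bash
# Three timed runs per worker count; averages were worked out by hand afterwards.
for value in 8 10 12 14; do
  for run in 1 2 3; do
    echo "workers=$value run=$run"
    time zcat /var/log/muo-test/app01/missuniverse.com-access_log-20141201.gz \
      | /opt/logstash/bin/logstash -w $value -f fileimport.conf
  done
done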

A table with data ...

| Workers | flush_size | Real time | Avg (min) |
|--------:|-----------:|----------:|----------:|
|       8 |       5000 | 5m4.887s  |           |
|       8 |       5000 | 5m13.711s |           |
|       8 |       5000 | 5m20.452s |      5.24 |
|      10 |       5000 | 5m9.636s  |           |
|      10 |       5000 | 5m13.559s |           |
|      10 |       5000 | 4m54.661s |      5.19 |
|      12 |       5000 | 5m1.616s  |           |
|      12 |       5000 | 5m5.034s  |           |
|      12 |       5000 | 4m48.083s |      5.02 |
|      14 |       5000 | 4m58.269s |           |
|      14 |       5000 | 5m1.204s  |           |
|      14 |       5000 | 5m1.118s  |       4.9 |

[quote="warkolm, post:2, topic:2389"]
you might be better off extracting the files and then using workers (-w argument) to run across multiple files[/quote]

A thing I ran into using the file input with path => "/var/log/test/app01/*" was that the process did not exit - it's expecting more input and of course, with those files, there isn't going to be any more.

However, when I tried this it complained about an "ambiguous redirect":

/opt/logstash/bin/logstash -f fileimport.conf -w 14 < /var/log/test/app01/2014120*

I am new to Elasticsearch. I see how to obtain node information (below). Is there a single GET that will dump all configuration information?

_cat/nodes

  localhost 127.0.0.1 82 34 3.93 d * Janet van Dyne              
  localhost 127.0.0.1 27         c - logstash-0.0.0.0-30607-7950 

I am running htop for a quick eyeball of stats. I have installed the head and bigdesk plugins, and I'm using bigdesk to watch the server.
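
If I'm reading the docs right, the _cat thread pool endpoint should also show whether bulk indexing is backing up - non-zero rejected counts would point at the cluster rather than at logstash:

# bulk.active / bulk.queue / bulk.rejected are the interesting columns here
curl 'localhost:9200/_cat/thread_pool?v'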

[quote]
However, when I tried this it complained about an "ambiguous redirect":

/opt/logstash/bin/logstash -f fileimport.conf -w 14 < /var/log/test/app01/2014120*
[/quote]

Yeah, you can't redirect multiple files like that. But you can do this:

cat /var/log/test/app01/2014120* | /opt/logstash/bin/logstash -f fileimport.conf -w 14

[quote]
I am new to Elasticsearch. I see how to obtain node information (below). Is there a single GET that will dump all configuration information?
[/quote]

The cluster state API should be useful.
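
Something like this should dump the cluster metadata, routing table and cluster-level settings in one go, and the nodes API will give you per-node settings on top of that:

# full cluster state
curl 'localhost:9200/_cluster/state?pretty'
# per-node settings, JVM, OS and thread pool info
curl 'localhost:9200/_nodes?pretty'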

Ah!

pastebin of the data

If you only have one node, then adding more will increase performance.
