Speeding up indexing of log files

Hi!

Right now I'm indexing about 15 log files containing roughly 2 million log events in total. This takes a very long time, and I am wondering if there is a way to speed up the process.

Best regards

Simon

- How many events per second do you get?
- Are you sending them to Elasticsearch?
- What kind of hardware do you have?
- How many ES nodes?
- Do you have a single file input for all files?
- Are you saturating your CPUs or does it look like there's room to grow in that area?
- Have you increased the Logstash filter workers beyond the default value of one (changed with the -w startup option)?
- What kind of Logstash filters do you have? Any that are likely to add latency (like dns)?

How many events per second do you get?

- 1. Where do you check this?

Are you sending them to Elasticsearch?

- 2. Yes, I am.

What kind of hardware do you have?

- 3. I have an Intel(R) Xeon(R) X5650 CPU @ 2.67GHz with 12 cores.

How many ES nodes?

- 4. Default (I guess you change this in the elasticsearch.yml file).

Do you have a single file input for all files?

- 5. Yes, I have a single file input for all the files.

Are you saturating your CPUs or does it look like there's room to grow in that area?

- 6. Only 1 of the 12 cores.

Have you increased the Logstash filter workers beyond the default value of one (changed with the -w startup option)?

- 7. No, I haven't increased the Logstash filter workers beyond the default value. How does the -w startup option work?

What kind of Logstash filters do you have?

- 8. I only have two Logstash filters: the grok filter and the date filter.
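
(The filter section itself isn't shown in this thread; a generic sketch of such a grok + date combination, with a placeholder pattern, might look like this:)

filter {
  grok {
    # Placeholder pattern; the real one depends on the Nexus request log format.
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  date {
    # Parse the timestamp captured by grok into the event's @timestamp field.
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
  }
}
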
1. Where do you check this?

The kopf plugin can give you the current ingestion rate, but you can of course also measure the time it takes to ingest a known number of messages and divide.
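
To make the arithmetic concrete (purely hypothetical numbers): if the 2 million events in this thread took, say, 40 minutes to ingest, the rate would be 2,000,000 events / 2,400 seconds ≈ 830 events per second.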

4. Default (I guess you change this in the elasticsearch.yml file).

No, I meant how many ES servers you run. But just one then, I guess.

5. Yes, I have a single file input for all the files.

Okay. Then I think the files will be processed serially in a single thread. If you split the files into multiple file inputs, the processing should parallelize (for full effect, make sure you run more than one filter worker).
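
A minimal sketch of the difference (hypothetical paths; file input syntax as in the Logstash 1.x releases this thread appears to use):

input {
  # A single file input, even with several paths, is read by one input thread:
  # file {
  #   path => ["/var/log/app/a.log", "/var/log/app/b.log"]
  # }

  # Separate file inputs each get their own thread, so the files are read in parallel:
  file {
    path => "/var/log/app/a.log"
    start_position => "beginning"
  }
  file {
    path => "/var/log/app/b.log"
    start_position => "beginning"
  }
}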

How does the -w startup option work?

Pass -w N, where N is a reasonably small integer, via the LS_OPTS variable in /etc/default/logstash (Debian-based systems) or /etc/sysconfig/logstash (RPM-based systems).
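
For example, on a Debian-based system the relevant line might look like this (the value 4 is just an example; tune it to your machine):

# /etc/default/logstash
LS_OPTS="-w 4"

Note that shell variable assignments must not have spaces around the equals sign.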

On second thought, I don't know if I have a single input for all the files. This is what my config looks like on the input side.

input {
  syslog {
    port => 5514
    codec => "json"
  }
  file {
    path => "/var/externallogs_maven/request.log.2015-06-04"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_maven/request.log.2015-06-05"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_maven/request.log.2015-06-06"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_maven/request.log.2015-06-07"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_maven/request.log.2015-06-08"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_maven/request.log.2015-06-09"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_maven/request.log.2015-06-10"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_maven/request.log.2015-06-11"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_yum/request.log.2015-06-06"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_yum/request.log.2015-06-07"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_yum/request.log.2015-06-08"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_yum/request.log.2015-06-09"
    type => "nexus-log"
    start_position => "beginning"
  }
}

I tried putting LS_OPTS="-w 24" into /etc/default/logstash. The CPU usage went from 80% to between 250% and 300%, although it might be able to go even faster. Is there some way to see if I succeeded in adding more workers? I also had a stdout output with the rubydebug codec active in my Logstash config, along with the output to Elasticsearch. Is it possible that this took some CPU usage?

No, in that case you have multiple inputs. However, with a single filter worker thread you won't be able to saturate your 12 cores.

How do I split the files into multiple file inputs? Do you do this in the Logstash configuration?

You already have multiple inputs. What I think would help is having multiple filter workers (as previously described).

I wrote earlier that I thought putting LS_OPTS="-w 24" into /etc/default/logstash made it have multiple filter workers, but maybe I misunderstood your instructions.

I wrote earlier that I thought putting LS_OPTS="-w 24" into /etc/default/logstash made it have multiple filter workers, but maybe I misunderstood your instructions.

Sorry, that message totally slipped me by.

I tried putting LS_OPTS="-w 24" into /etc/default/logstash. The CPU usage went from 80% to between 250% and 300%, although it might be able to go even faster. Is there some way to see if I succeeded in adding more workers?

The fact that the CPU usage went up is of course a good sign that the option worked, but it's possible that Logstash logs something about this too (but perhaps only with --verbose or --debug).
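
One way to look for such a log line (hypothetical paths; -f, -w, and --verbose are standard Logstash 1.x flags) is to run Logstash in the foreground with verbose logging and watch the startup output:

/opt/logstash/bin/logstash agent -f /etc/logstash/conf.d/ -w 4 --verbose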

I also had a stdout output with the rubydebug codec active in my Logstash config, along with the output to Elasticsearch. Is it possible that this took some CPU usage?

Yes, definitely. Disable any non-essential outputs.
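
For example, the output section could be reduced to just the Elasticsearch output (a sketch using the Logstash 1.x elasticsearch output; the host value is an assumption):

output {
  elasticsearch {
    host => "localhost"
  }
  # stdout { codec => rubydebug }   # debug output; serializing every event to the console costs CPU
}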