Faster speed when indexing log files

simonrisberg · August 21, 2015, 11:11am

Hi!

Right now I'm indexing about 15 log files containing roughly 2 million log events in total. This takes a very long time, and I'm wondering if there is a way to speed up the process.

Best regards

Simon

magnusbaeck · August 21, 2015, 11:28am

How many events per second do you get? Are you sending them to Elasticsearch? What kind of hardware do you have? How many ES nodes? Do you have a single file input for all files? Are you saturating your CPUs or does it look like there's room to grow in that area? Have you increased the Logstash filter workers beyond the default value of one (changed with the -w startup option)? What kind of Logstash filters do you have? Any others that are likely to add latency (like dns)?

simonrisberg · August 21, 2015, 12:03pm

How many events per second do you get?

  - 1. Where do you check this?

Are you sending them to Elasticsearch?

  - 2. Yes, I am.

What kind of hardware do you have?

  - 3. I have an Intel(R) Xeon(R) CPU X5650 @ 2.67GHz with 12 cores.

How many ES nodes?

  - 4. Default (I guess you change this in the elasticsearch.yml file)

Do you have a single file input for all files?

  - 5. Yes I have a single file input for all the files.

Are you saturating your CPUs or does it look like there's room to grow in that area?

  - 6. Only 1 of the 12 cores.

Have you increased the Logstash filter workers beyond the default value of one (changed with the -w startup option)?

  - 7. No, I haven't increased the Logstash filter workers beyond the default value. How does the -w startup option work?

What kind of Logstash filters do you have?

  - 8. I only have two Logstash filters: the grok filter and the date filter.

magnusbaeck · August 21, 2015, 1:07pm

Where do you check this?

The kopf plugin can give you the current ingestion rate, but you can of course also measure the time it takes to ingest a known number of messages and divide.
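The "measure and divide" approach can be sketched in a few lines of shell. Both numbers below are hypothetical placeholders, not measurements from this thread; substitute the event count you know was ingested and the wall-clock time the run took.

```shell
# Hypothetical measurements; replace both values with your own.
events=2000000   # number of events known to have been ingested
seconds=3600     # assumed wall-clock duration of the run, in seconds

# Integer division is precise enough for a rough throughput figure.
rate=$(( events / seconds ))
echo "${rate} events/s"
```

Running the same calculation before and after a configuration change gives a simple way to tell whether the change actually helped.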

Default (I guess you change this in the elasticsearch.yml file)

No, I meant how many ES servers you run. But just one then I guess.

Yes I have a single file input for all the files.

Okay. Then I think the files will be processed serially in a single thread. If you split the files into multiple file inputs the processing should parallelize (for full effect make sure you run more than one filter worker).

How does the -w startup option work?

Pass -w N, where N is a reasonably small integer, via the LS_OPTS variable in /etc/default/logstash (Debian-based systems) or /etc/sysconfig/logstash (RPM-based systems).
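As a minimal sketch, assuming the defaults file is sourced as shell by the init script, the line could look like this. The value 4 is only an illustration and is not a recommendation from this thread.

```shell
# Fragment for /etc/default/logstash (Debian-based) or
# /etc/sysconfig/logstash (RPM-based).
# The value 4 is an assumed example; choose N to suit your core count.
# Shell assignments must not have spaces around "=".
LS_OPTS="-w 4"
```

Logstash has to be restarted for the new option to take effect.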

simonrisberg · August 21, 2015, 1:17pm

On second thought, I don't know if I have a single input for all the files. This is what my config looks like on the input side.

input {
  syslog {
    port => 5514
    codec => "json"
  }
  file {
    path => "/var/externallogs_maven/request.log.2015-06-04"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_maven/request.log.2015-06-05"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_maven/request.log.2015-06-06"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_maven/request.log.2015-06-07"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_maven/request.log.2015-06-08"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_maven/request.log.2015-06-09"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_maven/request.log.2015-06-10"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_maven/request.log.2015-06-11"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_yum/request.log.2015-06-06"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_yum/request.log.2015-06-07"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_yum/request.log.2015-06-08"
    type => "nexus-log"
    start_position => "beginning"
  }
  file {
    path => "/var/externallogs_yum/request.log.2015-06-09"
    type => "nexus-log"
    start_position => "beginning"
  }
}

simonrisberg · August 21, 2015, 2:00pm

I tried putting LS_OPTS = "-w 24" into /etc/default/logstash. The CPU usage went from 80% to between 250% and 300%, although it might be able to go even faster. Is there some way to see if I succeeded in adding more workers? I also had the rubydebug stdout output active in my Logstash config, along with the output to Elasticsearch. Is it possible that this used some CPU?

magnusbaeck · August 21, 2015, 2:07pm

No, in that case you have multiple inputs. However, with a single filter worker thread you won't be able to saturate your 12 cores.

simonrisberg · August 21, 2015, 2:13pm

How do I split the files into multiple file inputs? Do you do this in the logstash configuration?

magnusbaeck · August 21, 2015, 2:21pm

You already have multiple inputs. What I think would help is having multiple filter workers (as previously described).

simonrisberg · August 21, 2015, 2:33pm

I wrote earlier that I thought putting LS_OPTS = "-w 24" into /etc/default/logstash gave it multiple filter workers, but maybe I misunderstood your instructions.

magnusbaeck · August 21, 2015, 2:40pm

I wrote earlier that I thought putting LS_OPTS = "-w 24" into /etc/default/logstash gave it multiple filter workers, but maybe I misunderstood your instructions.

Sorry, that message totally slipped by me.

I tried putting LS_OPTS = "-w 24" into /etc/default/logstash. The CPU usage went from 80% to between 250% and 300%, although it might be able to go even faster. Is there some way to see if I succeeded in adding more workers?

The fact that the CPU usage went up is of course a good sign that the option worked, but it's possible that Logstash logs something about this too (but perhaps only with --verbose or --debug).
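One way to check independently of the logs is to look at the command line of the running process. The sketch below assumes the init script appends LS_OPTS to the Logstash command line, so that the flag shows up in the process arguments. The helper name has_worker_flag is made up for this example.

```shell
# Succeeds if the given command line contains "-w <number>".
has_worker_flag() {
  printf '%s\n' "$1" | grep -Eq -- '(^| )-w +[0-9]+( |$)'
}

# Take the first running process whose command line mentions logstash.
# The [l] trick keeps grep from matching its own command line.
cmdline=$(ps -eo args 2>/dev/null | grep '[l]ogstash' | head -n 1 || true)

if has_worker_flag "$cmdline"; then
  echo "-w flag found in the Logstash command line"
else
  echo "no -w flag found (or Logstash is not running)"
fi
```

For a view of how the load is spread, top -H -p PID (with the Logstash process ID) lists CPU usage per thread.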

I also had the rubydebug stdout output active in my Logstash config, along with the output to Elasticsearch. Is it possible that this used some CPU?

Yes, definitely. Disable any non-essential outputs.
