A large number of Logstash inputs

(ChiBi PonD) #1

I would like to send 30k events per second from each machine to Elasticsearch. I have set up 5 Elasticsearch master nodes. Since I want to collect events from all 20 machines at a combined rate of 600k per second, each machine needs to deliver 30k per second. Any suggestions? The config that I tried is as follows.

 input {
     exec {
         command => "echo hello1 $(date +'%d/%m/%Y %H:%M:%S:%3N') &"
         interval => 30
         type => "loadavg1"
     }
     exec {
         command => "echo hello2 $(date +'%d/%m/%Y %H:%M:%S:%3N') &"
         interval => 30
         type => "loadavg2"
     }
     # ... exec blocks 3 through 29999 follow the same pattern ...
     exec {
         command => "echo hello30000 $(date +'%d/%m/%Y %H:%M:%S:%3N') &"
         interval => 30
         type => "loadavg30000"
     }
 }
 output {
   elasticsearch {
     host => "test01"
     cluster => "ccm_elasticsearch" # this matches our elasticsearch cluster.name
     protocol => "http"
   }
 }

Thank you in advance

(Magnus Bäck) #2

If you want to test the ability to generate and process 30k events/s, having 30k exec inputs that could potentially attempt to fork off a process at the same time is a really bad idea. Well, it's a bad idea regardless. How will the events be generated when you actually deploy this?

(Besides, 30k events every 30 seconds is just 1k events/second so I'm not sure what you're even trying to do here.)
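
To compare the rate the posted config would produce with the stated goal, here is a small Python sketch of the arithmetic. The figures come from the thread; the function and variable names are only illustrative, and it assumes every exec input emits exactly one event per interval.

```python
# Figures taken from the thread; names are illustrative only.
TARGET_TOTAL_EVENTS_PER_SEC = 600_000   # desired combined rate
MACHINES = 20
EXEC_INPUTS = 30_000                    # exec blocks in the posted config
EXEC_INTERVAL_SEC = 30                  # each exec runs once per interval


def per_machine_target(total_per_sec: int, machines: int) -> float:
    """Events per second each machine must ship to reach the total."""
    return total_per_sec / machines


def exec_config_rate(inputs: int, interval_sec: int) -> float:
    """Events per second if every exec input emits one event per interval."""
    return inputs / interval_sec


target = per_machine_target(TARGET_TOTAL_EVENTS_PER_SEC, MACHINES)
actual = exec_config_rate(EXEC_INPUTS, EXEC_INTERVAL_SEC)
print(f"target per machine: {target:.0f} events/s")   # 30000 events/s
print(f"exec config yields: {actual:.0f} events/s")   # 1000 events/s
print(f"shortfall factor:   {target / actual:.0f}x")  # 30x
```

So even if all 30k exec inputs ran without trouble, the config would fall well short of 30k events/s per machine.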

(ChiBi PonD) #3

In the real deployment I plan to have one process, or up to 50 processes, writing values to files, and then configure Logstash to read from those log files (30,000 values per machine). I am trying to find the best way to push values at a rate of 600k per second. I agree with you that 30k exec inputs is a bad idea. However, I don't yet know the best way to push 600,000 values per second from all 20 machines, that is, 30k events per second per machine. Please suggest a solution.

(Magnus Bäck) #4

How about modelling what you're actually trying to do, i.e. having Logstash read 50 files (probably best with more than one file input to improve parallelism) to which other processes (imitating your real event sources) are writing data?
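
A minimal Python sketch of such a test harness might look like the following. It is an assumption-laden stand-in, not a description of the real event sources: threads play the role of the separate writer processes, the file names (`output1.log` and so on) and the line format are hypothetical, and 50 files with 600 lines each simply adds up to 30,000 values.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import List


def write_events(path: str, count: int) -> int:
    """Append `count` lines to `path`, one event per line."""
    with open(path, "a", encoding="utf-8") as f:
        for i in range(count):
            f.write(f"value {i}\n")  # hypothetical event format
    return count


def run_writers(directory: str, n_files: int, events_per_file: int) -> List[str]:
    """Run one writer per file concurrently and return the file paths."""
    paths = [os.path.join(directory, f"output{n}.log") for n in range(1, n_files + 1)]
    with ThreadPoolExecutor(max_workers=n_files) as pool:
        # list() waits for every writer and re-raises any exception.
        list(pool.map(lambda p: write_events(p, events_per_file), paths))
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        written = run_writers(tmp, n_files=50, events_per_file=600)
        print(len(written), "files written")
```

In a real test the directory would be one that Logstash's file inputs are watching rather than a temporary directory that is deleted on exit.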

(ChiBi PonD) #5

What I am actually trying now is:

  1. One process pushes one value at a time to a single log file until it reaches 30k values (for now the process just generates numbers in a for loop).
  2. 50 processes push values to their own log files (one log file per process). The Logstash config file has one file input per log file, e.g. output1, output2, ..., output50.

These are the two approaches I'm testing now, but it still takes more than 1 second. To be more specific, I get only 90k values per second. Maybe my processes are what is limiting the rate, so I tried exec inputs, which is a bad idea. I don't know how to push values faster to increase the rate. My current thinking is that all 30k values should be pushed at the same time. Thank you, Mr. magnusbaeck, for helping me.
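
One way to check whether the writer processes themselves are the bottleneck is to time a writer on its own, with Logstash out of the picture. The Python sketch below is only an illustration under assumptions: the function name, the file name, and the one-number-per-line format are made up, and it measures a single writer rather than 50.

```python
import os
import tempfile
import time


def measure_write_rate(path: str, count: int) -> float:
    """Write `count` lines to `path` and return the achieved lines per second."""
    start = time.perf_counter()
    with open(path, "w", encoding="utf-8") as f:
        for i in range(count):
            f.write(f"{i}\n")
        f.flush()
        os.fsync(f.fileno())  # include the time needed to reach the disk
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        rate = measure_write_rate(os.path.join(tmp, "output1.log"), 30_000)
        print(f"writer alone: {rate:,.0f} lines/s")
```

If the writer alone comfortably exceeds the target rate, the limit is more likely in Logstash or Elasticsearch than in the generating processes.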

(Magnus Bäck) #6

Are you saturating your CPUs, or what seems to be the limiting factor? I can totally imagine that a single Logstash instance might have problems pushing more than 90k events/s.

(ChiBi PonD) #7

The CPUs are sometimes saturated. Memory is still free. Is it possible to run multiple Logstash instances? For example, run the first instance for the first 25 log files and the second instance for the remaining 25 log files.

(Mark Walkom) #8

You can use -w to tell Logstash to spawn more workers, but that will only help if you have spare CPU.

(Magnus Bäck) #9

Is it possible to run multiple Logstash instances? For example, run the first instance for the first 25 log files and the second instance for the remaining 25 log files.

Yeah, sure.
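
As a rough illustration of the split, the Python sketch below divides the 50 file names mentioned earlier in the thread (output1 to output50) into contiguous groups, one per instance. The helper name `split_files` is hypothetical; each group would then go into the `path` setting of the file input in that instance's own config.

```python
from typing import List


def split_files(paths: List[str], instances: int) -> List[List[str]]:
    """Split `paths` into `instances` contiguous groups of near-equal size."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    base, extra = divmod(len(paths), instances)
    groups: List[List[str]] = []
    start = 0
    for i in range(instances):
        size = base + (1 if i < extra else 0)  # spread any remainder
        groups.append(paths[start:start + size])
        start += size
    return groups


log_files = [f"output{n}" for n in range(1, 51)]
first, second = split_files(log_files, 2)
print(len(first), first[0], first[-1])     # 25 output1 output25
print(len(second), second[0], second[-1])  # 25 output26 output50
```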

(ChiBi PonD) #10

Is this different from running Logstash 2 times?

(Mark Walkom) #11

You mean two instances?

Yes and no. If you run two instances you get more throughput, but at the cost of extra memory because two JVMs are running. Running more workers will be more efficient in that regard, as they are all managed by the one JVM.
