Advice needed: multiple Logstash instances vs. a single one

Hi, I am trying to scan multiple folder paths for dev, qa, prod, etc. (maybe 45 paths, something like /var/tmp/xyz-dev-abc/*.logs or /var/tmp/xyz-qa-abc/*.logs).
I have 15 such different environments (dev, qa, qa1, etc.) and 3 such paths per env,
so I can either have 15 Logstash instances catering to 3 paths each,
or 1 Logstash instance catering to all 45 paths (with a type field set per environment).
Which is better in terms of performance? It would be much easier for me to have 1 Logstash instance, but will it make Logstash slow since it has to scan 45 folders instead of 3? Note that if 15 Logstash instances run, they will all need to run on the same host, as the paths are not on NFS but under /var/tmp.

Thanks
Andy

Why would the number of directories to scan differ depending on how many Logstash instances you're running? In the end it'll boil down to the same number of filename patterns in both cases, right?

Running 15 Logstash instances sounds insane and will waste a fair amount of RAM (plus you'll have administrative overhead). You may want to increase the number of filter workers with the -w startup option, but otherwise I don't see why a single instance would result in significantly worse performance.

Maybe I was not clear: each Logstash instance will have a different set of paths to scan (3 in this case),
i.e. a different config file.
Whereas if it were 1 Logstash instance, it would have 1 config in which there would be 45 paths, like:

file { type => "dev Grid" path => "/var/tmp/xyz-dev*-abc/*.log" }
file { type => "qa Grid" path => "/var/tmp/xyz-qa*-abc/*.log" }

etc.

Thanks for your input.

Okay, but in the end you'll still have the same number of patterns to scan? I don't see why it would matter whether a set of configuration files is read by a single instance or if each file is read by a separate instance.

It all depends on how Logstash is implemented: does it scan paths on multiple threads so as to achieve parallelism, or does it scan them serially? The files in each path will grow pretty fast (around 60 lines/sec in my case). By performance I mean the time difference between indexing in Elasticsearch (the _timestamp field) and the time the log line was written to the file. Maybe the best way is to test it, but I was wondering if anyone has already done this and has some benchmark data?

Oh, you're talking about the tail operation itself rather than the filename pattern scan. Sorry.

Files selected by a single file input are scanned sequentially but each file input is run in a separate thread. I don't know the code that well, but I've observed that with multiple files in the same input plugin Logstash seems to read everything it can from file N until it gets EOF, then it continues with file N+1. If N grows too fast it could basically starve out the reading of the rest of the files.

But yes, experiment. And I suggest you start with a single Logstash instance with multiple file inputs. You probably don't need one input per file, but you don't want to put them all in a single file input either.
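
For example, a minimal sketch of that middle ground, reusing the example patterns from earlier in the thread (one file input per environment, so each environment gets its own reader thread):

input {
  # One file input per environment; each input runs in its own thread,
  # so a busy dev grid doesn't starve the qa reader.
  file { type => "dev Grid" path => "/var/tmp/xyz-dev*-abc/*.log" }
  file { type => "qa Grid"  path => "/var/tmp/xyz-qa*-abc/*.log" }
}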

OK, so you are saying that in the above example, for each file type I will have a separate thread (a separate thread for dev and qa)? My requirement is that if dev is really busy, I don't want qa to be blocked.

Hi,
If I understand your requirement correctly, you have different environments, and in each environment you want to read 3 files and send the data to some output.
If you look at the Logstash architecture, for each file we configure there will be a dedicated thread which reads the data and puts it into a common queue.
The performance of your Logstash depends on how fast you are draining the queue, i.e. how fast the output takes data from the queue and hands it over to the configured destination.
Assume you configure an elasticsearch output: if for any reason the Elasticsearch instance cannot handle the rate at which data is read from the files, the queue will become full and the reading threads will wait until the queue has space for new entries.
Hope I was able to answer.

Hmm, that makes sense in my case. Yes, all these envs (dev, qa, etc.) will point to different Elasticsearch clusters. However, my question remains: in the above example config, where 2 different file statements are present (under one input block), will they be executed in 2 different threads with 2 different queues? (If Logstash uses 1 queue for all input files, then if one input is busy it will throttle the other input thread.) This is the reason I ask whether a separate Logstash process helps, since in that case I know for sure that it will be a separate input thread and a separate queue feeding the filters/output.

As per my knowledge, the queue is common for all input threads,
so if for any reason the queue is full, it will impact all the inputs.
Can someone please confirm that Logstash uses a common queue for all input threads?
If this is the case, then you need to have a separate Logstash instance per input, each running independently, so there is no impact.

Regards,
Ravi

Yes, Logstash uses a single event pipeline per instance. A single blocked output will block the whole pipeline. If that's not acceptable you can run multiple instances, or have a single instance that passes events to a (single) broker. You can then have environment-specific Logstash instances that pull messages from that broker (one queue per environment), each feeding its respective Elasticsearch cluster.
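
A minimal sketch of that broker setup, assuming Redis as the broker (the host name and list keys below are made up for illustration):

# Shipper instance: tails all the files and routes each event to a
# per-environment Redis list based on its type field.
output {
  redis {
    host => "broker.example.com"
    data_type => "list"
    key => "%{type}"
  }
}

# Per-environment instance (one per Elasticsearch cluster): reads only
# its own list, so a slow cluster blocks only its own pipeline.
input {
  redis {
    host => "broker.example.com"
    data_type => "list"
    key => "dev Grid"   # matches the type set by the dev file input
  }
}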

Thanks everyone for the input. In that case, I guess the conclusion is that it's fair to base the number of Logstash instances on the number of outputs (i.e. Elasticsearch clusters in this case) rather than the number of inputs (files in this case), which is what I was thinking earlier.

In my case I have 8 servers which send input to Logstash. Each server writes on average 600 lines per second, and I am maintaining 3 combined data & master nodes. How many Logstash instances do I need here?

The Logstash throughput depends on things like the version of Logstash, how powerful the machine is, and what kind of filters you have. You'll get the best results if you measure yourself on your machines and with your workload.

input {
  file {
    path => "/home/monika/cdr*.log"
  }
}

filter {
  mutate { split => { "message" => "|" } }
  ruby {
    code => "event['Actual Date/Time'] = event['message'][0]
             ......
             event['MVNO-ID'] = event['message'][82]
             event['message'] = 'dummy field'"
  }
  grok {
    overwrite => [ "message" ]
  }
}

output {
  stdout {}
  elasticsearch {
    hosts => ["172.16.15.153:9200", "172.16.5.115:9200"]
    index => "test_db"
  }
}

This is my conf file. Can I optimise it in any way? I have 82 fields in total in one line.

I am getting a pipe-separated log; I am using split and saving the events.
Can you guide me on how to test that? As per the discussion, the speed depends on both the input and the output. How do I judge whether the problem is my Logstash filters' speed or Elasticsearch's speed? I am maintaining 2 nodes (both master & data). Any suggestions?

If you want to measure the performance of software component X you'll want to remove everything that isn't related to X. In this case, disable the ES output and just dump the events to a file. If that's fast enough it's likely that Logstash isn't your bottleneck. The metrics filter can be useful for measuring throughput.
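
A minimal sketch of such a test, using the metrics filter and a file output instead of Elasticsearch (the output path here is made up; adjust the meter name as you like):

filter {
  # Meter every event that passes through the pipeline; the filter
  # periodically emits a tagged "metric" event carrying the measured rates.
  metrics {
    meter => "events"
    add_tag => "metric"
  }
}

output {
  if "metric" in [tags] {
    # Print events/second so you can watch the throughput.
    stdout { codec => line { format => "rate: %{[events][rate_1m]}" } }
  } else {
    # Dump everything else to a flat file instead of Elasticsearch.
    file { path => "/tmp/throughput-test.log" }
  }
}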