Problems receiving data from a large number of logstash-forwarders on logstash


(Garry) #1

Hi,

I have 60+ clients that I am trying to connect to my logstash instance. They are sending logs on a short interval using LSF (Lumberjack).

Versions:
Logstash 2.0
Elasticsearch 2.0
Kibana 4.2
Logstash-forwarder 0.4.0

Logstash setup:
port 5000 used to receive all LSF data.
20 workers
5g heap

input = lumberjack on port 5000
filters = basic grok and timestamp filters
output = file with filters + elasticsearch
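For reference, the setup above corresponds roughly to a config like the following (the paths, grok pattern, and SSL settings are placeholders, not my actual values):

```conf
input {
  lumberjack {
    port => 5000
    ssl_certificate => "/etc/pki/tls/certs/logstash.crt"   # placeholder path
    ssl_key => "/etc/pki/tls/private/logstash.key"         # placeholder path
  }
}

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }       # placeholder pattern
  }
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}

output {
  file { path => "/var/log/logstash/filtered.log" }        # placeholder path
  elasticsearch { hosts => ["localhost:9200"] }
}
```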

The problem:

When I start up logstash all of the clients establish their connections, but after a delay (~15 seconds, the default LSF timeout period) the majority of them change state to CLOSE_WAIT or SYN_RECV. A handful remain ESTABLISHED and continue to send data without any problems. This is usually only 5-9 connections at best that remain ESTABLISHED and process correctly (it is not always the same connections out of the 60 that remain, so I have ruled out specific LSF clients being the problem).

There is data in the RECV_Q of the CLOSE_WAIT connections. i.e. RECV_Q is not zero.
The CLOSE_WAIT and SYN_RECV connections have no PID attached (PID is '-' when running netstat -p as root) so I cannot close them manually as far as I am aware.
The SYN_RECV connections on the elk server are ESTABLISHED on the client side and have data in their SEND_Q
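As an aside, this is roughly how I am tallying the connection states; the sample below stands in for real netstat output (addresses, queue sizes, and PIDs are made up) just to illustrate the one-liner:

```shell
# Stand-in for `netstat -tanp` output on the elk server; the columns are
# Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State, PID/Program.
sample='tcp   4312      0 10.0.0.1:5000 10.0.0.21:43210 CLOSE_WAIT  -
tcp      0      0 10.0.0.1:5000 10.0.0.22:43211 ESTABLISHED 1234/java
tcp      0  16384 10.0.0.1:5000 10.0.0.21:43299 SYN_RECV    -'

# Column 6 is the TCP state; on a live box, replace `echo "$sample"` with
# `netstat -tan | grep ":5000"` to count connections per state.
echo "$sample" | awk '{states[$6]++} END {for (s in states) print s, states[s]}'
```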

The logstash log intermittently shows: message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection."

If this occurs I simply restart logstash. It does not happen consistently, and I don't notice any discernible difference between runs where this line appears in the log and runs where it does not.

Doing a tcpdump on port 5000 shows constant traffic on the port, even between logging intervals when there shouldn't be any. I am assuming it is clients attempting to reconnect.

Every connection that isn't ESTABLISHED on the main elk server usually has multiple duplicate connections showing in netstat.
e.g. one client could have 3 CLOSE_WAIT connections with data in their RECV_Q, plus a SYN_RECV.

LSF is also installed on the main elk server itself; its connection is usually in the ESTABLISHED state, but there is usually data in its SEND_Q that is not being processed.

Before LSF was rolled out to all clients it was tested using 5 clients running different OSes and worked fine consistently.

Originally I was running logstash with the default config. After increasing the heap from 500m to 5g I noticed an initial increase in ESTABLISHED connections, and again when I increased the worker count, but it normally kills off connections eventually and stabilizes around 3 ESTABLISHED.

Can logstash handle this many connections?

Any help on this would be much appreciated.

Thanks.


(Mark Walkom) #2

Are you doing all your ingestion/receipt, filtering and output in a single instance?
Because it could be something further down the chain. Adding a broker may help alleviate this.


(Thorsten Nickel) #3

You should be aware that Logstash only keeps a maximum of 20 events in its processing pipeline, after which the pipeline gets into a blocking state. To mitigate this, either optimize your processing pipeline and/or use more worker processes with the new -w command line option.

Hope this helps,
Thorsten


(Garry) #4

Hi, thanks for the replies,

@warkolm I am doing all my processing on a single instance yes. I was under the impression that I would not need a broker such as Redis if I am using LSF as it should stop sending when it is blocked but not break its connection. Do you believe that it would help?

@Thorsten_Nickel I am aware of the event cap. Right now I am using 20 workers and 5g of heap, do you believe I will need more? Or be able to support more?

Are there any reasons why there would be multiple duplicate connections for each client?

Also, with half of my ESTABLISHED connections some data gets through, but then the rest sits in the RECV_Q.
Once one client is processed it doesn't seem to release its hold on logstash.


(Thorsten Nickel) #5

Hi,

If you test your setup with an increased number of workers, say 40 or 50, and see different behaviour, then you might be on the right track.
I would also say 5G of heap for the Logstash instance should be quite good.

Hope this helps,
Thorsten


(Garry) #6

@Thorsten_Nickel Increased workers to 60. Noticed a decrease in stable ESTABLISHED connections, i.e. connections that have 0 data in RECV_Q and SEND_Q.

There was an increase in total ESTABLISHED connections with the new workers, but the majority have data in their RECV_Q that logstash isn't picking up.

Like I said, I don't think logstash is working correctly. Right now I only have 4 stable connections that have finished sending their data but logstash is ignoring all the other connections. It doesn't seem to be rotating between clients as it should.


(Mark Walkom) #7

It might be worth raising an issue on GH with as much info as possible.


(Garry) #8

@warkolm I looked into using redis as a broker, but it appears that logstash-forwarder does not support redis at this time. Is this correct?

What alternative would you recommend? I believe I do need something to take the pressure off logstash.


(Mark Walkom) #9

You can go LSF > LS > redis > LS > ES, and run both LS instances on the same host.
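A minimal sketch of what those two instances might look like (the redis host, list key, and certificate paths are assumptions, adjust to your setup):

```conf
# Shipper instance: receives from LSF and buffers events into redis.
input {
  lumberjack {
    port => 5000
    ssl_certificate => "/etc/pki/tls/certs/logstash.crt"  # placeholder
    ssl_key => "/etc/pki/tls/private/logstash.key"        # placeholder
  }
}
output {
  redis { host => "127.0.0.1" data_type => "list" key => "logstash" }
}

# Indexer instance: drains redis, does the heavy filtering, writes to ES.
input {
  redis { host => "127.0.0.1" data_type => "list" key => "logstash" }
}
filter {
  # your grok/date filters move here, off the ingestion path
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
}
```

This way the shipper does almost no work, so LSF connections are accepted quickly, and redis absorbs any backlog while the indexer catches up.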

