Optimize logstash tcp input plugin

Hello,

I have 10 Kubernetes clusters forwarding their logs to a Logstash VM (k8s fluentd ---> Logstash port 7000).

Logstash gets to a point where logs are being missed and the source pods keep retrying to get logs through (the errors I see in this case are listed below).

Looking for recommendations to optimize the Logstash TCP input.

Errors on Logstash

[ERROR][logstash.inputs.tcp      ] xxxxxxxxxxxxxxx/x.x.x.x:16591: closing due:
java.net.SocketException: Connection reset
        at sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:426) ~[?:?]
        at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253) ~[netty-all-4.1.65.Final.jar:4.1.65.Final]
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132) ~[netty-all-4.1.65.Final.jar:4.1.65.Final]
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350) ~[netty-all-4.1.65.Final.jar:4.1.65.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-all-4.1.65.Final.jar:4.1.65.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) [netty-all-4.1.65.Final.jar:4.1.65.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655) [netty-all-4.1.65.Final.jar:4.1.65.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581) [netty-all-4.1.65.Final.jar:4.1.65.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) [netty-all-4.1.65.Final.jar:4.1.65.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [netty-all-4.1.65.Final.jar:4.1.65.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-all-4.1.65.Final.jar:4.1.65.Final]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.65.Final.jar:4.1.65.Final]
        at java.lang.Thread.run(Thread.java:833) [?:?]

Current Input snippet

input {
      tcp {
        codec => fluent
        port => 7000
        tcp_keep_alive => true
      }
}

Changes I tried without success

  • Optimized sysctl on the Logstash VM:
    vm.max_map_count=262144
    fs.file-max=65535
    net.core.netdev_max_backlog=250000
    net.core.netdev_budget=600
    net.ipv4.tcp_mem=16777216 16777216 16777216
  • Increased ring buffers:
ethtool -g ens192
Ring parameters for ens192:
Pre-set maximums:
RX:             4096
RX Mini:        2048
RX Jumbo:       4096
TX:             4096
Current hardware settings:
RX:             4096
RX Mini:        2048
RX Jumbo:       4096
TX:             4096


What is your output?

Sometimes, if your output can't keep up with the rate of events, it will tell Logstash to back off a little, and this can cause issues like the Logstash input queue filling up.
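One way to confirm that is the Logstash monitoring API (this sketch assumes the default api.http.port of 9600 on the Logstash VM). In the events section of the pipeline stats, a queue_push_duration_in_millis that keeps climbing relative to duration_in_millis suggests the inputs are waiting on a full queue, i.e. the bottleneck is downstream of the TCP input:

# Pipeline stats: compare events.queue_push_duration_in_millis with events.duration_in_millis
curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'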

Also, what are your pipeline configurations, like workers and batch size? Are you using the default memory queues or persistent queues?


The output is listed below. For the pipeline, I didn't change anything from the defaults.

Output

output {
  if ([cluster]) {
      elasticsearch {
        hosts => ["https://xxxxxx1:9200","https://xxxxxx2:9200","https://xxxxxxx3:9200"]
        user => "admin"
        password => "xxxxxxxxxx"
        index => "%{[cluster]}-%{+YYYY-MM-dd}"
        ssl_certificate_verification => false
        ecs_compatibility => disabled
      }
  }
}

Pipeline file

- pipeline.id: k8sclusters
  path.config: "/usr/share/logstash/files/k8sclusters.conf"
  pipeline.ecs_compatibility: disabled

@leandrojmp I am using the pipeline defaults with memory queues. The VM has 10 CPUs / 64 GB memory. Not sure how to improve performance?

The default batch size for pipelines is pretty low: 125 events per batch.

You may try increasing it and see if things improve.

Adding the line pipeline.batch.size: 250 to your pipeline config would double it.

Not sure what kind of logs you are collecting, but I have some firewalls with a high event rate where I needed to use pipeline.batch.size: 1000 to solve some performance issues.
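As a sketch building on the pipelines.yml you posted above (250 is just the doubled default; the right value depends on your event rate and available heap):

- pipeline.id: k8sclusters
  path.config: "/usr/share/logstash/files/k8sclusters.conf"
  pipeline.ecs_compatibility: disabled
  pipeline.batch.size: 250    # default is 125; larger batches mean bigger bulk requests to Elasticsearch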

Should I change to persistent queues?

I don't think it will help much; I would first try changing the pipeline batch size so Logstash sends more events to Elasticsearch in each request.
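For reference, if you do want to experiment with persistent queues later, they are enabled per pipeline in pipelines.yml (or globally in logstash.yml). This is only a sketch, and queue.max_bytes is an assumption you would size to the disk you can spare:

- pipeline.id: k8sclusters
  path.config: "/usr/share/logstash/files/k8sclusters.conf"
  pipeline.ecs_compatibility: disabled
  queue.type: persisted       # default is "memory"
  queue.max_bytes: 4gb        # assumption: size this to available disk (default is 1024mb)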

I increased pipeline.batch.size from 125 --> 250 --> 500 --> 1000 and increased pipeline.workers to 12. Still no improvement. It takes 1.5-2 days after a Logstash restart to get back to this state.
