Syslog load balancing in front of Logstash?

I have a lot of syslog traffic flowing from multiple sources into four Logstash nodes and then on to Elasticsearch. At the moment I am just using DNS round robin to spread the inbound syslog traffic across the Logstash nodes.

I want to use something like HAProxy or NGINX to load balance the connections and provide failover with keepalived. I have both running, but with each one I see rsyslog's omfwd output losing its connection and dropping records. Neither setup feels stable. The issue seems worse with HAProxy but is present with both load balancing solutions.

I should also mention that each environment (dev, test, qa, prod) uses a different pipeline port. This works fine with either solution. We also have systems sending both RFC 3164 and RFC 5424 formatted messages.
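
The per-environment split is just separate entries in pipelines.yml, roughly like this (the pipeline IDs and paths here are illustrative, not our exact layout):

- pipeline.id: tcp-syslog-prod
  path.config: "/etc/logstash/conf.d/prod/*.conf"
- pipeline.id: tcp-syslog-qa
  path.config: "/etc/logstash/conf.d/qa/*.conf"
- pipeline.id: tcp-syslog-test
  path.config: "/etc/logstash/conf.d/test/*.conf"
- pipeline.id: tcp-syslog-dev
  path.config: "/etc/logstash/conf.d/dev/*.conf"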

I am looking for example configurations and advice on which way folks are going, and I am open to other ideas. Our LTM is off limits, so I have to go with these solutions or something else. I tried a ring buffer in HAProxy but could not make it work properly, since we have both RFCs in play. Admittedly, that might just have been my lack of understanding.
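
For what it's worth, the ring buffer attempt was based on HAProxy's ring and log-forward sections, roughly like the sketch below (simplified from the HAProxy documentation; I never got it working with our mixed RFC 3164/5424 traffic, so treat it as an illustration rather than a working config):

ring logstash_ring
    description "buffer for forwarded syslog"
    format rfc5424
    maxlen 1200
    size 32764
    timeout connect 5s
    timeout server 10s
    server logstash-p01 logstash-p01.example.com:5066 log-proto octet-count

log-forward syslog_in
    bind *:5066
    dgram-bind *:5066
    log ring@logstash_ring local0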

What issues are you having?

I do not use rsyslog, but I have Logstash behind an HAProxy LB without any issues.

What does your Logstash configuration look like? And your HAProxy configuration?

Good morning!

Thank you for the reply. I believe I am close, but I still cannot figure it out.

The Logstash side has custom pipelines with separate ports for our specific syslog needs. They work great when we point our syslog sources either directly at the Logstash hosts or at a DNS round-robin name. For the HAProxy solution, I am using keepalived for failover between two HAProxy servers with identical configurations, so the syslog sources point at the keepalived VIP.

The issue is with HAProxy, and more so with rsyslog, where I get the following error and subsequent data loss.

rsyslogd: omfwd: TCPSendBuf error -2027, destruct TCP Connection to logstash-vip.example.com:5066 [v8.2001.0 try https://www.rsyslog.com/e/2027 ]
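
The sending side is a plain omfwd action. A minimal sketch of it, sanitized, is below; the keepalive and queue parameters are additions I have been experimenting with to limit data loss, not our exact production settings or a known-good fix:

# /etc/rsyslog.d/60-forward-logstash.conf (sanitized)
action(
    type="omfwd"
    target="logstash-vip.example.com"
    port="5066"
    protocol="tcp"
    KeepAlive="on"                  # sender-side TCP keepalive probes
    action.resumeRetryCount="-1"    # keep retrying instead of discarding the action
    queue.type="LinkedList"         # in-memory queue for this action...
    queue.filename="fwd_logstash"   # ...spooled to disk while the target is unreachable
    queue.saveOnShutdown="on"
)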

Here is an example of one of the Logstash pipelines.

input {
  tcp {
    port => 5066
    type => syslog
  }
  udp {
    port => 5066
    type => syslog
  }
}

filter {
  if ([message] =~ /^$/) {
    drop {  }
  }

  mutate {
    add_field => { 
      "[logstash][pipeline_id]" => "tcp-syslog-prod"
      "[logstash][hostname]" => "${HOSTNAME}"
      "[example][env]" => "prod" 
    }
  }

  ruby { 
    code => "event.set('logstash_timestamp', Time.now())"
  }
  
}

output {
  elasticsearch { 
    hosts => [ 
      "https://ingest:9200", 
      "https://ingest:9200", 
      "https://ingest:9200", 
      "https://ingest:9200", 
      "https://ingest:9200", 
      "https://ingest:9200"
    ] 
    user => "elastic"  
    password => "${elasticsearch.pwd}" 
    manage_template => false
    data_stream => true
    data_stream_type => "logs"
    data_stream_dataset => "syslog"
    data_stream_namespace => "prod"
    pipeline => "syslog"
    ecs_compatibility => "v1"
  }
}

This is the keepalived configuration from the master HAProxy node. Keepalived works as expected and will failover properly between nodes.

global_defs {
  vrrp_version 3
}

# Script to check whether HAProxy is running
vrrp_script check_haproxy {
  script "/usr/bin/killall -0 haproxy"
  interval 2
  weight -60
}


vrrp_instance instance1 {
  state MASTER
  interface ens160
  priority 100
  virtual_router_id 1
  advert_int 1
  virtual_ipaddress {
    10.161.x.x/22
  }

  track_script {
    check_haproxy
  }
}

Here is my sanitized haproxy.cfg. Again, I have seen people using a ring buffer (as in the sketch in my first post), but I am completely confused by that aspect.

global
    log /dev/log local0
    log-tag haproxy
    chroot /var/lib/haproxy
    user haproxy
    group haproxy
    daemon

defaults
    log     global
    mode    tcp
    option  tcplog
    timeout connect 5000ms
    timeout client  50000ms
    timeout server  50000ms

frontend frontend_5063
    bind *:5063
    mode tcp
    default_backend backend_5063

frontend frontend_5064
    bind *:5064
    mode tcp
    default_backend backend_5064

frontend frontend_5065
    bind *:5065
    mode tcp
    default_backend backend_5065

frontend frontend_5066
    bind *:5066
    mode tcp
    default_backend backend_5066

backend backend_5063
    mode tcp
    balance roundrobin
    server logstash-p01_5063 logstash-p01.example.com:5063 check
    server logstash-p02_5063 logstash-p02.example.com:5063 check
    server logstash-p03_5063 logstash-p03.example.com:5063 check
    server logstash-p04_5063 logstash-p04.example.com:5063 check

backend backend_5064
    mode tcp
    balance roundrobin
    server logstash-p01_5064 logstash-p01.example.com:5064 check
    server logstash-p02_5064 logstash-p02.example.com:5064 check
    server logstash-p03_5064 logstash-p03.example.com:5064 check
    server logstash-p04_5064 logstash-p04.example.com:5064 check

backend backend_5065
    mode tcp
    balance roundrobin
    server logstash-p01_5065 logstash-p01.example.com:5065 check
    server logstash-p02_5065 logstash-p02.example.com:5065 check
    server logstash-p03_5065 logstash-p03.example.com:5065 check
    server logstash-p04_5065 logstash-p04.example.com:5065 check

backend backend_5066
    mode tcp
    balance roundrobin
    server logstash-p01_5066 logstash-p01.example.com:5066 check
    server logstash-p02_5066 logstash-p02.example.com:5066 check
    server logstash-p03_5066 logstash-p03.example.com:5066 check
    server logstash-p04_5066 logstash-p04.example.com:5066 check

# Status page configuration
frontend stats
    bind *:8080
    mode http
    stats enable
    stats uri /haproxy_stats
    stats refresh 10s

There is a long thread on GitHub about this behavior with rsyslog and multiple tools; I don't think it is restricted to HAProxy.

It seems that the issue comes from the connection being closed between messages, forcing rsyslog to start a new connection. On that thread, some people solved it by enabling keepalive on the receiver side.

There is not much you can do on the Logstash side, but there is an option to enable TCP keepalive; you can try adding tcp_keep_alive => true to your tcp input to see if it helps.
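
For example, something like this; the only change from your pipeline is the tcp_keep_alive line:

input {
  tcp {
    port => 5066
    type => syslog
    tcp_keep_alive => true   # ask the OS to send TCP keepalive probes on idle connections
  }
  udp {
    port => 5066
    type => syslog
  }
}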

This is not an issue with Logstash itself, so I am not sure you will get much help here since this forum is focused on the Elastic tools, but maybe someone has had a similar scenario and was able to solve it.

Thank you for your help. I do see that there may be an issue with rsyslog. I looked at HAProxy, keepalived, and even Logstash, but nothing stood out.

I am going to try syslog-ng and see if that helps the situation.
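
For anyone following along, the syslog-ng forwarding I plan to test looks roughly like this (untested so far; s_sys stands in for whatever local source is already defined, and the disk-buffer values are just a starting point):

destination d_logstash {
    network(
        "logstash-vip.example.com"
        port(5066)
        transport("tcp")
        disk-buffer(
            disk-buf-size(1073741824)   # spool up to 1 GiB while the connection is down
            mem-buf-size(10485760)
            reliable(yes)
        )
    );
};

log {
    source(s_sys);
    destination(d_logstash);
};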