Hi, we have an ELK stack running in a Kubernetes cluster (the ES nodes are EC2 instances; everything else runs as pods). The other Kube clusters send traffic to this ELK stack over TLS via a logstash-relay pod running in each cluster. RabbitMQ is a couple of clustered pods, not in HA mode. All logstash pods run the official Docker image docker.elastic.co/logstash/logstash:5.6.5 with a couple of customizations: we add a SumoLogic output plugin, and we change the Java security setting networkaddress.cache.ttl to 60 so the JVM doesn't cache DNS lookups FOREVER.
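Roughly, the image customization looks like this (a sketch from memory, not our exact Dockerfile; the java.security path and the assumption that JAVA_HOME is set in the base image are the parts most likely to differ):

FROM docker.elastic.co/logstash/logstash:5.6.5

# Add the SumoLogic output plugin.
RUN /usr/share/logstash/bin/logstash-plugin install logstash-output-sumologic

# Cap JVM DNS caching at 60 seconds. Appended last so it overrides the
# commented-out default; adjust the path to wherever the image's JRE
# keeps java.security.
USER root
RUN echo "networkaddress.cache.ttl=60" >> "$JAVA_HOME/jre/lib/security/java.security"
USER logstash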
The basic flow of messages through our system is:
- The logstash-relay pod in each of the other Kube clusters accepts JSON and forwards it to the logstash-ingester in the KubeELK cluster.
- The logstash-ingester simply accepts messages and pushes them into a RabbitMQ queue.
- A logstash queue runner pops messages off that queue, does processing and filtering, then delivers each message to one or more "destination" RabbitMQ queues.
- Each destination queue has its own queue runner that sends to ES, to SumoLogic, to FooService, etc.
logstash-ingester -> queue_ingester
logstash-queue from queue_ingester -> queue_es or queue_sumologic or queue_pagerduty
logstash-es-post from queue_es -> ES
logstash-sumologic-post from queue_sumologic -> SumoLogic
logstash-fooservice-post from queue_fooservice -> FooService
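For concreteness, the ingester pipeline is conceptually like this (an illustrative sketch, not our exact config; the port, codec, and exchange type are placeholders, and the TLS settings are omitted):

input {
  # Accept newline-delimited JSON from the per-cluster relay pods.
  tcp {
    port  => 5043          # illustrative port
    codec => json_lines
  }
}
output {
  # Publish everything onto the ingest queue.
  rabbitmq {
    host          => "${RABBITMQ_SERVER_BLUE}"
    exchange      => "${RABBITMQ_EXCHANGE}"
    exchange_type => "direct"
    key           => "queue_ingester"
    durable       => true
    persistent    => true
    user          => "foo"
    password      => "bar"
  }
}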
In Kube, it's common for a pod to die for any number of reasons: node scale-down, rolling upgrade of nodes, a memory issue in the pod, high load causing LivenessProbe failures, etc. When a RabbitMQ pod dies, Kube reschedules it, usually onto a different node, which gives the pod a different IP address.
Any logstash queue runner using a queue on that RabbitMQ pod sees the connection die and logs that it will reconnect, but it tries only once, logs nothing further, and that logstash pod hangs and never recovers. Here are some logs:
[2018-02-19T02:31:05,205][WARN ][logstash.inputs.rabbitmq ] RabbitMQ connection was closed! {:url=>"amqp://foo:XXXXXX@rabbitmq.dev-blue.svc.cluster.local:5672/", :automatic_recovery=>true, :cause=>com.rabbitmq.client.ShutdownSignalException: connection error}
[2018-02-19T04:43:53,413][WARN ][com.rabbitmq.client.impl.ForgivingExceptionHandler] An unexpected connection driver error occured (Exception message: Connection reset)
Caught exception when recovering queue queue_ingester
Caught exception when recovering consumer amq.ctag-ZX7_xgmrmnozhnU6Ba2jVQ
[2018-02-19T04:43:58,432][WARN ][logstash.inputs.rabbitmq ] RabbitMQ connection was closed! {:url=>"amqp://foo:XXXXXX@rabbitmq.dev-blue.svc.cluster.local:5672/", :automatic_recovery=>true, :cause=>com.rabbitmq.client.ShutdownSignalException: connection error}
Nothing further appears in the logs. I've waited as long as a whole weekend on low-volume queues, and through 3-6 million queued messages (a few hours) on higher-volume queues.
Here's my rabbitmq input config:
input {
  rabbitmq {
    exchange => "${RABBITMQ_EXCHANGE}"
    queue    => "${RABBITMQ_QUEUE}"
    durable     => true   # Store messages when consumers are gone
    auto_delete => false  # If true, the queue is deleted when logstash exits... a bad choice for logging
    host => "${RABBITMQ_SERVER_BLUE}"
    key  => "${RABBITMQ_QUEUE}"
    subscription_retry_interval_seconds => 5
    threads => 1
    connection_timeout => 10000   # milliseconds
    user     => "foo"
    password => "bar"
  }
}
I do configure it to retry the subscription every 5 seconds, but that setting only matters once a connection to RabbitMQ has been established. So I also configure each connection attempt to be marked as failed after 10 seconds. By my understanding, it should therefore be attempting to reconnect roughly every 10 seconds. Assuming the new pod has a different IP but the same hostname, it should be able to reconnect within about 2 cycles (RabbitMQ is pretty quick to start up).
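Spelled out, my reading of those two settings (my understanding of the plugin's behavior, not verified against its source):

# connection_timeout => 10000
#   give up on a single connection attempt after 10 s
# subscription_retry_interval_seconds => 5
#   wait 5 s between re-subscription attempts once a connection exists
#
# So one retry cycle should be bounded by roughly 10-15 s, and a RabbitMQ
# pod that comes back (same name, new IP) within ~30 s should be
# reachable again in about 2 cycles.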
All I have to do to recover is delete the logstash pods; the replacement pods connect right up to the RabbitMQ pod at its new IP and everything works fine.
Can anybody spot a misconfiguration or omission in my input configuration that would explain why logstash tries to reconnect to RabbitMQ once and then never tries again?
Or is it only ever trying to connect to the IP address it first resolved for rabbitmq? I specifically use a hostname, expecting it to be looked up on each attempt; that's why I set the Java security setting to a 60-second cache instead of infinite. I expected that to prevent the IP from being cached forever.
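For reference, this is the exact property, appended to the image's java.security file at build time (see the Dockerfile sketch above):

# Cache successful JVM DNS lookups for 60 seconds rather than the
# with-security-manager default of forever.
networkaddress.cache.ttl=60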
Any and all suggestions are appreciated. I'm out of ideas to try.