I'm running a Logstash instance which is connected to an ES cluster behind a load balancer.
The load balancer has an idle timeout of 5 minutes.
Logstash is configured with the ES url corresponding to the loadbalancer ip.
Normally everything works fine, but what happens is that after a period of requests inactivity, the next request processed by LS goes in error with the following:
[2018-10-30T08:15:00,757][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://10.100.24.254:9200/][Manticore::SocketTimeout] Read timed out {:url=>http://10.100.24.254:9200/, :error_message=>"Elasticsearch Unreachable: [http://10.100.24.254:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
[2018-10-30T08:15:00,759][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://10.100.24.254:9200/][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>2}
[2018-10-30T08:15:02,760][WARN ][logstash.outputs.elasticsearch] UNEXPECTED POOL ERROR {:e=>#<LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError: No Available connections>}
[2018-10-30T08:15:02,760][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch, but no there are no living connections in the connection pool. Perhaps Elasticsearch is unreachable or down? {:error_message=>"No Available connections", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError", :will_retry_in_seconds=>4}
[2018-10-30T08:15:05,651][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://10.100.24.254:9200/, :path=>"/"}
LS eventually recovers, but it takes more than 1 min and this is not acceptable for our SLA.
I suspect that's due to the loadbalancer closing the connections after 5 min of inactivity.
I've tried setting:
timeout => 3
which makes things better. The request is retried after 3 secs, but this is still not good enough.
What's the best set of configuration options that I can use to make sure the connections are always healthy and working before the requests are attempted and so I experience no delay at all?