We found the reason behind this issue. The problem is in the configuration of the Elasticsearch Ruby client (see the class at https://github.com/elastic/elasticsearch-ruby/blob/master/elasticsearch-transport/lib/elasticsearch/transport/transport/base.rb).
When Fluentd is under heavy traffic, we send a request to Elasticsearch every second. Some of these requests time out on the Elasticsearch side, and each timeout causes one of the connections in the connection pool to be marked as dead.
The problem is how dead connections are resurrected. The "get_connection" method in "base.rb" tries to do it like this:
resurrect_dead_connections! if Time.now > @last_request_at + @resurrect_after
By default @resurrect_after is set to 60s (and the Fluentd plugin for Elasticsearch uses this value),
and @last_request_at is reset on every request in the "perform_request" method:
ensure
  @last_request_at = Time.now
end
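Under high traffic we send a request to Elasticsearch every second, and the requests that time out gradually kill connections. Because every request resets @last_request_at, the condition Time.now > @last_request_at + @resurrect_after only becomes true after 60s without any request, which never happens under this load. At some point we no longer have enough live connections, but we keep sending requests, and @resurrect_after never lets any dead connection be resurrected.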
We fixed this by forking the Fluentd Elasticsearch plugin and exposing resurrect_after as a parameter. When we set it to 0s, the client tries to resurrect dead connections on every request. This is a very aggressive setting, and we will try to find a better value based on our experiments.
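
To make the effect easier to see, here is a small toy model of the check described above. It is only a sketch and not the real client: the PoolModel class, its method bodies, the pool size of 3 and the "every 10th request times out" rule are all assumptions made up for illustration, and time is a plain integer counter so the run is deterministic. The real transport has more logic than this.

```ruby
# Toy model of the resurrection check (hypothetical, not the real client).
# Time is an integer number of seconds so the simulation is deterministic.
class PoolModel
  def initialize(size:, resurrect_after:)
    @alive = size
    @dead = 0
    @resurrect_after = resurrect_after
    @last_request_at = 0
  end

  # Mirrors the quoted logic: resurrect only if the last request is older
  # than resurrect_after, and always reset last_request_at afterwards.
  def perform_request(now, fails:)
    resurrect_dead_connections! if now > @last_request_at + @resurrect_after
    return :no_connection if @alive.zero?

    if fails
      # A timed-out request marks one connection as dead.
      @alive -= 1
      @dead += 1
      return :timeout
    end
    :ok
  ensure
    @last_request_at = now
  end

  private

  def resurrect_dead_connections!
    @alive += @dead
    @dead = 0
  end
end

# One request per second for 5 minutes; every 10th request times out
# (an assumed failure pattern, chosen only to keep the example simple).
def simulate(resurrect_after)
  pool = PoolModel.new(size: 3, resurrect_after: resurrect_after)
  results = (1..300).map { |t| pool.perform_request(t, fails: (t % 10).zero?) }
  results.each_with_object(Hash.new(0)) { |r, counts| counts[r] += 1 }
end

puts "resurrect_after=60: #{simulate(60).inspect}"
puts "resurrect_after=0:  #{simulate(0).inspect}"
```

With resurrect_after at 60 the model loses all three connections after the third timeout and every later request fails with no connection, because the 60s idle gap never occurs. With 0 the dead connection is brought back on the next request, so only the requests that actually time out fail.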