Logstash-output-elasticsearch load balancing not working when one of the nodes is down

Hi Team,

We are currently doing negative testing of an Elasticsearch multi-node cluster. Our current setup is:
node 1: Elasticsearch, Logstash, Kibana
node 2: Elasticsearch
node 3: Elasticsearch

The Logstash instance on node 1 points to all 3 nodes, as below:

elasticsearch {
    hosts => ["${ES_NODE_1}","${ES_NODE_2}","${ES_NODE_3}"]
    index => "<index-name>"
    user => "${ES_USER_NAME}"
    password => "${ES_USER_PASSWORD}"
    ssl => true
    cacert => "${ES_CERT_AUTH}"
}

We have around 20 pipelines with all kinds of inputs (lumberjack, http_poller, jdbc).

We are observing that when one of the ES nodes is brought down, the pipelines that use the http_poller input stop working (we call a set of 3 APIs every minute). The ones based on lumberjack continue to work.

We continuously get these warnings in logstash-plain.log, which is expected:

[2021-10-19T17:43:02,646][WARN ][logstash.outputs.elasticsearch][<pipeline>] Attempted to resurrect connection to dead ES instance, but got an error. {:url=>"https://ES-NODE-1:9201/", :error_type=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError, :error=>"Elasticsearch Unreachable: [https://ES-NODE-1:9201/][Manticore::SocketException] Connection refused (Connection refused)"}
[2021-10-19T17:43:02,642][WARN ][logstash.outputs.elasticsearch][<module>] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [https://es-node-1:9201/][Manticore::SocketException] Connection refused (Connection refused) {:url=>https://es-node-1:9201/, :error_message=>"Elasticsearch Unreachable: [https://es-node-1:9201/][Manticore::SocketException] Connection refused (Connection refused)", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}

We guessed that the http_poller inputs might be getting starved of OS socket connections, as all of them are being used up by the internal health check. We tried the settings below:

/etc/security/limits.conf:

logstash soft nofile 65536 
logstash hard nofile 65536

elasticsearch output plugin:

resurrect_delay => 300
retry_max_interval => 8

but still no luck.
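
For clarity, this is roughly how the two plugin settings fit into the output block. It is a sketch that reuses the placeholders from the config above; the surrounding output { } wrapper and the comments are assumptions based on our reading of the plugin option descriptions, not a verified copy of the running pipeline.

```
output {
    elasticsearch {
        hosts => ["${ES_NODE_1}","${ES_NODE_2}","${ES_NODE_3}"]
        index => "<index-name>"
        user => "${ES_USER_NAME}"
        password => "${ES_USER_PASSWORD}"
        ssl => true
        cacert => "${ES_CERT_AUTH}"
        # Seconds to wait between attempts to resurrect a host marked dead
        resurrect_delay => 300
        # Upper bound, in seconds, on the interval between bulk request retries
        retry_max_interval => 8
    }
}
```

With resurrect_delay at 300, the "Attempted to resurrect connection" warning should appear far less often per pipeline, so if it still shows up every few seconds the setting is probably not being picked up by every pipeline.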

Has anyone observed this behaviour? Is there any setting we can apply to the Logstash output plugin to ensure it continues to work with one node down?

Thanks
