Hi Team,
We are currently doing negative testing of Elasticsearch multi-node clustering. Our current setup is:
node 1: Elasticsearch, Logstash, Kibana
node 2: Elasticsearch
node 3: Elasticsearch
Logstash on node 1 points to all 3 nodes, as below:
elasticsearch {
  hosts => ["${ES_NODE_1}","${ES_NODE_2}","${ES_NODE_3}"]
  index => "<index-name>"
  user => "${ES_USER_NAME}"
  password => "${ES_USER_PASSWORD}"
  ssl => true
  cacert => "${ES_CERT_AUTH}"
}
We have around 20 pipelines with all kinds of inputs (lumberjack, http_poller, jdbc).
We observe that when one of the ES nodes is brought down, the pipelines that use http_poller stop working (they call a set of 3 APIs every minute). The pipelines based on lumberjack continue to work.
We continuously get these warnings in logstash-plain.log, which is expected:
[2021-10-19T17:43:02,646][WARN ][logstash.outputs.elasticsearch][<pipeline>] Attempted to resurrect connection to dead ES instance, but got an error. {:url=>"https://ES-NODE-1:9201/", :error_type=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError, :error=>"Elasticsearch Unreachable: [https://ES-NODE-1:9201/][Manticore::SocketException] Connection refused (Connection refused)"}
[2021-10-19T17:43:02,642][WARN ][logstash.outputs.elasticsearch][<module>] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [https://es-node-1:9201/][Manticore::SocketException] Connection refused (Connection refused) {:url=>https://es-node-1:9201/, :error_message=>"Elasticsearch Unreachable: [https://es-node-1:9201/][Manticore::SocketException] Connection refused (Connection refused)", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
We guessed that the http_poller inputs might be starved of OS socket connections because all of them are being used up by the internal health checks. We tried the settings below.
/etc/security/limits.conf:
logstash soft nofile 65536
logstash hard nofile 65536
elasticsearch output plugin:
resurrect_delay => 300
retry_max_interval => 8
Still no luck.
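For reference, assuming resurrect_delay and retry_max_interval are both set inside the same elasticsearch output shown earlier, the combined block would look roughly like this sketch (both values are in seconds):

```
elasticsearch {
  hosts => ["${ES_NODE_1}","${ES_NODE_2}","${ES_NODE_3}"]
  index => "<index-name>"
  user => "${ES_USER_NAME}"
  password => "${ES_USER_PASSWORD}"
  ssl => true
  cacert => "${ES_CERT_AUTH}"
  # Wait 300 s between attempts to reconnect to a node marked dead
  resurrect_delay => 300
  # Cap the back-off between bulk request retries at 8 s
  retry_max_interval => 8
}
```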
Has anyone observed this behaviour? Is there any setting we can apply to the Logstash elasticsearch output plugin to ensure it continues to work with one node down?
Thanks