Elasticsearch output with multiple hosts but no fault tolerance

Hi Elastic team,
We've designed an Elasticsearch cluster with 3 nodes, and there are 2 independent Logstash instances that ingest data into this cluster. Sometimes one of the cluster's nodes goes down due to high load and can't be restarted automatically. This causes the Logstash instances to start logging error messages about that node being unavailable, and data ingestion stops.

My question: isn't it sufficient for fault tolerance purposes to set an array of hosts for the Elasticsearch output, as below?

hosts => ["https://a.b.c.x:9200", "https://a.b.c.y:9200", "https://a.b.c.z:9200"]

Setting an array of hosts should be enough, as Logstash will load balance between them. It will show errors for the node that is down, but will keep sending to the others.
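For reference, a minimal sketch of what such an output block might look like. The `hosts` array is the one from the question; the three tuning options are included only as an illustration, and the values shown are assumptions that, as far as I know, match the plugin's documented defaults. Please check them against the documentation for your plugin version before relying on them.

```
output {
  elasticsearch {
    # Host list from the question; requests are balanced across these nodes.
    hosts => ["https://a.b.c.x:9200", "https://a.b.c.y:9200", "https://a.b.c.z:9200"]

    # Assumed values (believed to be the defaults), shown for illustration only.
    # Seconds to wait before the first retry of a failed bulk request.
    retry_initial_interval => 2
    # Upper bound, in seconds, for the retry interval as it backs off.
    retry_max_interval => 64
    # How often, in seconds, to check whether a host marked as down is back.
    resurrect_delay => 5
  }
}
```

With a block like this, a node that stops responding should only produce warnings or errors for that host while events keep flowing to the remaining ones.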

Can you share any logs when this happens?

As you suggested, I checked the logs again, and they match what you described for when a node is down. Yes, the logs are just warnings (and some info messages) about connectivity issues with that node, and the pipelines continue to work with the other hosts!

Just one more question: when a node goes down, in the first minutes after the connection loss, there are some error logs about bulk request failures for the failed node. It's absolutely normal to see these errors, because some in-flight events were sent to that node. But how does Logstash behave in these scenarios? Does it retry sending those actions to the same failed node, or does it use a fault tolerance mechanism and retry on the other hosts provided?

The error logs are as follows:

[2024-06-03T08:40:39,627][ERROR][logstash.outputs.elasticsearch][ALL_AdHoc_N][0083d642cf9d6a024e81fa4f82353b9eee6e25041e001364a0fbc90c2c40e054] Attempted to send a bulk request but Elasticsearch appears to be unreachable or down {:message=>"Elasticsearch Unreachable: [https://X.Y.Z.67:9200/_bulk?filter_path=errors,items.*.error,items.*.status][Manticore::SocketTimeout] Read timed out", :exception=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError, :will_retry_in_seconds=>2}