Logstash.outputs.elasticsearch - Failed to perform request: Connection refused

I am running an on-prem 8.11.3 Elasticsearch + Logstash stack.
Logstash uses the elasticsearch output plugin to post events to Elasticsearch.
The current set-up uses 2 servers.

elasticsearch {
  hosts => [ "srv1:9200", "srv2:9200" ]
  data_stream_dataset => "ds-dataset"
  data_stream_namespace => "default"
  ssl_enabled => true
  ssl_certificate_authorities => "/etc/pki/tls/certs/ca-bundle.crt"
  api_key => "localapi:key"
}

Recently, there was an incident on one of these 2 Elasticsearch servers: the elasticsearch service failed and stayed down.

During this elasticsearch service downtime on one of the servers, Logstash constantly tried to connect to it and failed to post events, leading to lost events.

My question is: how can I make this more robust to avoid losing events?
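For reference, one common mitigation for this class of problem is Logstash's persistent queue, which buffers events on disk between the input and the output so they survive output failures and Logstash restarts. A minimal sketch in `logstash.yml` (the size and path are illustrative assumptions, not recommendations):

```yaml
# logstash.yml — enable the disk-backed queue so in-flight events
# survive output failures and Logstash restarts.
queue.type: persisted
queue.max_bytes: 4gb               # illustrative cap on disk usage for the queue
path.queue: /var/lib/logstash/queue   # assumed path; must be writable by logstash
```

The queue only buffers; if the disk cap is reached while the output stays down, back-pressure is applied upstream (to the beats input in this set-up).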

What do you mean by downtime? If the output doesn’t have a connection to the target elasticsearch server it shouldn’t be trying to post anything and therefore should not complain about failing to post events.

Please provide much more detail on why you think you are losing events.

By elastic service downtime, I meant when elasticsearch.service was in failed state and was not running on srv1.

During this downtime, logstash wrote this kind of message every second to its logfiles:

[2026-03-11T00:00:01,406][INFO ][logstash.outputs.elasticsearch][main] Failed to perform request {:message=>"Connect to srv1:9200 [srv1/10.1.1.1] failed: Connection refused", :exception=>Manticore::SocketException, :cause=>#<Java::OrgApacheHttpConn::HttpHostConnectException: Connect to srv1:9200 [srv1/10.1.1.1] failed: Connection refused>}
[2026-03-11T00:00:01,407][WARN ][logstash.outputs.elasticsearch][main] Attempted to resurrect connection to dead ES instance, but got an error {:url=>"https://srv1:9200/", :exception=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError, :message=>"Elasticsearch Unreachable: [https://srv1:9200/][Manticore::SocketException] Connect to srv1:9200 [srv1/10.1.1.1] failed: Connection refused"}

And I found entries in the application logfiles that never made it into Elasticsearch.

After restarting elasticsearch.service on srv1, the node re-joined the cluster and the situation returned to normal.

What was down? Just one server of the cluster or the entire cluster? Can you provide more context about your Elasticsearch cluster? How many nodes do you have, etc.?

Your Logstash output has 2 servers; if one of them is down it would use the other, and you should not lose any data unless the entire cluster was down.

Also, provide context about your logstash pipelines, what are your inputs?

It is a cluster of 5 servers.

logstash output elasticsearch plugin’s “hosts” parameter is set with 2 servers out of these 5.

Only one of the 2 servers had its elasticsearch.service in a failed state; the other one was running.

And at cluster level, the other 4 nodes were up and running. The cluster was running fine.

logstash inputs are all coming from filebeat.

After searching for a solution in Logstash/Elastic, I have come to the conclusion that the best way is to implement a local reverse proxy to manage traffic between Logstash and Elasticsearch.

I’ll use nginx to do the job.
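For what it's worth, the simplest way to put nginx in front of TLS-enabled Elasticsearch nodes is a stream (TCP) passthrough, so the existing certificates keep working end to end. A minimal sketch, assuming the hostnames and port from this thread and illustrative failure thresholds:

```nginx
# nginx.conf fragment — TCP passthrough in front of the two ES nodes.
# TLS is not terminated here, so the ES certificates are untouched.
stream {
    upstream es_nodes {
        server srv1:9200 max_fails=3 fail_timeout=30s;  # illustrative values
        server srv2:9200 max_fails=3 fail_timeout=30s;
    }
    server {
        listen 9200;
        proxy_pass es_nodes;
        proxy_connect_timeout 2s;  # fail over quickly on "Connection refused"
    }
}
```

With this shape, Logstash's `hosts` would point at the proxy instead of the nodes; note that the ES certificates must then be valid for the name Logstash uses to reach the proxy.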

Nginx will just do load balancing. As Budger and Leandro said, check the ES logs to find what caused the "Connection refused".

As I said, one of the 2 elasticsearch.service instances was down and in a failed state. The reason is unknown; it just crashed. Which is not an issue: the cluster was in perfect shape.

But still, logstash continued trying to connect to the failed elasticsearch instance.

I think nginx will manage this better than Logstash, which seems not to manage it at all.

If your cluster was up and just one of the nodes was offline, Logstash should still use the other node to send the data, unless it was unable to connect to it for other reasons.

You will need to troubleshoot this further. I don't think a reverse proxy between Logstash and Elasticsearch will do anything other than add another piece to the puzzle; Logstash manages the load balancing itself.

You didn't provide any context about your pipelines, so it is very complicated to try to understand what may have happened without it.

There is only one main pipeline here:

  • beats input
  • filters to filter and parse data
  • elasticsearch output using 2 servers, sending events to different data streams
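Put together, that description corresponds roughly to this pipeline shape (a sketch: the beats port and filter contents are placeholders, and the output block is the one quoted earlier in the thread):

```ruby
input {
  beats {
    port => 5044   # assumed default beats port
  }
}

filter {
  # parsing/filtering stages go here
}

output {
  elasticsearch {
    hosts => [ "srv1:9200", "srv2:9200" ]
    data_stream_dataset => "ds-dataset"
    data_stream_namespace => "default"
    ssl_enabled => true
    ssl_certificate_authorities => "/etc/pki/tls/certs/ca-bundle.crt"
    api_key => "localapi:key"
  }
}
```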

About troubleshooting: in the elasticsearch log, there is no error message explaining why it crashed/stopped. The elasticsearch GC log stopped a couple of minutes before Logstash started complaining that elastic was not reachable anymore.

So, I have no information in the logs explaining why it crashed. The only thing I have is the Logstash logs recording many, many attempts to resurrect the connection, every second, during the whole downtime.

Perfect? Not an issue? You have no idea why your node is crashing, but it's not an issue? That's ... unusual.

Are you symptom chasing instead of doing root cause analysis?

You obviously know your setup better than me/us. But as has been pointed out by a couple of the forum's top Jedi, with a config like:

elasticsearch {
  hosts => [ "srv1:9200", "srv2:9200" ]
  ...
}

and a correctly configured Logstash, you should not lose events if only one of those 2 endpoints is temporarily unavailable. You will see logs telling you that the endpoint is down; how many logs is determined by your specific logging configuration.

I finally found why elasticsearch failed.

The system ran out of memory and invoked the oom-killer.

I will just add a couple of GB to the VMs to avoid the same in future.
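One thing to check when resizing: the Elasticsearch heap is set in jvm.options, and the usual guidance is to keep it at or below half of the VM's RAM so the rest is left for the OS page cache. A sketch (4g is an assumed value, not a recommendation):

```
# /etc/elasticsearch/jvm.options.d/heap.options
# Set min and max heap to the same value; keep it <= ~50% of VM RAM.
-Xms4g
-Xmx4g
```

If the heap is sized close to total RAM, adding memory to the VM alone may not stop the oom-killer from firing again.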

And following your recommendations, I will keep the current set-up “as is”, with Logstash using at least 2 Elasticsearch servers in the “hosts” parameter.

Thank you so much for your recommendations/help.

Regards,

JM
