Logstash-output-elasticsearch Connection reset while trying to send bulk request to elasticsearch

Hello,

I have a logstash node which outputs data to an elasticsearch server. Both are hosted on a gcloud platform.

On the logstash server I often (several times a day) see Connection reset errors in the logs:

LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError
Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down!
Elasticsearch Unreachable: [http://elasticsearch.service.consul:9200/][Manticore::SocketException] Connection reset

As seen in another post, I tried to reduce validate_after_inactivity parameter to 2 seconds instead of the default 10 seconds value. I did that because connection reset really looks like something is wrong when trying to reuse a connection from the pool. But this did not help.

So I captured network traffic on both servers and waited for the error to occur again. And as a matter of fact, I did capture RST TCP packets sent by elasticsearch when the error occur.

Here is the story, as seen by logstash:

  • logstash talks to elasticsearch in the tcp connection and all is fine
  • at a moment in time, the connection is not used anymore but is kept open
  • several hours later (8 in one of the occurences), logstash tries to send data again in this connection but this time elasticsearch answer RST and logstash complaints about a connection having been reset.

Here is the story, as seen by elasticsearch:

  • elasticsearch receives bulks data from logstash in the tcp connection and all is fine
  • at a moment in time, the connection is not used anymore but is kept open
  • Two hours later (which matches default tcp timeout for inactive connections) elasticsearch sends a packet on the unused connection to check its status
  • No answer is received from logstash (and, as a matter of fact, logstash never received this packet) and so elasticsearch retries
  • After several failed attempts, elasticsearch closes the TCP connection with logstash
  • Several hours later (6 in the same occurence), elasticsearch receives a packet from logstash on this closed connection and refuses it, sending a RST packet.

So my interpretation is that logstash uses some connection pool to talk to elasticsearch and does not close these connections, keeping them open but silent. After a while, some network component (a firewall) closes the silent tcp session so that elasticsearch attempts to check for the liveness of the connection will fail.
This lead us to a half closed connection. Not a surprise here but what surprises me is that, when logstash wants to reuse this half-closed connection, the connection pool (I guess) does not detect this half-closed status and give it to logstash, which fail sending data to elasticsearch and complaints.

It seems that, at one point or another, logstash plugin relies on the isStale method specified here to check for the healthyness of the connection before using it:

That specification is quite odd and would explain the issue because half-closed connection can't be detected by just reading data from it as explained here: https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html

But all that seem quite odd so I believe I missed something in my analysis. Does anyone experience the same kind of issues on their platform? How do you solve it?

Regards,
Pascal.

This gcloud doc explicit the fact that tcp connexion are closed after 10 minutes of inactivity:
https://cloud.google.com/vpc/docs/firewalls?hl=en

Setting this system parameters works around the issue by letting the OS send keepalive packets so that firewall won't close tcp connection:

net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_intvl=60 net.ipv4.tcp_keepalive_probes=5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.