Logstash-output-elasticsearch Connection reset while trying to send bulk request to elasticsearch

pgrange · March 26, 2019, 1:46pm

Hello,

I have a logstash node which outputs data to an elasticsearch server. Both are hosted on a gcloud platform.

On the logstash server I often (several times a day) see Connection reset errors in the logs:

LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError
Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down!
Elasticsearch Unreachable: [http://elasticsearch.service.consul:9200/][Manticore::SocketException] Connection reset

As seen in another post, I tried to reduce validate_after_inactivity parameter to 2 seconds instead of the default 10 seconds value. I did that because connection reset really looks like something is wrong when trying to reuse a connection from the pool. But this did not help.

So I captured network traffic on both servers and waited for the error to occur again. And as a matter of fact, I did capture RST TCP packets sent by elasticsearch when the error occur.

Here is the story, as seen by logstash:

logstash talks to elasticsearch in the tcp connection and all is fine
at a moment in time, the connection is not used anymore but is kept open
several hours later (8 in one of the occurences), logstash tries to send data again in this connection but this time elasticsearch answer RST and logstash complaints about a connection having been reset.

Here is the story, as seen by elasticsearch:

elasticsearch receives bulks data from logstash in the tcp connection and all is fine
at a moment in time, the connection is not used anymore but is kept open
Two hours later (which matches default tcp timeout for inactive connections) elasticsearch sends a packet on the unused connection to check its status
No answer is received from logstash (and, as a matter of fact, logstash never received this packet) and so elasticsearch retries
After several failed attempts, elasticsearch closes the TCP connection with logstash
Several hours later (6 in the same occurence), elasticsearch receives a packet from logstash on this closed connection and refuses it, sending a RST packet.

So my interpretation is that logstash uses some connection pool to talk to elasticsearch and does not close these connections, keeping them open but silent. After a while, some network component (a firewall) closes the silent tcp session so that elasticsearch attempts to check for the liveness of the connection will fail.
This lead us to a half closed connection. Not a surprise here but what surprises me is that, when logstash wants to reuse this half-closed connection, the connection pool (I guess) does not detect this half-closed status and give it to logstash, which fail sending data to elasticsearch and complaints.

It seems that, at one point or another, logstash plugin relies on the isStale method specified here to check for the healthyness of the connection before using it:

github.com

apache/httpcomponents-core/blob/74753eb5d8c9e6aded21fe5a9a8c872447ef0e6b/httpcore5/src/main/java/org/apache/hc/core5/http/io/BHttpConnection.java

/*
 * ====================================================================
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 * ====================================================================
 *

This file has been truncated. show original

That specification is quite odd and would explain the issue because half-closed connection can't be detected by just reading data from it as explained here: https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html

But all that seem quite odd so I believe I missed something in my analysis. Does anyone experience the same kind of issues on their platform? How do you solve it?

Regards,
Pascal.

pgrange · March 26, 2019, 5:13pm

This gcloud doc explicit the fact that tcp connexion are closed after 10 minutes of inactivity:
https://cloud.google.com/vpc/docs/firewalls?hl=en

Setting this system parameters works around the issue by letting the OS send keepalive packets so that firewall won't close tcp connection:

net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_intvl=60 net.ipv4.tcp_keepalive_probes=5

system · April 23, 2019, 5:13pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Elasticsearch output plugin: frequent Connection Reset errors Logstash	1	1799	July 6, 2017
Logstash stop sending the logs to the ElasticSearch Logstash	11	5523	July 6, 2017
Logstash losing connection to Elasticsearch nodes Logstash	5	4543	October 16, 2018
Attempted to send a bulk request to elasticsearch, but no there are no living connections in the connection pool Logstash	3	2626	August 12, 2021
Logstash can't talk to elasticsearch after several hours' running Logstash	7	1813	July 6, 2017

Logstash-output-elasticsearch Connection reset while trying to send bulk request to elasticsearch

Related topics