Logstash losing connection to Elasticsearch nodes


(Nick Peirson) #1

We've got two Logstash nodes that read from redis instances and output to an Elasticsearch cluster. The ES cluster has 8 instances. Three instances store the data on their own servers, three master nodes on their own servers that don't store any indices and two nodes on the same server as the Logstash instances configured purely to handle ingestion.

What we're seeing in the logstash logs is an ES instance being marked as dead due to connection reset, then immediately being brought back to life when a health check runs. This doesn't seem to affect logs being indexed initially, but eventually we stop getting new logs into the Elasticsearch index. This state then persists until we restart logstash. This affects each logstash instance independently, i.e. one will fail to submit logs while the other instance continues to submit logs with no problems. After restarting a logstash instance we don't see any ES nodes marked as dead in the logstash logs for a while (~1 hour).

Initially I thought this was related to ufw as we were seeing ufw dropping packets in syslog that correlated with the connection reset messages in the logstash logs, however when I disabled ufw on both sides (ES and Logstash servers) we continued to see the connection reset errors in the logstash logs, so the UFW errors appear to by a symptom rather than a cause.

There doesn't appear to be anything of note in the ES logs when the issue occurs.

Logstash logs:

[2018-08-21T00:04:09,982][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://elasticsearch4:9200/][Manticore::SocketException] Connection reset", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>64}
[2018-08-21T00:04:10,567][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://elasticsearch4:9200/, :path=>"/"}
[2018-08-21T00:04:10,575][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://elasticsearch4:9200/"}
[2018-08-21T00:06:18,786][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://elasticsearch4:9200/][Manticore::SocketException] Broken pipe (Write failed) {:url=>http://elasticsearch4:9200/, :error_message=>"Elasticsearch Unreachable: [http://elasticsearch4:9200/][Manticore::SocketException] Broken pipe (Write failed)", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
[2018-08-21T00:06:18,786][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://elasticsearch4:9200/][Manticore::SocketException] Broken pipe (Write failed)", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>64}
[2018-08-21T00:06:20,826][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://elasticsearch4:9200/, :path=>"/"}
[2018-08-21T00:06:20,833][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://elasticsearch4:9200/"}

Syslog:

Aug 21 00:04:09 logstash3 kernel: [1687551.513569] [UFW BLOCK] IN=enp0s31f6 OUT= MAC=90:1b:0e:cc:50:95:30:b6:4f:d8:05:51:08:00 SRC=xx.xx.xx.89 DST=xx.xx.xx.133 LEN=40 TOS=0x00 PREC=0x00 TTL=57 ID=56200 DF PROTO=TCP SPT=9200 DPT=60188 WINDOW=0 RES=0x00 RST URGP=0
Aug 21 00:06:35 logstash3 kernel: [1687697.665295] [UFW BLOCK] IN=enp0s31f6 OUT= MAC=90:1b:0e:cc:50:95:30:b6:4f:d8:05:51:08:00 SRC=xx.xx.xx.234 DST=xx.xx.xx.133 LEN=40 TOS=0x00 PREC=0x00 TTL=61 ID=61257 DF PROTO=TCP SPT=9200 DPT=60070 WINDOW=0 RES=0x00 RST URGP=0
Aug 21 00:06:35 logstash3 kernel: [1687697.665300] [UFW BLOCK] IN=enp0s31f6 OUT= MAC=90:1b:0e:cc:50:95:30:b6:4f:d8:05:51:08:00 SRC=xx.xx.xx.234 DST=xx.xx.xx.133 LEN=40 TOS=0x00 PREC=0x00 TTL=61 ID=61258 DF PROTO=TCP SPT=9200 DPT=60070 WINDOW=0 RES=0x00 RST URGP=0
Aug 21 00:06:49 logstash3 kernel: [1687712.280361] [UFW BLOCK] IN=enp0s31f6 OUT= MAC=90:1b:0e:cc:50:95:30:b6:4f:d8:05:51:08:00 SRC=xx.xx.xx.27 DST=xx.xx.xx.133 LEN=40 TOS=0x00 PREC=0x00 TTL=60 ID=13083 DF PROTO=TCP SPT=9200 DPT=55788 WINDOW=0 RES=0x00 RST URGP=0

Any thoughts on what I should be looking at?


(Nick Peirson) #2

I produced a packet capture (sudo tcpdump -i enp0s31f6 -s 65535 -w logstash.pcap 'port 9200') while we were seeing the error

[2018-08-23T07:20:11,108][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://elasticsearch3:9200/][Manticore::SocketException] Connection reset {:url=>http://elasticsearch3:9200/, :error_message=>"Elasticsearch Unreachable: [http://elasticsearch3:9200/][Manticore::SocketException] Connection reset", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}

In the packet capture I can see elasticsearch respond with a 400 error:

HTTP/1.0 400 Bad Request
content-type: application/json; charset=UTF-8
content-length: 203

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "text is empty (possibly HTTP/0.9)"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "text is empty (possibly HTTP/0.9)"
  },
  "status": 400
}

I then see 5 TCP RST packets from elasticsearch to logstash. The last RST packet was at 07:20:11.107663, which matches the time of the logstash error, 07:20:11,108.

I'm finding the packet capture a little confusing. I've filtered the packet capture down to the IPs of the logstash and ES server in question, and to the ports where we see the 400 and RSTs, however prior to the 400 there appears to be a _bulk request in progress, then a 414 response during the request in the middle of the _bulk request, then the _bulk request continues, then we get the 400:

Logstash -> ES

POST /_bulk HTTP/1.1
Connection: Keep-Alive
Content-Type: application/json
Content-Length: 167481062
Host: elasticsearch3:9200
User-Agent: Manticore 0.6.1
Accept-Encoding: gzip,deflate
{ ... data ... }

ES -> Logstash

HTTP/1.1 413 Request Entity Too Large
content-length: 0

Logstash -> ES

{ ... data continues ... }

ES -> Logstash

HTTP/1.0 400 Bad Request
content-type: application/json; charset=UTF-8
content-length: 203
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"text is empty (possibly HTTP/0.9)"}],"type":"illegal_argument_exception","reason":"text is empty (possibly HTTP/0.9)"},"status":400}

(Nick Peirson) #3

We're still seeing this on an almost daily basis and it would be great to get to the bottom of it. If there's any more information that would help to diagnose it, let me know and I'll get it added.


(Guy Boertje) #4

@Nick_Peirson
With all this valuable diagnostic data, I think you have found a bug.
Please create an issue on the Github repo of the plugin. Make sure you copy paste the two messages content.


(Nick Peirson) #5

Ok, thanks @guyboertje.

I did raise an issue, but I hadn't recieved any response either there or here, so I wasn't sure which was the correct place to be posting. I'll follow it up on the issue tracker.


(system) #6

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.