We've got two Logstash nodes that read from redis instances and output to an Elasticsearch cluster. The ES cluster has 8 instances. Three instances store the data on their own servers, three master nodes on their own servers that don't store any indices and two nodes on the same server as the Logstash instances configured purely to handle ingestion.
What we're seeing in the logstash logs is an ES instance being marked as dead due to connection reset
, then immediately being brought back to life when a health check runs. This doesn't seem to affect logs being indexed initially, but eventually we stop getting new logs into the Elasticsearch index. This state then persists until we restart logstash. This affects each logstash instance independently, i.e. one will fail to submit logs while the other instance continues to submit logs with no problems. After restarting a logstash instance we don't see any ES nodes marked as dead in the logstash logs for a while (~1 hour).
Initially I thought this was related to ufw as we were seeing ufw dropping packets in syslog that correlated with the connection reset messages in the logstash logs, however when I disabled ufw on both sides (ES and Logstash servers) we continued to see the connection reset errors in the logstash logs, so the UFW errors appear to by a symptom rather than a cause.
There doesn't appear to be anything of note in the ES logs when the issue occurs.
Logstash logs:
[2018-08-21T00:04:09,982][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://elasticsearch4:9200/][Manticore::SocketException] Connection reset", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>64}
[2018-08-21T00:04:10,567][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://elasticsearch4:9200/, :path=>"/"}
[2018-08-21T00:04:10,575][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://elasticsearch4:9200/"}
[2018-08-21T00:06:18,786][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://elasticsearch4:9200/][Manticore::SocketException] Broken pipe (Write failed) {:url=>http://elasticsearch4:9200/, :error_message=>"Elasticsearch Unreachable: [http://elasticsearch4:9200/][Manticore::SocketException] Broken pipe (Write failed)", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
[2018-08-21T00:06:18,786][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://elasticsearch4:9200/][Manticore::SocketException] Broken pipe (Write failed)", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>64}
[2018-08-21T00:06:20,826][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://elasticsearch4:9200/, :path=>"/"}
[2018-08-21T00:06:20,833][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://elasticsearch4:9200/"}
Syslog:
Aug 21 00:04:09 logstash3 kernel: [1687551.513569] [UFW BLOCK] IN=enp0s31f6 OUT= MAC=90:1b:0e:cc:50:95:30:b6:4f:d8:05:51:08:00 SRC=xx.xx.xx.89 DST=xx.xx.xx.133 LEN=40 TOS=0x00 PREC=0x00 TTL=57 ID=56200 DF PROTO=TCP SPT=9200 DPT=60188 WINDOW=0 RES=0x00 RST URGP=0
Aug 21 00:06:35 logstash3 kernel: [1687697.665295] [UFW BLOCK] IN=enp0s31f6 OUT= MAC=90:1b:0e:cc:50:95:30:b6:4f:d8:05:51:08:00 SRC=xx.xx.xx.234 DST=xx.xx.xx.133 LEN=40 TOS=0x00 PREC=0x00 TTL=61 ID=61257 DF PROTO=TCP SPT=9200 DPT=60070 WINDOW=0 RES=0x00 RST URGP=0
Aug 21 00:06:35 logstash3 kernel: [1687697.665300] [UFW BLOCK] IN=enp0s31f6 OUT= MAC=90:1b:0e:cc:50:95:30:b6:4f:d8:05:51:08:00 SRC=xx.xx.xx.234 DST=xx.xx.xx.133 LEN=40 TOS=0x00 PREC=0x00 TTL=61 ID=61258 DF PROTO=TCP SPT=9200 DPT=60070 WINDOW=0 RES=0x00 RST URGP=0
Aug 21 00:06:49 logstash3 kernel: [1687712.280361] [UFW BLOCK] IN=enp0s31f6 OUT= MAC=90:1b:0e:cc:50:95:30:b6:4f:d8:05:51:08:00 SRC=xx.xx.xx.27 DST=xx.xx.xx.133 LEN=40 TOS=0x00 PREC=0x00 TTL=60 ID=13083 DF PROTO=TCP SPT=9200 DPT=55788 WINDOW=0 RES=0x00 RST URGP=0
Any thoughts on what I should be looking at?