I have a RHEL 6 Linux server (10.230.143.160) running ELK 5.3 and a Windows client (10.230.143.31) with Filebeat shipping logs to the server.
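For reference, the shipping path is the standard Filebeat-to-Logstash setup over the Beats protocol; the relevant config sections look roughly like this (trimmed down, and the log path shown is just a placeholder, not the real one):

filebeat.prospectors:
- input_type: log
  paths:
    - C:\logs\*.log          # placeholder path, not the actual one

output.logstash:
  hosts: ["10.230.143.160:5044"]

And on the server, the Logstash pipeline has a plain beats input listening on 5044:

input {
  beats {
    port => 5044
  }
}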
I’m seeing a very consistent issue where every few minutes the Filebeat log shows a message like the following:
ERR Failed to publish events caused by: write tcp 10.230.143.31:49778->10.230.143.160:5044: wsasend: An existing connection was forcibly closed by the remote host.
In context it looks something like this (with two such error occurrences shown):
16:59:15.678285 metrics.go:34: INFO No non-zero metrics in the last 30s
16:59:45.677992 metrics.go:34: INFO No non-zero metrics in the last 30s
16:59:56.159587 sync.go:85: ERR Failed to publish events caused by: write tcp 10.230.143.31:56623->10.230.143.160:5044: wsasend: An existing connection was forcibly closed by the remote host.
16:59:56.161588 single.go:91: INFO Error publishing events (retrying): write tcp 10.230.143.31:56623->10.230.143.160:5044: wsasend: An existing connection was forcibly closed by the remote host.
17:00:15.676703 metrics.go:39: INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=3 libbeat.logstash.publish.read_bytes=12 libbeat.logstash.publish.write_bytes=793 libbeat.logstash.publish.write_errors=1 libbeat.logstash.published_and_acked_events=2 libbeat.logstash.published_but_not_acked_events=1 libbeat.publisher.published_events=2 publish.events=2 registrar.states.update=2 registrar.writes=2
17:00:45.677425 metrics.go:34: INFO No non-zero metrics in the last 30s
{.....}
17:04:15.677447 metrics.go:34: INFO No non-zero metrics in the last 30s
17:04:45.677167 metrics.go:34: INFO No non-zero metrics in the last 30s
17:05:01.197065 sync.go:85: ERR Failed to publish events caused by: write tcp 10.230.143.31:57350->10.230.143.160:5044: wsasend: An existing connection was forcibly closed by the remote host.
17:05:01.199057 single.go:91: INFO Error publishing events (retrying): write tcp 10.230.143.31:57350->10.230.143.160:5044: wsasend: An existing connection was forcibly closed by the remote host.
17:05:15.677883 metrics.go:39: INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=3 libbeat.logstash.publish.read_bytes=12 libbeat.logstash.publish.write_bytes=780 libbeat.logstash.publish.write_errors=1 libbeat.logstash.published_and_acked_events=2 libbeat.logstash.published_but_not_acked_events=1 libbeat.publisher.published_events=2 publish.events=2 registrar.states.update=2 registrar.writes=2
17:05:45.677601 metrics.go:34: INFO No non-zero metrics in the last 30s
17:06:15.678323 metrics.go:34: INFO No non-zero metrics in the last 30s
This basic pattern repeats over and over again.
I wasn't able to find anything in the other ELK logs that explains this, so I ran tcpdump on the server to look at the TCP traffic. It shows that a TCP RST is being sent (apparently from Logstash to the Windows client) exactly 1 minute after the last activity on the connection. A few minutes later, when Filebeat next tries to send, it finds the connection has been killed, logs the "ERR Failed to publish events" message, and opens a new connection on a new client port. Then the sequence repeats.
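(For reference, the capture was taken on the server with a command along these lines; "eth0" stands in for whatever the server's actual interface name is:)

tcpdump -i eth0 -nn host 10.230.143.31 and port 5044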
Here is an annotated section of tcpdump output:
(The capture starts in the middle of an existing connection on client port 56623.)
12:54:36.456450 IP 10.230.143.31.56623 > 10.230.143.160.5044: Flags [P.], seq 399:789, ack 7, win 2053, length 390
12:54:36.458536 IP 10.230.143.160.5044 > 10.230.143.31.56623: Flags [P.], seq 7:13, ack 789, win 131, length 6
12:54:36.475740 IP 10.230.143.31.56623 > 10.230.143.160.5044: Flags [.], ack 13, win 2053, length 0
12:55:36.463928 IP 10.230.143.160.5044 > 10.230.143.31.56623: Flags [R.], seq 13, ack 789, win 131, length 0
(Reset: the "Flags [R.]" above is a TCP RST tearing down the connection on client port 56623, sent exactly 1 minute after the previous packet.)
(The next traffic, a few minutes after the reset, opens a new connection on client port 57350.)
12:59:22.495230 IP 10.230.143.31.57350 > 10.230.143.160.5044: Flags [SEW], seq 665829944, win 8192, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
12:59:22.495270 IP 10.230.143.160.5044 > 10.230.143.31.57350: Flags [S.E], seq 1306193595, ack 665829945, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
12:59:22.495455 IP 10.230.143.31.57350 > 10.230.143.160.5044: Flags [.], ack 1, win 2053, length 0
12:59:22.496064 IP 10.230.143.31.57350 > 10.230.143.160.5044: Flags [P.], seq 1:398, ack 1, win 2053, length 397
12:59:22.496082 IP 10.230.143.160.5044 > 10.230.143.31.57350: Flags [.], ack 398, win 123, length 0
12:59:22.499706 IP 10.230.143.160.5044 > 10.230.143.31.57350: Flags [P.], seq 1:7, ack 398, win 123, length 6
12:59:22.515098 IP 10.230.143.31.57350 > 10.230.143.160.5044: Flags [.], ack 7, win 2053, length 0
12:59:31.491422 IP 10.230.143.31.57350 > 10.230.143.160.5044: Flags [P.], seq 398:794, ack 7, win 2053, length 396
12:59:31.493391 IP 10.230.143.160.5044 > 10.230.143.31.57350: Flags [P.], seq 7:13, ack 794, win 131, length 6
12:59:31.501620 IP 10.230.143.31.57350 > 10.230.143.160.5044: Flags [.], ack 13, win 2053, length 0
13:00:31.492494 IP 10.230.143.160.5044 > 10.230.143.31.57350: Flags [R.], seq 13, ack 794, win 131, length 0
(Another reset has occurred, again exactly 1 minute after the last activity. The connection on client port 57350 is now dead.)
(The next traffic, a few minutes after the reset, opens a new connection on client port 58111.)
13:04:27.532702 IP 10.230.143.31.58111 > 10.230.143.160.5044: Flags [SEW], seq 93270169, win 8192, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
13:04:27.532765 IP 10.230.143.160.5044 > 10.230.143.31.58111: Flags [S.E], seq 668330447, ack 93270170, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
13:04:27.532979 IP 10.230.143.31.58111 > 10.230.143.160.5044: Flags [.], ack 1, win 2053, length 0
13:04:27.535877 IP 10.230.143.31.58111 > 10.230.143.160.5044: Flags [P.], seq 1:387, ack 1, win 2053, length 386
(And so on…)
Any thoughts on what might be causing this? Or how to resolve it?
It looks to me like this is happening at the TCP level rather than at the ELK application level, and yet the capture shows the reset coming from port 5044, which is Logstash. So is Logstash itself initiating the reset? Is it the Linux networking stack underneath Logstash? Or is it a reaction to something Filebeat is sending (or failing to send) during the idle period? So many questions.
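One thing I did notice while digging through the docs: if I'm reading them correctly, the Logstash beats input plugin has a client_inactivity_timeout setting that defaults to 60 seconds, which lines up suspiciously well with the 1-minute resets. I haven't tried changing it yet, but I assume it would look something like this on the server side (300 is just an arbitrary example value):

input {
  beats {
    port => 5044
    client_inactivity_timeout => 300   # seconds; the default is reportedly 60
  }
}

Is that the right knob, or is something else going on?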
Thanks,
Greg