Topbeat 1.0 stops sending events although it is running

If it detects EOF, can it recover itself after a while by re-establishing a connection?

Yes, on error the connection is closed and a new connection with an initial window size of one is created.
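
For illustration, here is a minimal, self-contained Go sketch of that behaviour. This is not the actual topbeat/libbeat code; sendBatch and the simulated failure pattern are made up, and it only shows the "close on error, reconnect, start again with window size one" cycle.

package main

import (
    "errors"
    "fmt"
)

var attempts int

// sendBatch simulates publishing a batch over one connection: the first two
// connections get closed by the server (EOF), the third one succeeds.
func sendBatch(window int) error {
    attempts++
    if attempts <= 2 {
        return errors.New("EOF") // what topbeat logs as "Error publishing events (retrying): EOF"
    }
    return nil
}

func main() {
    window := 1 // a fresh connection always starts with window size 1
    for {
        fmt.Println("connect, window size:", window)
        if err := sendBatch(window); err != nil {
            fmt.Println("close connection:", err)
            window = 1 // reset: the next connection probes with a single event again
            continue   // (the real client also sleeps for a backoff interval here)
        }
        fmt.Println("batch acknowledged, window may grow from here")
        return
    }
}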

I think this is what is going wrong. It is not re-establishing the connection for a long period. If my cluster has been red for a day, it gets stuck somewhere.

So topbeat is connecting OK, but the connection gets closed by logstash. If the cluster is red for a day, logstash should (almost) stop making progress and its internal queues fill up. When the internal queues in logstash are full and logstash's internal timeout hits the beats plugin, the beats plugin closes the connection. Hence the logstash messages:

{:timestamp=>"2015-12-14T17:21:34.418000+0000", :message=>"CircuitBreaker::Close", :name=>"Beats input", :level=>:warn}
{:timestamp=>"2015-12-14T17:21:39.424000+0000", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Beats input", :exception=>LogStash::SizedQueueTimeout::TimeoutError, :level=>:warn}
...

In topbeat, on INFO Error publishing events (retrying): EOF the connection is closed. After INFO backoff retry a new TCP connection is established. Once a TCP connection is established, we probe whether logstash is OK-ish by sending only one event: Try to publish %!s(int=52) events to logstash with window size %!s(int=1). The message %!s(int=0) events out of %!s(int=52) events sent to logstash indicates nothing was sent, and another INFO Error publishing events (retrying): EOF indicates the connection was closed server-side by the circuit breaker in logstash, so we close the connection on the client side and wait again for the backoff interval...

So it is not topbeat failing to re-establish connections after a long period; it is logstash killing connections when new events are sent to already filled-up queues (after a configurable timeout). If the Elasticsearch cluster is red, logstash's internal pipes fill up due to the back pressure generated by very slow or non-responding Elasticsearch nodes. The congestion_threshold setting in logstash decides when to time out and kill connections once the pipes fill up.
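
To make that mechanism concrete, here is a small Go analogy of a bounded queue with a timeout. This is not logstash's implementation; the queue size, timeout, and names are invented purely for illustration of the behaviour described above.

package main

import (
    "errors"
    "fmt"
    "time"
)

// errQueueTimeout plays the role of LogStash::SizedQueueTimeout::TimeoutError.
var errQueueTimeout = errors.New("internal queue full: timed out")

// enqueue tries to hand one event to a bounded pipeline queue within a timeout.
func enqueue(queue chan string, event string, timeout time.Duration) error {
    select {
    case queue <- event:
        return nil
    case <-time.After(timeout):
        return errQueueTimeout
    }
}

func main() {
    queue := make(chan string, 2) // tiny bounded queue so it fills up immediately
    queue <- "event-1"
    queue <- "event-2" // nothing drains the queue while the cluster is "red"

    // The next event cannot be queued before the timeout expires, so the
    // server side would now close the client connection -- which the beat
    // then sees as EOF.
    if err := enqueue(queue, "event-3", 500*time.Millisecond); err != nil {
        fmt.Println("closing client connection:", err)
    }
}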

Thanks Steffen. But this does not explain why all the Windows topbeat instances successfully re-establish with logstash when it comes back, while the Linux ones do not. Those that do not have to be killed with kill -9. That would indicate the topbeat Linux code is getting into some state which puts it into a coma, not just a sleep. The Windows instances do not have this problem; I have never had to restart a Windows one in the last couple of months. My cluster has gone through all sorts of issues and has been red for days, but when the cluster came back, the Windows instances resumed perfectly fine. Only the Linux ones give me all this grief.

From your logs I can tell that all topbeat instances successfully re-establish their logstash connections! It is logstash closing the connections upon receiving the first event. The kill -9 is annoying, but unrelated! There is no special Linux vs. Windows code path for shipping reports. As long as your logs are spammed with INFO send fail messages, topbeat is actually making progress; it does not lock up.

Do you see the message WARN Ignoring tick(s) due to processing taking longer than one period from time to time?

I ran some more tests with topbeat -> logstash -> elasticsearch, killing elasticsearch and restarting it after 24 hours. I always found topbeat to recover OK. I added some connect/close-connection messages to the shipper code base so that reconnects are more visible:

  • kill elasticsearch
  • wait for a looong time (24h)
  • kill topbeat (CTRL-C) -> hangs
  • restart elasticsearch -> topbeat sends the hanging events and quits
2015/11/05 14:34:48.737104 client.go:76: DBG  connect
2015/11/05 14:34:48.738077 client.go:122: DBG  Try to publish 111 events to logstash with window size 1
2015/11/05 14:35:18.740353 client.go:100: DBG  0 events out of 111 events sent to logstash. Continue sending ...
2015/11/05 14:35:18.740386 client.go:122: DBG  Try to publish 111 events to logstash with window size 1
2015/11/05 14:35:19.807709 client.go:81: DBG  close connection
2015/11/05 14:35:19.807919 client.go:100: DBG  0 events out of 111 events sent to logstash. Continue sending ...
2015/11/05 14:35:19.807938 single.go:75: INFO Error publishing events (retrying): EOF
2015/11/05 14:35:19.807943 single.go:143: INFO send fail
2015/11/05 14:35:19.807952 single.go:150: INFO backoff retry: 1s
2015/11/05 14:35:20.809983 client.go:76: DBG  connect
2015/11/05 14:35:20.810878 client.go:122: DBG  Try to publish 111 events to logstash with window size 1
2015/11/05 14:35:20.815502 client.go:81: DBG  close connection
2015/11/05 14:35:20.815647 client.go:100: DBG  0 events out of 111 events sent to logstash. Continue sending ...
2015/11/05 14:35:20.815742 single.go:75: INFO Error publishing events (retrying): EOF
2015/11/05 14:35:20.815749 single.go:143: INFO send fail
2015/11/05 14:35:20.815754 single.go:150: INFO backoff retry: 2s
2015/11/05 14:35:22.816154 client.go:76: DBG  connect
2015/11/05 14:35:22.816875 client.go:122: DBG  Try to publish 111 events to logstash with window size 1
2015/11/05 14:35:52.821288 client.go:81: DBG  close connection
2015/11/05 14:35:52.821435 client.go:100: DBG  0 events out of 111 events sent to logstash. Continue sending ...
2015/11/05 14:35:52.821456 single.go:75: INFO Error publishing events (retrying): read tcp 10.0.2.15:45872->192.168.188.38:5044: i/o timeout
2015/11/05 14:35:52.821464 single.go:143: INFO send fail
2015/11/05 14:35:52.821472 single.go:150: INFO backoff retry: 4s
2015/11/05 14:35:56.823587 client.go:76: DBG  connect
2015/11/05 14:35:56.824340 client.go:122: DBG  Try to publish 111 events to logstash with window size 1
2015/11/05 14:35:56.827772 client.go:81: DBG  close connection
2015/11/05 14:35:56.827979 client.go:100: DBG  0 events out of 111 events sent to logstash. Continue sending ...
2015/11/05 14:35:56.827993 single.go:75: INFO Error publishing events (retrying): EOF
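
The backoff intervals visible above (1s, 2s, 4s) come from an exponential backoff between retries. A rough sketch of that scheme follows; it is not the actual libbeat code, and the 60s cap is only an assumption for illustration.

package main

import (
    "fmt"
    "time"
)

func main() {
    backoff := 1 * time.Second
    const maxBackoff = 60 * time.Second // assumed cap, not taken from the thread

    for i := 0; i < 8; i++ {
        fmt.Println("backoff retry:", backoff)
        backoff *= 2 // double the wait after every failed attempt
        if backoff > maxBackoff {
            backoff = maxBackoff
        }
    }
}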

After the restart of elasticsearch, all the buffered-up events get published (if all topbeat instances do so at the same time, logstash might get overloaded?):

...

... (did hit character limit)

2015/11/05 14:35:56.828088 client.go:76: DBG  connect
2015/11/05 14:35:56.830796 client.go:122: DBG  Try to publish 112 events to logstash with window size 1
2015/11/05 14:35:56.840095 topbeat.go:145: WARN Ignoring tick(s) due to processing taking longer than one period
2015/11/05 14:36:24.894398 client.go:183: DBG  update current window size: 1
2015/11/05 14:36:24.894420 client.go:100: DBG  1 events out of 112 events sent to logstash. Continue sending ...
2015/11/05 14:36:24.894424 client.go:122: DBG  Try to publish 111 events to logstash with window size 2
2015/11/05 14:36:24.900626 client.go:183: DBG  update current window size: 2
2015/11/05 14:36:24.900649 client.go:100: DBG  2 events out of 111 events sent to logstash. Continue sending ...
2015/11/05 14:36:24.900653 client.go:122: DBG  Try to publish 109 events to logstash with window size 3
2015/11/05 14:36:24.948723 client.go:183: DBG  update current window size: 3
2015/11/05 14:36:24.948746 client.go:100: DBG  3 events out of 109 events sent to logstash. Continue sending ...
2015/11/05 14:36:24.948751 client.go:122: DBG  Try to publish 106 events to logstash with window size 5
2015/11/05 14:36:24.993375 client.go:183: DBG  update current window size: 5
2015/11/05 14:36:24.993754 client.go:100: DBG  5 events out of 106 events sent to logstash. Continue sending ...
2015/11/05 14:36:24.994259 client.go:122: DBG  Try to publish 101 events to logstash with window size 8
2015/11/05 14:36:25.042508 client.go:183: DBG  update current window size: 8
2015/11/05 14:36:25.042970 client.go:100: DBG  8 events out of 101 events sent to logstash. Continue sending ...
2015/11/05 14:36:25.043331 client.go:122: DBG  Try to publish 93 events to logstash with window size 12
2015/11/05 14:36:25.088130 client.go:183: DBG  update current window size: 12
2015/11/05 14:36:25.088156 client.go:100: DBG  12 events out of 93 events sent to logstash. Continue sending ...
2015/11/05 14:36:25.088161 client.go:122: DBG  Try to publish 81 events to logstash with window size 18
2015/11/05 14:36:25.177886 client.go:183: DBG  update current window size: 18
2015/11/05 14:36:25.178187 client.go:100: DBG  18 events out of 81 events sent to logstash. Continue sending ...
2015/11/05 14:36:25.178534 client.go:122: DBG  Try to publish 63 events to logstash with window size 27
2015/11/05 14:36:25.300451 client.go:183: DBG  update current window size: 27
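
The window sizes in this log grow multiplicatively once events are acknowledged again: 1, 2, 3, 5, 8, 12, 18, 27. The small sketch below reproduces that progression; the exact growth rule in libbeat may differ, and the ~1.5x factor is simply read off the log above.

package main

import "fmt"

func main() {
    window := 1
    batch := 112 // size of the pending batch in the log above

    for window < batch {
        fmt.Println("window size:", window)
        window = (window*3 + 1) / 2 // ~1.5x growth, rounded up: 1, 2, 3, 5, 8, 12, 18, 27, ...
    }
}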

I tested topbeat running on Debian Jessie.

What's your topbeat config? What's your logstash config? And what have you tried changing in your configuration to fix the problem so far (e.g. topbeat, logstash, firewalls)?

Actually, topbeat stops logging when it hangs. If you look at the very first post in this thread, you will see that topbeat had stopped logging since Dec-03, and I checked on Dec-08. When the Linux topbeat hangs, it stops logging.

I am using Topbeat 1.0.1. The topbeat config is:

input:
  period: 60

  procs: [".*"]

  stats:
    system: true
    proc: true
    filesystem: true

output:

  logstash:
    hosts: ["logstash1.nat.bt.com:5047","logstash2.nat.bt.com:5047","logstash3.nat.bt.com:5047"]

    loadbalance: true

shipper:

logging:

  files:

Unfortunately I cannot reproduce the failure, and I have no idea where exactly it is supposed to hang.

Next time topbeat hangs (and no longer writes to syslog), can you get me some stack traces (like 2 or 3 traces taken 10 seconds apart)? Maybe we get lucky and find something interesting in the traces.

We added some support for getting stack traces in 1.1.0. You have to get the most recent 1.1.0 nightlies from here.

In order to get stack traces you have to start topbeat 1.1.0 with -httpprof 127.0.0.1:6060. This starts an HTTP server listening on port 6060 for introspection. By using 127.0.0.1 we bind to localhost only, so the port is only accessible from localhost (just use :6060 if you want remote access).
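
For background, an -httpprof style flag in a Go program typically just exposes the runtime's built-in pprof handlers over HTTP. The standalone sketch below shows that standard pattern; it is not topbeat's actual code.

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Bind to localhost only, analogous to -httpprof 127.0.0.1:6060.
    go func() {
        log.Println(http.ListenAndServe("127.0.0.1:6060", nil))
    }()

    select {} // stand-in for the program's real work
}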

Once enabled you can store a trace via:

curl http://localhost:6060/debug/pprof/goroutine?debug=2 > "trace-$(date '+%Y-%m-%d_%R:%S').txt"