Topbeat 1.0 stops sending events although it is running

ps aux shows:

root 10592 0.3 7.5 411648 305972 ? Sl Nov29 39:17 ./topbeat -v -e -c topbeat.yml

So it is running.

But the log shows:

2015/12/03 21:35:35.773149 single.go:143: INFO send fail
2015/12/03 21:35:35.773160 single.go:150: INFO backoff retry: 4s
2015/12/03 21:35:44.818988 single.go:75: INFO Error publishing events (retrying): EOF
2015/12/03 21:35:44.819013 single.go:143: INFO send fail
2015/12/03 21:35:44.819024 single.go:150: INFO backoff retry: 8s
2015/12/03 21:35:52.850941 single.go:75: INFO Error publishing events (retrying): EOF
2015/12/03 21:37:12.403316 single.go:75: INFO Error publishing events (retrying): EOF
2015/12/03 21:37:12.403355 single.go:143: INFO send fail
2015/12/03 21:37:12.403367 single.go:150: INFO backoff retry: 1s
2015/12/03 21:37:36.017802 single.go:75: INFO Error publishing events (retrying): EOF
2015/12/03 21:37:36.017835 single.go:143: INFO send fail
2015/12/03 21:37:36.017855 single.go:150: INFO backoff retry: 2s
2015/12/03 21:37:43.070760 single.go:75: INFO Error publishing events (retrying): EOF
2015/12/03 21:37:43.070808 single.go:143: INFO send fail
2015/12/03 21:37:43.070821 single.go:150: INFO backoff retry: 4s
2015/12/03 21:37:52.113313 single.go:75: INFO Error publishing events (retrying): EOF
2015/12/03 21:37:52.113375 single.go:143: INFO send fail
2015/12/03 21:37:52.113388 single.go:150: INFO backoff retry: 8s
2015/12/03 22:38:20.893771 single.go:75: INFO Error publishing events (retrying): EOF
2015/12/03 22:38:20.893881 single.go:143: INFO send fail
2015/12/03 22:38:20.893897 single.go:150: INFO backoff retry: 1s
2015/12/03 22:38:21.926014 single.go:75: INFO Error publishing events (retrying): EOF
2015/12/03 22:38:21.926041 single.go:143: INFO send fail
2015/12/03 22:38:21.926052 single.go:150: INFO backoff retry: 2s
2015/12/03 22:38:28.953338 single.go:75: INFO Error publishing events (retrying): EOF
2015/12/03 22:38:28.953382 single.go:143: INFO send fail
2015/12/03 22:38:28.953395 single.go:150: INFO backoff retry: 4s
2015/12/03 22:38:37.979739 single.go:75: INFO Error publishing events (retrying): EOF
2015/12/03 22:39:36.409017 single.go:75: INFO Error publishing events (retrying): EOF
2015/12/03 22:39:36.409097 single.go:143: INFO send fail
2015/12/03 22:39:36.409111 single.go:150: INFO backoff retry: 1s
2015/12/03 22:39:37.449252 single.go:75: INFO Error publishing events (retrying): EOF
2015/12/03 22:39:37.449280 single.go:143: INFO send fail
2015/12/03 22:39:37.449308 single.go:150: INFO backoff retry: 2s

Note that the last log entries are from Dec 03. Since then it hasn't published any events, but the log does not say why.

I am having this problem on multiple servers. Some servers work, some do not. They are all sending data to a centralized Logstash instance. Logstash is running fine, as it is receiving events from some servers.

A side question, how can I detect if topbeat has stopped sending events and restart it automatically?

Ideally, you'd want to add debug level tracing to figure out why it's failing to publish.

Are you sending topbeat data through Logstash? Any reason not to send it directly to Elasticsearch? It seems like the publisher can't connect to your Logstash / Elasticsearch instance. If you restart topbeat, does it send events again? You can use the -d "*" flag to see more details.

I need to use Logstash to 'hide' Elasticsearch from others. I use HAProxy to load-balance 3 Logstash nodes, each writing to an Elasticsearch cluster.

Is there a way topbeat and filebeat can detect that Logstash is back up and resume their job, no matter how long it takes?

All beats try to reconnect to Logstash/Elasticsearch if the remote has become unavailable.

  • Run your beats in debug mode for more info.
  • Check the logstash log to see if it has tripped the circuit breaker / slowed down in the pipeline (I used to have this issue when I sent topbeat, packetbeat and filebeat via logstash. The issue went away after I switched to sending topbeat and packetbeat events directly to ES, relieving the load on logstash)

My logstash log is full of this:

{:timestamp=>"2015-12-14T17:21:34.418000+0000", :message=>"CircuitBreaker::Close", :name=>"Beats input", :level=>:warn}
{:timestamp=>"2015-12-14T17:21:39.424000+0000", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Beats input", :exception=>LogStash::SizedQueueTimeout::TimeoutError, :level=>:warn}
{:timestamp=>"2015-12-14T17:21:39.425000+0000", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Beats input", :exception=>LogStash::SizedQueueTimeout::TimeoutError, :level=>:warn}
{:timestamp=>"2015-12-14T17:21:39.425000+0000", :message=>"Beats input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::CircuitBreaker::HalfOpenBreaker, :level=>:warn}
{:timestamp=>"2015-12-14T17:21:39.426000+0000", :message=>"Beats input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::CircuitBreaker::HalfOpenBreaker, :level=>:warn}
{:timestamp=>"2015-12-14T18:22:00.396000+0000", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Beats input", :exception=>LogStash::SizedQueueTimeout::TimeoutError, :level=>:warn}
{:timestamp=>"2015-12-14T18:22:00.398000+0000", :message=>"Beats input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::CircuitBreaker::HalfOpenBreaker, :level=>:warn}

I am using Logstash 2.0.0, ES 2.1.0

elasticsearch {
  hosts => ["127.0.0.1:9200"]
}

This looks more like a Logstash or Elasticsearch performance problem.

Logstash uses a pipeline of (up to) 3 processing steps: 1. input, 2. filter, 3. output.

The input plugin uses a circuit breaker to kill the connection if the filter/output layers generate too much back pressure. All connection handlers pass events to the same queue, which is protected by the circuit breaker. Maybe you've got too many servers reporting to logstash at about the same time. Topbeat sends data in bursts, because it acquires data at evenly spaced intervals. That is, there might be short periods in which logstash/elasticsearch is overloaded.

You can try to reduce bulk_max_size in the logstash output section of your topbeat config. Try values of 10, 20 or 50, for example. The advantage of reduced bulk sizes is that fewer events are passed to the circuit breaker at once (but it might make little difference if LS is heavily overloaded).
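
For reference, a reduced bulk size would go in the logstash output section of topbeat.yml, something like the sketch below (the host is a placeholder):

```yaml
# topbeat.yml — output section (host/port are placeholders)
output:
  logstash:
    hosts: ["logstash.example.com:5044"]
    # send smaller batches so fewer events hit the circuit breaker at once
    bulk_max_size: 20
```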

Anything interesting in logstash/elasticsearch logs?

More findings: the topbeat instances on Windows are all running fine and sending data, even when logstash or elasticsearch had problems and later recovered. The Windows version successfully detects the availability of logstash and resumes sending events.

But the Linux version stops sending events when there's a problem with logstash or elasticsearch. It gets into a sleep state. I cannot even kill it without using -9. So, something is wrong with the Linux version.

The way to reproduce it would be:

Set up a 3-node ES cluster.
Set replication to 2.
Turn off one node, so that you end up with many unassigned shards and the cluster state goes red.
Logstash should show pipeline problems.

Wait for a day.

Recover Elasticsearch.
Now you will see that the Linux topbeat does not recover, but the Windows topbeat does.

I have seen this problem happen every time the cluster state goes red due to node issues.

Hmmm, that's weird. Windows/Linux/OS X use the very same implementation. I've tested this kind of setup by killing logstash (but only for days).

So, let me recapitulate.

  • Your setup is like: topbeat -> LS -> ES
  • logstash is running, but ES goes down (in this case logstash should generate some backpressure)
  • when ES node goes down you're seeing 'CircuitBreaker' issues in logstash logs
  • next you see topbeat 'INFO send fail' due to connection loss?

After the ES node recovers, do you still see error messages in the logstash logs? Do you still see error messages from topbeat? Have you checked with the netstat/ss tool whether topbeat really tries to establish a connection to the remote logstash?

What's your logstash output plugin config in topbeat for Windows and Linux? Have you tried to reduce bulk_max_size?

Have you considered having topbeat send directly to elasticsearch for testing? Like, does it recover from node failure when sending to Elasticsearch (for me it did)?

Did you find anything else strange in /var/log/syslog?

Also concerning but unrelated: why is your cluster in a red state for a full day? Why does your cluster go red every so often?

Hi Steffens,
Here's the last log from a linux topbeat which has stopped sending events since 19th:

2015/12/19 20:08:01.804280 single.go:75: INFO Error publishing events (retrying): EOF
2015/12/19 20:08:06.841387 single.go:75: INFO Error publishing events (retrying): EOF
2015/12/19 20:08:06.841421 single.go:143: INFO send fail
2015/12/19 20:08:06.841439 single.go:150: INFO backoff retry: 1s

and there's one that is working:

2015/12/22 01:25:34.970445 single.go:75: INFO Error publishing events (retrying): EOF
2015/12/22 01:25:34.970538 single.go:143: INFO send fail
2015/12/22 01:25:34.970651 single.go:150: INFO backoff retry: 2s

I have seen in both logs there's an entry:

2015/12/19 01:10:30.928961 single.go:75: INFO Error publishing events (retrying): read tcp 10.187.147.59:59169->10.35.76.36:5047: i/o timeout

But still, one is working after this and the other one is not.

The above are both Linux. One of them keeps going; the other has hung.

The one that is hung cannot be terminated with a plain kill command.

Since I now have some Linux instances working and some not working, I would eliminate a logstash or elasticsearch problem as the culprit, because that would make all of them fail. I do get elasticsearch going into red state every now and then. I sometimes see the topbeat index with some shards in initializing mode, taking a long time to recover. However, any problem with logstash or ES should make all the Linux instances fail, not just some.

Is there some way to know if topbeat has been publishing events regularly, without using the -d PUBLISH flag, which is quite verbose? I would like to have a one-liner appended to the log every time it sends events. I could then write a script to detect whether it has been doing its job and, if not, kill and restart it.

The i/o timeout + backoff in both topbeat services is a problem in general. Topbeat still progresses, but logstash is not processing events in time. The main problem is logstash not responding in time, for whatever reason. Which topbeat is affected, and when, is totally down to luck.

Have you adapted bulk_max_size? The default value might be too big, making topbeat send too many events per request, followed by a timeout.

Options to play with:

  • reduce bulk_max_size (the bigger the bulk size, the more likely we will see timeouts)
  • increase/decrease timeout
  • set max_retries = 0 (this will drop failed publish events immediately)
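
As a sketch, those three knobs go in the logstash output section of topbeat.yml; the values below are examples only, and the host is a placeholder:

```yaml
output:
  logstash:
    hosts: ["logstash.example.com:5044"]
    bulk_max_size: 20   # smaller batches; fewer events hit the breaker at once
    timeout: 60         # seconds to wait for an ACK (default is 30)
    max_retries: 0      # drop failed publish attempts immediately
```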

I think when you try to kill it, topbeat still hangs in the output loop + sleep. Due to queues filling up, the main loop seems to hang too. I think the shutdown sequence can still be improved. The workaround is to always kill topbeat with '-9' (adapt your init script). Agreed, not optimal, but fortunately topbeat does not require special shutdown handling.

Can you share your complete topbeat config?

You can try running with -d 'logstash'. Some messages you might see in syslog or on stderr:

  • %v events out of %v events sent to logstash. Continue sending ...
  • Try to publish %v events to logstash with window size %v

If sending failed it will print '0 events out of ... events sent to logstash'.

Thanks. I think topbeat going into a state where it requires kill -9 is a bug. It should never reach that state. Maybe fixing that part would fix this recovery problem as well.

Yes, it's a bug. But it's not critical and will not lead to data loss. Shutdown is blocked by the loop trying to recover.

As long as you get the 'INFO send fail' and 'backoff retry' messages in the logs regularly, recovery is in progress. The problem is logstash not accepting connections, or the connection being killed/timed out.

Actually, logstash is accepting connections, because other servers are able to send events. It is the topbeat instance stuck in the loop that is unable to recover, even after several days. In the meantime, logstash and ES have recovered, and I have restarted them several times. But the topbeat in the hung state never recovers.

More info from the log after enabling debug:

2015/12/24 13:39:02.410085 client.go:79: DBG  %!s(int=0) events out of %!s(int=52) events sent to logstash. Continue sending ...
2015/12/24 13:39:02.410272 client.go:96: DBG  Try to publish %!s(int=52) events to logstash with window size %!s(int=1)
2015/12/24 13:39:32.414539 client.go:79: DBG  %!s(int=0) events out of %!s(int=52) events sent to logstash. Continue sending ...
2015/12/24 13:39:32.414710 client.go:96: DBG  Try to publish %!s(int=52) events to logstash with window size %!s(int=1)
2015/12/24 13:40:02.419098 client.go:79: DBG  %!s(int=0) events out of %!s(int=52) events sent to logstash. Continue sending ...
2015/12/24 13:40:02.419174 client.go:96: DBG  Try to publish %!s(int=52) events to logstash with window size %!s(int=1)
2015/12/24 13:40:32.421813 client.go:79: DBG  %!s(int=0) events out of %!s(int=52) events sent to logstash. Continue sending ...
2015/12/24 13:40:32.421963 client.go:96: DBG  Try to publish %!s(int=52) events to logstash with window size %!s(int=1)
2015/12/24 13:41:02.426085 client.go:79: DBG  %!s(int=0) events out of %!s(int=52) events sent to logstash. Continue sending ...
2015/12/24 13:41:02.426241 client.go:96: DBG  Try to publish %!s(int=52) events to logstash with window size %!s(int=1)

The log shows it is sending 0 events.

Thanks for the debug output.

The problem is that logstash did not ACK the events in time, possibly due to the connection being lost mid-transfer. You can try to increase the timeout.

The window size staying at 1 the whole time is a good indicator that the connection was never in a good state. The beats implement slow-start with increasing window sizes. The window size starts at 1, that is, only one event out of the 52 is sent to logstash (like checking for healthiness), but the send or ACK receive operation timed out (the default timeout for logstash is 30s). In the worst case logstash might close the connection, but the TCP connection is left in a bad state (the beat never sees the close and treats it as a timeout) due to some proxy/NAT.

After 3 consecutive timeout errors, the connection is closed by topbeat and an error is returned, breaking the lowest-level sending loop. If any error other than a timeout is detected, the sending loop is stopped immediately.

After 3 timeouts, the connection mode handler (which tries to reconnect) takes over. This is when you see the "INFO send fail" and "INFO backoff retry: 8s" messages. The handler closes the connection, waits for backoff seconds, reconnects and tries to send the events one more time (again starting with only one event, to check the healthiness of the connection). After max_retries + 1 sending attempts, the events are dropped and the next batch in the queue is tried.
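
As a toy sketch of that slow-start behavior (assuming the window doubles after each successful batch and is capped at bulk_max_size; the actual growth policy is in the beats source):

```shell
# Toy model of slow-start window growth, NOT the actual beats code.
# Assumption: window doubles per successful batch, capped at bulk_max_size.
window=1
bulk_max_size=50
for batch in 1 2 3 4 5 6 7; do
  echo "batch $batch: window size $window"
  window=$((window * 2))
  if [ "$window" -gt "$bulk_max_size" ]; then window=$bulk_max_size; fi
done
```

A window that never leaves 1, as in the debug output you posted, means no batch ever succeeded.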

I think the debug/info output in the logstash output plugin must be improved, so we can track down the actual cause of failed sending attempts. I cannot tell, for example, whether an error is due to a timeout or a connection loss (i.e. the connection being closed by logstash). Will do so in the next days.

As long as you don't see "Connecting error publishing events" in your logs, reconnecting on the TCP/TLS level always succeeds. The connection might still be closed by logstash's circuit breaker, though.

You can try to increase the timeout (e.g. to 120 seconds?). As long as 0 events are sent with a window size of 1, there was never a "stable" connection (one not being lost, with all actions executed within the configured time window) from topbeat to logstash.

It is this reconnect loop + sleeping for backoff seconds that blocks the topbeat shutdown process, requiring you to kill with -9 (which I agree is a bug). But topbeat itself is still progressing.

I have put this in a cron job that runs every 10 minutes and restarts topbeat if it has not sent events to logstash for 10 minutes. So far this is working:

# Timestamp from 10 minutes ago, in topbeat's log format
dd=`date -d'now-10 minutes' +%Y/%m/%d\ %H:%M:%S`
# Look for a recent successful send: a 'sent to logstash' debug line with a
# non-zero event count (requires running topbeat with -d 'logstash')
tail -n 1000 topbeat.log | awk -vDate="$dd" '$1" "$2 > Date {print $0}' | grep 'sent to logstash' | grep -v 'int=0'
if [ $? -eq 0 ]
then
        echo "Topbeat sending events successfully"
else
        echo "Kill existing topbeat"
        # use -9, since a hung topbeat ignores a plain kill
        ps aux | grep topbeat.yml | grep -v "grep" | awk '{print $2}' | xargs kill -9
        echo "Restart topbeat"
        nohup ./run.sh > topbeat.log &
fi
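
The awk filter works because topbeat's YYYY/MM/DD HH:MM:SS timestamps sort lexicographically, so a plain string comparison finds entries newer than the cutoff. A quick demonstration on fabricated log lines:

```shell
# Demonstrate the timestamp string comparison on made-up log lines
printf '2015/12/24 13:39:02 old entry\n2015/12/24 13:55:10 new entry\n' > /tmp/demo.log
cutoff='2015/12/24 13:45:00'
# prints only the "new entry" line, since it is later than the cutoff
awk -vDate="$cutoff" '$1" "$2 > Date {print $0}' /tmp/demo.log
```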

Oh, I just found we do report errors:

INFO Error publishing events (retrying): ...

If it says EOF, the TCP connection was closed either by logstash or by some other network equipment (e.g. a firewall)?

Still, the debug output should be improved so we can tell in which phase the connection was lost.

Nice. I would love to see topbeat.log after some restarts (can you use log rotation to keep the last N topbeat.log files)?