Metricbeat to Logstash TCP i/o errors - 5.4.1

Hi There,

I have been investigating this issue for a while now. I have seen lots of topics on here about it, but so far none of the suggestions work for me; they concern either older Beats versions or a different architecture/set of components.

The scenario I'm using to test Metricbeat is as follows:

  1. Metricbeat deployed on 16 servers/VMs.
  2. Data is sent to LS with no filtering, then output directly to ES.
  3. In Kibana I test by isolating the CPU stats, which I expect to give an equal number of events for each VM pair running the same type of application. Unfortunately it doesn't, and looking at the Beat logs I see lots of this:
2017-06-19T16:08:26-04:00 ERR Failed to publish events caused by: read tcp ip_deleted:47900->ip_deleted:5044: i/o timeout
2017-06-19T16:08:26-04:00 INFO Error publishing events (retrying): read tcp ip_deleted:47900->ip_deleted:5044: i/o timeout

Also,

2017-06-19T15:47:56-05:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=6 libbeat.logstash.publish.read_errors=1 libbeat.logstash.publish.write_bytes=997 libbeat.logstash.published_and_acked_events=3 libbeat.logstash.published_but_not_acked_events=1602

Is that a high number for published_but_not_acked_events?

I'm not sure at this point what the main issue is, or what the best way would be to decrease the number of i/o timeouts to LS. LS is not doing anything at this point: no filtering and no field mutation.

  • Could this be related to ES in any way: queue size, etc.?
  • Is this related to the pipeline size on LS, i.e. the number of events/VMs congesting LS? Marvel/Kibana monitoring shows only 2% CPU usage, so I assume processing power is fine.
  • Or are there Beat and LS settings I am missing here to handle this volume of data or the i/o issues?

When I run Metricbeat with the defaults or with the suggested fixes and set Kibana to display data for the last 15 minutes, I still see a big difference in event count for CPU metrics between two VMs with the same specs and the same application/process types, something like 10 to 1. Sometimes one of them even shows zero events for the last 15 minutes!

I appreciate any help, as I have been looking at this for a while now and am not sure what I am missing here.

Logstash version?

The error indicates a timeout while Metricbeat is waiting for an ACK or heartbeat from Logstash. While the Beat is waiting for an ACK, Logstash sends a heartbeat signal (about every 5 seconds, I think). The timeout in the Beat (30 seconds by default) is reset upon receiving the heartbeat. Thanks to the heartbeat signal, the Beat continues waiting for Logstash in case Logstash is itself stuck waiting for its outputs to process the data. You can increase the timeout setting in the Beat, but the connection timing out might also indicate a networking problem (e.g. check with tcpdump if/when anything is actually sent between the Beat and Logstash).
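
In case it helps, here is a minimal sketch of where that timeout lives in metricbeat.yml on 5.x (the host is a placeholder and 90 is just an example value, not a recommendation):

output.logstash:
  hosts: ["logstash-host:5044"]   # placeholder host
  # Seconds to wait for an ACK/heartbeat from Logstash before the connection
  # is considered timed out; the default is 30.
  timeout: 90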

@steffens Thanks for your response.
Our cluster uses 5.4.1 for all the Elastic products: Beats, LS, ES, and Kibana.

So, based on what you have read in my post, do you think what we are trying to check for is valid and expected?
Here is a quick summary of the scenario:
For example, if we have 10 servers and we look in Kibana at a pair of identical servers, should we expect the same (or very close) number of CPU events for each of the two servers over the past 15 minutes or so?

We are planning to use Metricbeat in many areas here and would like to understand its behavior in detail if the above is not a valid test.

So, based on what you have read in my post, do you think what we are trying to check for is valid and expected?

No idea. If events get published later and Metricbeat does not block, it's somewhat OK-ish. But still, it seems you have had a network failure. The effects of a network failure depend on the Metricbeat configuration and on how long it takes to clear the queues again (hopefully without any further failures).

Metricbeat does retry sending events on failure. The default retry count is 3. On failure, Metricbeat uses exponential backoff (starting at 1 second) before trying to send again. Depending on the number of failures, timing, and so on, you might end up with duplicate data (due to retries). You can configure infinite retry by setting the retry count to -1. Depending on queue sizes, the publisher pipeline blocks once the queue is full. Once the queue blocks, the Metricbeat modules cannot publish any more events -> the modules are blocked as well. If the blockage lasts too long, you might miss data. For this use case we're thinking of adding a disk-based spooler to Metricbeat, to overcome long network outages. That is, queue_size, timeout, and retries are some of the settings you can tune in Beats.
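
As a sketch only (these are the 5.x option names as I understand them; please verify against the docs for your version), the knobs mentioned above look roughly like this in metricbeat.yml:

queue_size: 2000                  # internal publisher queue; the 5.x default is 1000

output.logstash:
  hosts: ["logstash-host:5044"]   # placeholder host
  timeout: 90                     # seconds to wait for an ACK/heartbeat (default 30)
  max_retries: -1                 # a negative value retries until events are published (default 3)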

In LS, you can enable the persistent queue as well. This setting stores events to disk before processing continues. Storing events to disk early gives LS a chance to report the ACK more promptly.
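
For reference, the persistent queue is enabled in logstash.yml; a minimal sketch (the size and path are illustrative, not recommendations):

queue.type: persisted                  # the default is 'memory'
queue.max_bytes: 4gb                   # on-disk capacity before LS applies back-pressure (default 1gb)
path.queue: /var/lib/logstash/queue    # illustrative; defaults to a 'queue' dir under path.data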

Another problem you might be facing is with Elasticsearch, e.g. whether it is sized correctly for the number of events you want to index.

While some counter measures (namely buffering) can be helpful for dealing with outages, you should concentrate on finding the root cause.

Metricbeat sends events in short bursts (depending on the module period configuration). Averaging CPU usage in Metricbeat/LS over a long enough period, you will not see any shortage of CPU. Plus, if LS is blocked by its outputs, it basically grinds to a halt.
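
For context, the burst interval is the period setting on each module in metricbeat.yml; a minimal sketch with the system module (10s is the shipped default, shown here only for illustration):

metricbeat.modules:
  - module: system
    metricsets: ["cpu", "memory", "process"]
    period: 10s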

When I run Metricbeat with the defaults or with the suggested fixes and set Kibana to display data for the last 15 minutes, I still see a big difference in event count for CPU metrics between two VMs with the same specs and the same application/process types, something like 10 to 1. Sometimes one of them even shows zero events for the last 15 minutes!

Well, with multiple servers pushing at once it is very hard to debug this, especially remotely. Try to reduce and narrow down the setup to find the causes, e.g. what happens with fewer Beats sending data? Is it always the same Beats/servers that show data loss? And so on...

As you don't do any processing in LS, why did you introduce LS?

Thank you for your response, I appreciate the details you have provided.
I will debug more with fewer servers and see what results I get.
I believe using the persistent queue is an option in this case.

Because we are going to introduce some additional fields at some point, at which point we will be adding more data.

@steffens, after spending some time over the past couple of days testing different scenarios, I thought I would share this, hoping to get some thoughts on why it is happening.

  • I was able to isolate the issue, and it's mainly between the LS output and the ES input. So I'm assuming at this point it's not a connection issue between the Beat and LS, based on the below.
  • I changed the data nodes' storage from NFS to local disk, then activated the queuing on LS; I was only able to get a steady stream of data for 15 minutes, then things started going super slow with TCP i/o errors. I can only see bursts of data every few minutes or so, to the point where my data in Kibana is an hour behind.
  • If I use the persistent queue (or memory) with output to a local file only on LS, everything works like a charm; looking at the Beat logs I see no TCP errors whatsoever between LS and Metricbeat. (The test pipeline is sketched below.)
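
For reference, the isolation test pipeline is roughly the following (the port and file path are just the values I use for this test, not recommendations):

input {
  beats {
    port => 5044
  }
}
output {
  # write to a local file only, no ES output, to take ES out of the picture
  file {
    path => "/tmp/metricbeat-test.log"
  }
}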

Unfortunately, ONLY the Beat log and Marvel (monitoring) are able to tell me what's going on.
Any idea what could be wrong on the ES side in this case, making things go so slowly?

Our cluster has 3 master nodes, 3 data nodes (with a 4th being added later), and two client/coordinating nodes.
Are there any special settings to make my data flow constant and fast enough?

This sounds like your current bottleneck is with ES and not Metricbeat or Logstash.

For the TCP i/o errors, I wonder about the root cause. The way Beats/LS are supposed to operate, you should not get any timeout errors, unless some intermediate networking device/firewall/NAT silently closed the connection (dropping state?).

Any other TCP i/o errors?

Does increasing the timeout in Beats help, or do you still get other i/o errors?

For ES performance questions, you are better off asking in the ES forums.

I changed the data nodes' storage from NFS to local disk, then activated the queuing on LS; I was only able to get a steady stream of data for 15 minutes, then things started going super slow with TCP i/o errors. I can only see bursts of data every few minutes or so, to the point where my data in Kibana is an hour behind.

You made many changes at once, which makes it hard to attribute any change or improvement to a single change applied. E.g., is the "works great for 15 minutes" due to the persistent queue storing events, and is the persistent queue full after that time, so that LS starts blocking? Optimally, queues operate in a near-empty state; this way the queue acts as a buffer. If the ingest rate is greater than the egest rate at all times, the queue will operate at full state, at which point it has to start blocking.
