Hi There,
I have been investigating this issue for a while and have read many topics about it here, but nothing has worked for me so far. The existing threads either concern an older Beats version or a different architecture/set of components.
The scenario I'm using to test Metricbeat is as follows:
- Metricbeat is deployed on 16 servers/VMs.
- Data is sent to LS with no filtering, then output directly to ES.
- In Kibana I test by isolating the CPU stats, which I expect to give an equal number of events for each pair of VMs running the same type of application. Unfortunately they don't, and looking at the Beat logs I can see lots of this:
2017-06-19T16:08:26-04:00 ERR Failed to publish events caused by: read tcp ip_deleted:47900->ip_deleted:5044: i/o timeout
2017-06-19T16:08:26-04:00 INFO Error publishing events (retrying): read tcp ip_deleted:47900->ip_deleted:5044: i/o timeout
Also,
2017-06-19T15:47:56-05:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=6 libbeat.logstash.publish.read_errors=1 libbeat.logstash.publish.write_bytes=997 libbeat.logstash.published_and_acked_events=3 libbeat.logstash.published_but_not_acked_events=1602
Is that a high number for published_but_not_acked_events?
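To put that line in proportion, here is a minimal Python sketch that pulls the counters out of the log line quoted above and compares acked with not-acked events. The helper names (`parse_metrics`, `acked_ratio`) are made up for illustration and are not part of Beats. It only compares the two counters from one 30s window and says nothing about whether the unacked events were later retried successfully.

```python
import re

# The "Non-zero metrics" line from the Metricbeat log, copied from above.
LOG_LINE = (
    "2017-06-19T15:47:56-05:00 INFO Non-zero metrics in the last 30s: "
    "libbeat.logstash.call_count.PublishEvents=1 "
    "libbeat.logstash.publish.read_bytes=6 "
    "libbeat.logstash.publish.read_errors=1 "
    "libbeat.logstash.publish.write_bytes=997 "
    "libbeat.logstash.published_and_acked_events=3 "
    "libbeat.logstash.published_but_not_acked_events=1602"
)


def parse_metrics(line):
    """Return {counter_name: int} for every name=value pair in the line."""
    return {name: int(value) for name, value in re.findall(r"([\w.]+)=(\d+)", line)}


def acked_ratio(metrics):
    """Fraction of published events that were acked, or None if none were published."""
    acked = metrics.get("libbeat.logstash.published_and_acked_events", 0)
    not_acked = metrics.get("libbeat.logstash.published_but_not_acked_events", 0)
    total = acked + not_acked
    return acked / total if total else None


metrics = parse_metrics(LOG_LINE)
ratio = acked_ratio(metrics)
print(f"acked ratio in this 30s window: {ratio:.4f}")
```

For this window that is 3 acked events out of 1605 published.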
At this point I am not sure what the main issue is, or what the best way would be to reduce the number of i/o timeouts to LS. LS is not doing any work here: no filtering and no field mutation.
- Could this be related to ES in any way (queue size, etc.)?
- Is this related to the pipeline size on LS? With this number of events/VMs, LS could be getting congested, but Marvel/Kibana monitoring shows only 2% CPU usage, so I assume processing power is fine.
- Or could there be Beats or LS settings that I am missing for handling this volume of data or these i/o issues?
When I run Metricbeat with the defaults or with the suggested fixes and set Kibana to display data for the last 15 minutes, I still see a big difference in the event count for CPU metrics between two VMs with the same specs and the same application/process types, something like 10 to 1. Sometimes one of them even shows zero events for the last 15 minutes!
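As a rough sanity check on those counts: if each host ships one CPU event per collection period, a 15-minute window should hold about 900 / period events per host. The Python sketch below assumes a 10-second period (the period is not stated in this post, so treat it as a placeholder). The host names and event lists are hypothetical sample data that mimic the 10-to-1 and zero-event cases, not real output.

```python
from collections import Counter

WINDOW_SECONDS = 15 * 60  # the "last 15 minutes" window in Kibana
PERIOD_SECONDS = 10       # ASSUMPTION: collection period, not stated in the post


def expected_events(window_seconds, period_seconds):
    """Events one host should produce if it ships one event per period."""
    return window_seconds // period_seconds


def underreporting_hosts(events, hosts, expected, tolerance=0.9):
    """Return {host: count} for hosts with fewer than tolerance * expected events."""
    counts = Counter(event["host"] for event in events)
    threshold = expected * tolerance
    # Counter returns 0 for hosts that sent nothing at all.
    return {host: counts[host] for host in hosts if counts[host] < threshold}


# Hypothetical sample: vm-a is healthy, vm-b is at roughly 10 to 1, vm-c is silent.
events = [{"host": "vm-a"}] * 90 + [{"host": "vm-b"}] * 9
hosts = ["vm-a", "vm-b", "vm-c"]

expected = expected_events(WINDOW_SECONDS, PERIOD_SECONDS)
gaps = underreporting_hosts(events, hosts, expected)
print(f"expected per host: {expected}, under-reporting: {gaps}")
```

Comparing each host against the expected count, rather than against its twin, shows whether one VM is losing events or both are.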
I would appreciate any help, as I have been looking at this for a while and am not sure what I am missing.