Not all metricset events are captured

Hi There,

We are using a Metricbeat, Logstash, and Elasticsearch combination to gather system metrics. Last night we had an issue with a VM where CPU utilization pegged at 100%, and that in turn had some kind of effect on the Metricbeat agent: data for only some metricsets was captured and sent across, but not for all of them. Can someone help me understand why that would happen, and whether there is a way I can make sure that events for all metricsets are captured at all times?

Config:

```yaml
module: system
metricsets:
  # CPU stats
  - cpu
  # System Load stats
  - load
  # Per CPU core stats
  #- core
  # IO stats
  - diskio
  # Per filesystem stats
  - filesystem
  # File system summary stats
  #- fsstat
  # Memory stats
  - memory
  # Network stats
  - network
  # Per process stats
  - process
  # Sockets (linux only)
  #- socket
enabled: true
period: 300s
```

Logs:

Good:

```
2017-11-09T01:01:30-07:00 INFO Non-zero metrics in the last 30s: fetches.system-cpu.events=1 fetches.system-cpu.success=1 fetches.system-diskio.events=17 fetches.system-diskio.success=1 fetches.system-filesystem.events=16 fetches.system-filesystem.success=1 fetches.system-load.events=1 fetches.system-load.success=1 fetches.system-memory.events=1 fetches.system-memory.success=1 fetches.system-network.events=2 fetches.system-network.success=1 fetches.system-process.events=183 fetches.system-process.success=1 libbeat.logstash.call_count.PublishEvents=3 libbeat.logstash.publish.read_bytes=36 libbeat.logstash.publish.write_bytes=17329 libbeat.logstash.publish.write_errors=2 libbeat.logstash.published_and_acked_events=221 libbeat.logstash.published_but_not_acked_events=442 libbeat.publisher.messages_in_worker_queues=221 libbeat.publisher.published_events=221
```

Bad:

```
2017-11-09T01:22:03-07:00 INFO Non-zero metrics in the last 30s: fetches.system-cpu.events=1 fetches.system-cpu.success=1 fetches.system-filesystem.events=5 fetches.system-filesystem.success=1 fetches.system-load.events=1 fetches.system-load.success=1 fetches.system-memory.events=1 fetches.system-memory.success=1 fetches.system-network.events=2 fetches.system-network.success=1 fetches.system-process.events=1 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_errors=1 libbeat.logstash.published_and_acked_events=10 libbeat.logstash.published_but_not_acked_events=31 libbeat.publisher.messages_in_worker_queues=8 libbeat.publisher.published_events=8
2017-11-09T01:22:38-07:00 INFO Non-zero metrics in the last 30s: fetches.system-process.events=2 libbeat.logstash.publish.read_bytes=6 libbeat.logstash.publish.write_bytes=1462 libbeat.logstash.publish.write_errors=1 libbeat.logstash.published_and_acked_events=54 libbeat.logstash.published_but_not_acked_events=2 libbeat.publisher.messages_in_worker_queues=5 libbeat.publisher.published_events=5
2017-11-09T01:23:40-07:00 INFO Non-zero metrics in the last 30s: fetches.system-filesystem.events=1 fetches.system-process.events=1 libbeat.logstash.call_count.PublishEvents=2 libbeat.logstash.publish.read_bytes=12 libbeat.logstash.publish.write_bytes=453 libbeat.logstash.published_and_acked_events=2 libbeat.publisher.messages_in_worker_queues=1 libbeat.publisher.published_events=1
2017-11-09T01:25:42-07:00 INFO Non-zero metrics in the last 30s: fetches.system-filesystem.events=10 fetches.system-process.events=21 libbeat.logstash.publish.write_errors=2 libbeat.logstash.published_and_acked_events=18 libbeat.logstash.published_but_not_acked_events=32 libbeat.publisher.messages_in_worker_queues=35 libbeat.publisher.published_events=35
2017-11-09T01:25:42-07:00 INFO No non-zero metrics in the last 30s
2017-11-09T01:26:47-07:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=2 libbeat.logstash.publish.read_bytes=6 libbeat.logstash.publish.write_bytes=700 libbeat.logstash.publish.write_errors=1 libbeat.logstash.published_and_acked_events=8 libbeat.logstash.published_but_not_acked_events=5
2017-11-09T01:27:08-07:00 INFO Non-zero metrics in the last 30s: fetches.system-cpu.events=1 fetches.system-cpu.success=1 fetches.system-filesystem.events=16 fetches.system-filesystem.success=1 fetches.system-load.events=1 fetches.system-load.success=1 fetches.system-memory.events=1 fetches.system-memory.success=1 fetches.system-network.events=2 fetches.system-network.success=1 libbeat.logstash.call_count.PublishEvents=4 libbeat.logstash.publish.read_bytes=12 libbeat.logstash.publish.write_bytes=1349 libbeat.logstash.publish.write_errors=1 libbeat.logstash.published_and_acked_events=6 libbeat.logstash.published_but_not_acked_events=19 libbeat.outputs.messages_dropped=1 libbeat.publisher.messages_in_worker_queues=21 libbeat.publisher.published_events=21
2017-11-09T01:27:30-07:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=54 libbeat.logstash.publish.write_bytes=16220 libbeat.logstash.published_and_acked_events=144
2017-11-09T01:28:47-07:00 INFO No non-zero metrics in the last 30s
2017-11-09T01:29:00-07:00 INFO No non-zero metrics in the last 30s
2017-11-09T01:29:30-07:00 INFO No non-zero metrics in the last 30s
2017-11-09T01:32:20-07:00 INFO Non-zero metrics in the last 30s: fetches.system-filesystem.events=16 fetches.system-filesystem.success=1 libbeat.publisher.messages_in_worker_queues=16 libbeat.publisher.published_events=16
```

As can be seen in the logs, I do not see any entry for "fetches.system-process.success" in the bad log entries. Does that mean the data was captured but could not be transmitted, or that the data itself was never captured? Also, are there any options available to avoid getting into this situation?

Which metricbeat version are you using and which OS?

What was the reason the system was so busy? You can see in the logs libbeat.logstash.published_but_not_acked_events=19 and publish errors further down. So it seems the network was overloaded / Logstash was not available, and the events were dropped.

Could you please also format your post above with backticks to make it more readable and preserve indentation?

Thank you for the info, Ruflin. I am using metricbeat-5.6.2, and the OS is Red Hat Enterprise Linux Server release 6.9 (Santiago).

As to why the system was busy, we had a runaway system process that was causing CPU contention on the VM. That is when I looked into the Metricbeat data to identify the process and found that the information was missing.

So in situations like this, is there a way we can queue up the events and avoid losing the data?

Thanks,

CK

We're working on spooling events to disk, so if you encounter network errors, all metrics would be buffered on disk. A workaround one can implement today is to use the file output in Metricbeat and collect the resulting documents with Filebeat.
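A minimal sketch of that workaround, assuming a 5.x-style configuration; the spool directory, filename, and Logstash host below are placeholders, and rotation sizes are illustrative:

```yaml
# metricbeat.yml -- write events as JSON lines to local disk instead of
# sending them to Logstash (in 5.x only one output can be enabled at a time).
output.file:
  path: "/var/lib/metricbeat/spool"   # placeholder directory
  filename: metricbeat-events
  rotate_every_kb: 10240              # illustrative rotation size
  number_of_files: 7

# filebeat.yml -- tail the spooled files and forward them to Logstash.
filebeat.prospectors:
  - input_type: log
    paths:
      - /var/lib/metricbeat/spool/metricbeat-events*
    json.keys_under_root: true        # decode the JSON documents written by Metricbeat
    json.add_error_key: true

output.logstash:
  hosts: ["logstash.example.com:5044"]   # placeholder host
```

Because Filebeat tracks its read offsets in its registry and retries on its own, events already written to disk survive a Logstash outage and are shipped once the connection recovers.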

@steffens In case he updates to 6.0, could he use retries: -1 or at least set it to a pretty big value? Then events would queue up in memory for now?

The events would queue up in memory, but only up to queue.mem.events. After that Metricbeat is blocked, i.e. it stops collecting data once the queue is full.
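For reference, a sketch of how that in-memory buffer is sized in a 6.x metricbeat.yml; the event count is illustrative, and the Logstash host is a placeholder. The retry setting mentioned above would be configured on the output, with the exact option name depending on the output and version:

```yaml
# metricbeat.yml (6.x sketch) -- enlarge the in-memory event queue so short
# Logstash outages can be bridged; collection blocks once the queue is full.
queue.mem:
  events: 16384          # illustrative value; the 6.x default is 4096

output.logstash:
  hosts: ["logstash.example.com:5044"]   # placeholder host
```

Each buffered event costs memory, so the practical ceiling depends on the average event size (the process metricset typically produces the largest events), which is what the memory testing mentioned below would establish.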

Thank you @steffens and @ruflin for sharing that knowledge. It is very encouraging to hear that work is underway on spooling events to disk in bottleneck situations. At this point, I will do some testing to see how much memory it would take to queue up events in memory and make some decisions accordingly.

Thanks,

Vishnu

