Not all metricset events are captured

Hi There,

We are using a Metricbeat, Logstash, and Elasticsearch combination to gather system metrics. Last night we had an issue with a VM where CPU utilization pegged at 100%, and that in turn had some kind of effect on the Metricbeat agent: data for only some metricsets was captured and sent across, but not for all of them. Can someone help me understand why that would happen, and whether there is a way I can make sure that events for all metricsets are captured at all times?

Config:

```yaml
module: system
metricsets:
  # CPU stats
  - cpu
  # System Load stats
  - load
  # Per CPU core stats
  #- core
  # IO stats
  - diskio
  # Per filesystem stats
  - filesystem
  # File system summary stats
  #- fsstat
  # Memory stats
  - memory
  # Network stats
  - network
  # Per process stats
  - process
  # Sockets (linux only)
  #- socket
enabled: true
period: 300s
```

Logs:

Good:

```
2017-11-09T01:01:30-07:00 INFO Non-zero metrics in the last 30s: fetches.system-cpu.events=1 fetches.system-cpu.success=1 fetches.system-diskio.events=17 fetches.system-diskio.success=1 fetches.system-filesystem.events=16 fetches.system-filesystem.success=1 fetches.system-load.events=1 fetches.system-load.success=1 fetches.system-memory.events=1 fetches.system-memory.success=1 fetches.system-network.events=2 fetches.system-network.success=1 fetches.system-process.events=183 fetches.system-process.success=1 libbeat.logstash.call_count.PublishEvents=3 libbeat.logstash.publish.read_bytes=36 libbeat.logstash.publish.write_bytes=17329 libbeat.logstash.publish.write_errors=2 libbeat.logstash.published_and_acked_events=221 libbeat.logstash.published_but_not_acked_events=442 libbeat.publisher.messages_in_worker_queues=221 libbeat.publisher.published_events=221
```

Bad:

```
2017-11-09T01:22:03-07:00 INFO Non-zero metrics in the last 30s: fetches.system-cpu.events=1 fetches.system-cpu.success=1 fetches.system-filesystem.events=5 fetches.system-filesystem.success=1 fetches.system-load.events=1 fetches.system-load.success=1 fetches.system-memory.events=1 fetches.system-memory.success=1 fetches.system-network.events=2 fetches.system-network.success=1 fetches.system-process.events=1 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_errors=1 libbeat.logstash.published_and_acked_events=10 libbeat.logstash.published_but_not_acked_events=31 libbeat.publisher.messages_in_worker_queues=8 libbeat.publisher.published_events=8
2017-11-09T01:22:38-07:00 INFO Non-zero metrics in the last 30s: fetches.system-process.events=2 libbeat.logstash.publish.read_bytes=6 libbeat.logstash.publish.write_bytes=1462 libbeat.logstash.publish.write_errors=1 libbeat.logstash.published_and_acked_events=54 libbeat.logstash.published_but_not_acked_events=2 libbeat.publisher.messages_in_worker_queues=5 libbeat.publisher.published_events=5
2017-11-09T01:23:40-07:00 INFO Non-zero metrics in the last 30s: fetches.system-filesystem.events=1 fetches.system-process.events=1 libbeat.logstash.call_count.PublishEvents=2 libbeat.logstash.publish.read_bytes=12 libbeat.logstash.publish.write_bytes=453 libbeat.logstash.published_and_acked_events=2 libbeat.publisher.messages_in_worker_queues=1 libbeat.publisher.published_events=1
2017-11-09T01:25:42-07:00 INFO Non-zero metrics in the last 30s: fetches.system-filesystem.events=10 fetches.system-process.events=21 libbeat.logstash.publish.write_errors=2 libbeat.logstash.published_and_acked_events=18 libbeat.logstash.published_but_not_acked_events=32 libbeat.publisher.messages_in_worker_queues=35 libbeat.publisher.published_events=35
2017-11-09T01:25:42-07:00 INFO No non-zero metrics in the last 30s
2017-11-09T01:26:47-07:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=2 libbeat.logstash.publish.read_bytes=6 libbeat.logstash.publish.write_bytes=700 libbeat.logstash.publish.write_errors=1 libbeat.logstash.published_and_acked_events=8 libbeat.logstash.published_but_not_acked_events=5
2017-11-09T01:27:08-07:00 INFO Non-zero metrics in the last 30s: fetches.system-cpu.events=1 fetches.system-cpu.success=1 fetches.system-filesystem.events=16 fetches.system-filesystem.success=1 fetches.system-load.events=1 fetches.system-load.success=1 fetches.system-memory.events=1 fetches.system-memory.success=1 fetches.system-network.events=2 fetches.system-network.success=1 libbeat.logstash.call_count.PublishEvents=4 libbeat.logstash.publish.read_bytes=12 libbeat.logstash.publish.write_bytes=1349 libbeat.logstash.publish.write_errors=1 libbeat.logstash.published_and_acked_events=6 libbeat.logstash.published_but_not_acked_events=19 libbeat.outputs.messages_dropped=1 libbeat.publisher.messages_in_worker_queues=21 libbeat.publisher.published_events=21
2017-11-09T01:27:30-07:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=54 libbeat.logstash.publish.write_bytes=16220 libbeat.logstash.published_and_acked_events=144
2017-11-09T01:28:47-07:00 INFO No non-zero metrics in the last 30s
2017-11-09T01:29:00-07:00 INFO No non-zero metrics in the last 30s
2017-11-09T01:29:30-07:00 INFO No non-zero metrics in the last 30s
2017-11-09T01:32:20-07:00 INFO Non-zero metrics in the last 30s: fetches.system-filesystem.events=16 fetches.system-filesystem.success=1 libbeat.publisher.messages_in_worker_queues=16 libbeat.publisher.published_events=16
```

As can be seen in the logs, I do not see any entry for "fetches.system-process.success" in the bad log entries. Does that mean the data was captured but could not be transmitted, or that the data itself was never captured? Also, are there any options available to avoid getting into this situation?

Which metricbeat version are you using and which OS?

What was the reason the system was so busy? You can see in the logs libbeat.logstash.published_but_not_acked_events=19 and publish errors further down. So it seems the network was overloaded / Logstash was not available, and the events were dropped.

Could you please also format your post above with backticks to make it more readable and preserve indentation?

Thank you for the info, Ruflin. I am using metricbeat-5.6.2, and the OS is Red Hat Enterprise Linux Server release 6.9 (Santiago).

As to why the system was busy, we had a runaway system process that was causing CPU contention on the VM. That is when I looked into the Metricbeat data to identify the process and found that the information was missing.

So in situations like this, is there a way we can queue up the events and avoid losing the data?

Thanks,

CK

We're working on spooling events to disk, so if you encounter network errors, all metrics would be buffered on disk. A workaround one can implement today is to use the file output in Metricbeat and collect the resulting documents with Filebeat.
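A minimal sketch of that workaround, assuming a 5.x-style configuration; the spool directory, filename, and Logstash host below are placeholders, and rotation sizes are illustrative:

```yaml
# metricbeat.yml -- write events as JSON lines to local disk instead of
# sending them to Logstash (in 5.x only one output can be enabled at a time).
output.file:
  path: "/var/lib/metricbeat/spool"   # placeholder directory
  filename: metricbeat-events
  rotate_every_kb: 10240              # illustrative rotation size
  number_of_files: 7

# filebeat.yml -- tail the spooled files and forward them to Logstash.
filebeat.prospectors:
  - input_type: log
    paths:
      - /var/lib/metricbeat/spool/metricbeat-events*
    json.keys_under_root: true        # decode the JSON documents written by Metricbeat
    json.add_error_key: true

output.logstash:
  hosts: ["logstash.example.com:5044"]   # placeholder host
```

Because Filebeat tracks its read offsets in its registry and retries on its own, events already written to disk survive a Logstash outage and are shipped once the connection recovers.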

@steffens In case he updates to 6.0, could he use retries: -1 or at least set it to a pretty big value? Then events would queue up in memory for now?

The events would queue up in memory, but only up to queue.mem.events. After that Metricbeat is blocked, i.e. it stops collecting data once the queue is full.
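For reference, a sketch of how that in-memory buffer is sized in a 6.x metricbeat.yml; the event count is illustrative, and the Logstash host is a placeholder. The retry setting mentioned above would be configured on the output, with the exact option name depending on the output and version:

```yaml
# metricbeat.yml (6.x sketch) -- enlarge the in-memory event queue so short
# Logstash outages can be bridged; collection blocks once the queue is full.
queue.mem:
  events: 16384          # illustrative value; the 6.x default is 4096

output.logstash:
  hosts: ["logstash.example.com:5044"]   # placeholder host
```

Each buffered event costs memory, so the practical ceiling depends on the average event size (the process metricset typically produces the largest events), which is what the memory testing mentioned below would establish.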

Thank you @steffens and @ruflin for sharing that knowledge. It is very encouraging to hear that work is underway on spooling events to disk in bottleneck situations. At this point, I will do some testing to see how much memory it would take to queue up events in memory and make some decisions accordingly.

Thanks,

Vishnu

