Hi there,
We are using a Metricbeat, Logstash, and Elasticsearch combination to gather system metrics. Last night we had an issue with a VM where CPU utilization pegged at 100%, and that in turn had some kind of effect on the Metricbeat agent: data for certain metricsets was captured and sent across, but not for all metricsets. Can someone help me understand why that would happen, and whether there is a way I can make sure that events for all metricsets are captured at all times?
Config:
module: system
metricsets:
  # CPU stats
  - cpu
  # System load stats
  - load
  # Per-CPU core stats
  #- core
  # IO stats
  - diskio
  # Per-filesystem stats
  - filesystem
  # Filesystem summary stats
  #- fsstat
  # Memory stats
  - memory
  # Network stats
  - network
  # Per-process stats
  - process
  # Sockets (Linux only)
  #- socket
enabled: true
period: 300s
Logs:
Good:
2017-11-09T01:01:30-07:00 INFO Non-zero metrics in the last 30s: fetches.system-cpu.events=1 fetches.system-cpu.success=1 fetches.system-diskio.events=17 fetches.system-diskio.success=1 fetches.system-filesystem.events=16 fetches.system-filesystem.success=1 fetches.system-load.events=1 fetches.system-load.success=1 fetches.system-memory.events=1 fetches.system-memory.success=1 fetches.system-network.events=2 fetches.system-network.success=1 fetches.system-process.events=183 fetches.system-process.success=1 libbeat.logstash.call_count.PublishEvents=3 libbeat.logstash.publish.read_bytes=36 libbeat.logstash.publish.write_bytes=17329 libbeat.logstash.publish.write_errors=2 libbeat.logstash.published_and_acked_events=221 libbeat.logstash.published_but_not_acked_events=442 libbeat.publisher.messages_in_worker_queues=221 libbeat.publisher.published_events=221
Bad:
2017-11-09T01:22:03-07:00 INFO Non-zero metrics in the last 30s: fetches.system-cpu.events=1 fetches.system-cpu.success=1 fetches.system-filesystem.events=5 fetches.system-filesystem.success=1 fetches.system-load.events=1 fetches.system-load.success=1 fetches.system-memory.events=1 fetches.system-memory.success=1 fetches.system-network.events=2 fetches.system-network.success=1 fetches.system-process.events=1 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_errors=1 libbeat.logstash.published_and_acked_events=10 libbeat.logstash.published_but_not_acked_events=31 libbeat.publisher.messages_in_worker_queues=8 libbeat.publisher.published_events=8
2017-11-09T01:22:38-07:00 INFO Non-zero metrics in the last 30s: fetches.system-process.events=2 libbeat.logstash.publish.read_bytes=6 libbeat.logstash.publish.write_bytes=1462 libbeat.logstash.publish.write_errors=1 libbeat.logstash.published_and_acked_events=54 libbeat.logstash.published_but_not_acked_events=2 libbeat.publisher.messages_in_worker_queues=5 libbeat.publisher.published_events=5
2017-11-09T01:23:40-07:00 INFO Non-zero metrics in the last 30s: fetches.system-filesystem.events=1 fetches.system-process.events=1 libbeat.logstash.call_count.PublishEvents=2 libbeat.logstash.publish.read_bytes=12 libbeat.logstash.publish.write_bytes=453 libbeat.logstash.published_and_acked_events=2 libbeat.publisher.messages_in_worker_queues=1 libbeat.publisher.published_events=1
2017-11-09T01:25:42-07:00 INFO Non-zero metrics in the last 30s: fetches.system-filesystem.events=10 fetches.system-process.events=21 libbeat.logstash.publish.write_errors=2 libbeat.logstash.published_and_acked_events=18 libbeat.logstash.published_but_not_acked_events=32 libbeat.publisher.messages_in_worker_queues=35 libbeat.publisher.published_events=35
2017-11-09T01:25:42-07:00 INFO No non-zero metrics in the last 30s
2017-11-09T01:26:47-07:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=2 libbeat.logstash.publish.read_bytes=6 libbeat.logstash.publish.write_bytes=700 libbeat.logstash.publish.write_errors=1 libbeat.logstash.published_and_acked_events=8 libbeat.logstash.published_but_not_acked_events=5
2017-11-09T01:27:08-07:00 INFO Non-zero metrics in the last 30s: fetches.system-cpu.events=1 fetches.system-cpu.success=1 fetches.system-filesystem.events=16 fetches.system-filesystem.success=1 fetches.system-load.events=1 fetches.system-load.success=1 fetches.system-memory.events=1 fetches.system-memory.success=1 fetches.system-network.events=2 fetches.system-network.success=1 libbeat.logstash.call_count.PublishEvents=4 libbeat.logstash.publish.read_bytes=12 libbeat.logstash.publish.write_bytes=1349 libbeat.logstash.publish.write_errors=1 libbeat.logstash.published_and_acked_events=6 libbeat.logstash.published_but_not_acked_events=19 libbeat.outputs.messages_dropped=1 libbeat.publisher.messages_in_worker_queues=21 libbeat.publisher.published_events=21
2017-11-09T01:27:30-07:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=54 libbeat.logstash.publish.write_bytes=16220 libbeat.logstash.published_and_acked_events=144
2017-11-09T01:28:47-07:00 INFO No non-zero metrics in the last 30s
2017-11-09T01:29:00-07:00 INFO No non-zero metrics in the last 30s
2017-11-09T01:29:30-07:00 INFO No non-zero metrics in the last 30s
2017-11-09T01:32:20-07:00 INFO Non-zero metrics in the last 30s: fetches.system-filesystem.events=16 fetches.system-filesystem.success=1 libbeat.publisher.messages_in_worker_queues=16 libbeat.publisher.published_events=16
As can be seen in the logs, I do not see any entry for "fetches.system-process.success" in the bad log entries. Does that mean the data was captured but could not be transmitted, or that the data itself was never captured? Also, are there any options available to avoid ending up in this situation?