Metricsets are executed more than the period specified in config


(Hussain Kothari) #1

We are seeing a significant spike in data for Metrics. See the index size for today (only 8 hours in) and yesterday versus the largest other index on the box from last Thursday.

This is the monitoring for the metrics index from yesterday. You can see some significant growth in metric indexing starting around 3am and instances where there was 3x the normal indexing rate.
We have set the period in config to be 10sec but It seems the period interval of 10 secs is ignored by metricsbeats and it sends more data from each server (see pic).

metricbeats version: 5.2.0
elastic version : 5.6.0

This is the metricbeat config

metricbeat.modules:

  • module: system
    metricsets:

    • cpu

    • filesystem

    • fsstat

    • memory

    • network

    • process

    enabled: true
    period: 10s
    processes: ['.*']

fields:
environment: ${GMIServer}
platform: ${GMIPlatform}

output.kafka:
hosts: ["dummykafka:9092","dummykafka:9092"]

topic: 'MetricSets'

partition.round_robin:
reachable_only: false

required_acks: 1
compression: gzip
max_message_bytes: 1000000


(Adrian Serrano) #2

Some modules generate more than one event per period.

The process module will generate one event per each individual process. Is it possible that this spike in the number of indexed events corresponds with some spike of activity in the monitored systems?


(Hussain Kothari) #3

Yes im aware process module does create more than one, but this spike is much larger than the normal process events we get. And this is from all the systems we are monitoring. It seems unusual to see such sudden spikes when we have been using this for over a year and we never saw such behaviour.

After basic troubleshooting at our end we found that Beats is sending Duplicate Records to ES.

This image below show data for a single w3wp process on a single server for each interval it has 14 records (weird)

Not sure how or why Beats send same data multiple time which is causing a huge spike in Data.
Any idea why this behaviour by Beats?


(ruflin) #4

If I remember correctly Metricbeat will try to resend data 3 times if it wasn't delivery successfully. Having data more then once probably means something went wrong when sending the bulk request. Either there was a network issue or ES rejected it because it was overloaded.

Could you check the metricbeat and elasticsearch logs around this time?

I'm aware that you mention above 14 times repeated (and not 3 times as I would expect). Lets still check the logs to see if there is something interesting.

Any chance you could share the 14 identical events? (all of them for comparison)


(Hussain Kothari) #5

I checked the logs but got only one error which is repeated throughout the logs. Not sure if its related (see logs below).
Also i can share the 14 events but how do you want it in JSON or should i export it in Excel? (I'm not every familiar with querying ES, plus in JSON its more than 1K lines)

2018-04-24T04:54:02-05:00 INFO Non-zero metrics in the last 30s: fetches.system-fsstat.events=3 fetches.system-fsstat.success=3 fetches.system-network.events=3 fetches.system-filesystem.success=3 fetches.system-memory.success=3 fetches.system-process.success=3 libbeat.kafka.call_count.PublishEvents=6 libbeat.publisher.published_events=543 libbeat.kafka.published_and_acked_events=543 fetches.system-filesystem.events=6 libbeat.publisher.messages_in_worker_queues=543 fetches.system-process.events=525 fetches.system-cpu.success=3 fetches.system-memory.events=3 fetches.system-cpu.events=3 fetches.system-network.success=3
2018-04-24T04:54:29-05:00 ERR Error getting process details. pid=48848: error getting process arguments for pid=48848: ProcArgs failed for pid=48848: could not get Win32_Process WHERE ProcessId = 48848: Process not found
2018-04-24T04:54:32-05:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.call_count.PublishEvents=6 fetches.system-cpu.success=3 fetches.system-filesystem.events=6 fetches.system-fsstat.events=3 fetches.system-fsstat.success=3 fetches.system-memory.success=3 fetches.system-network.success=3 libbeat.publisher.messages_in_worker_queues=545 fetches.system-cpu.events=3 fetches.system-filesystem.success=3 fetches.system-process.success=3 fetches.system-network.events=3 fetches.system-memory.events=3 fetches.system-process.events=527 libbeat.publisher.published_events=545 libbeat.kafka.published_and_acked_events=545
2018-04-24T04:55:02-05:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.published_and_acked_events=544 fetches.system-memory.success=3 fetches.system-memory.events=3 fetches.system-filesystem.events=6 fetches.system-cpu.success=3 fetches.system-cpu.events=3 fetches.system-network.events=3 fetches.system-process.success=3 fetches.system-filesystem.success=3 fetches.system-fsstat.success=3 fetches.system-network.success=3 libbeat.publisher.messages_in_worker_queues=544 fetches.system-fsstat.events=3 fetches.system-process.events=526 libbeat.kafka.call_count.PublishEvents=6 libbeat.publisher.published_events=544
2018-04-24T04:55:32-05:00 INFO Non-zero metrics in the last 30s: fetches.system-filesystem.events=6 fetches.system-network.success=3 fetches.system-process.events=522 libbeat.kafka.call_count.PublishEvents=6 fetches.system-cpu.success=3 fetches.system-fsstat.success=3 fetches.system-memory.events=3 libbeat.kafka.published_and_acked_events=540 fetches.system-cpu.events=3 fetches.system-fsstat.events=3 fetches.system-memory.success=3 fetches.system-filesystem.success=3 fetches.system-network.events=3 fetches.system-process.success=3 libbeat.publisher.published_events=540 libbeat.publisher.messages_in_worker_queues=540


(Hussain Kothari) #6

Here are the 14 Identical Events

Time _id _index metricset.module metricset.name metricset.rtt system.process.pid system.process.ppid system.process.pgid system.process.name system.process.state system.process.memory.size system.process.memory.rss.pct system.process.memory.share system.process.memory.rss.bytes system.process.cpu.total.pct system.process.cpu.start_time type
April 24th 2018, 15:45:19.152 AWL3Lbuw5mla1HtJT7Ae appdata.metricbeat.process-2018.04.24 system process 86,000 3,172 2,452 0 w3wp.exe running 79,79,008 0.001 0 1,45,57,184 0 April 24th 2018, 13:52:36.782 metricsets
April 24th 2018, 15:45:19.152 AWL3KjqT8VmxIl-7IyEZ appdata.metricbeat.process-2018.04.24 system process 86,000 3,172 2,452 0 w3wp.exe running 79,79,008 0.001 0 1,45,57,184 0 April 24th 2018, 13:52:36.782 metricsets
April 24th 2018, 15:45:19.152 AWL3KKahRybLbSr_PL_- appdata.metricbeat.process-2018.04.24 system process 86,000 3,172 2,452 0 w3wp.exe running 79,79,008 0.001 0 1,45,57,184 0 April 24th 2018, 13:52:36.782 metricsets
April 24th 2018, 15:45:19.152 AWL3KbN48VmxIl-7IeDU appdata.metricbeat.process-2018.04.24 system process 86,000 3,172 2,452 0 w3wp.exe running 79,79,008 0.001 0 1,45,57,184 0 April 24th 2018, 13:52:36.782 metricsets
April 24th 2018, 15:45:19.152 AWL3L576RybLbSr_Ttqq appdata.metricbeat.process-2018.04.24 system process 86,000 3,172 2,452 0 w3wp.exe running 79,79,008 0.001 0 1,45,57,184 0 April 24th 2018, 13:52:36.782 metricsets
April 24th 2018, 15:45:19.152 AWL3KwZMmRCYGZTwQiA2 appdata.metricbeat.process-2018.04.24 system process 86,000 3,172 2,452 0 w3wp.exe running 79,79,008 0.001 0 1,45,57,184 0 April 24th 2018, 13:52:36.782 metricsets
April 24th 2018, 15:45:19.152 AWL3K2tpRybLbSr_Q4GH appdata.metricbeat.process-2018.04.24 system process 86,000 3,172 2,452 0 w3wp.exe running 79,79,008 0.001 0 1,45,57,184 0 April 24th 2018, 13:52:36.782 metricsets
April 24th 2018, 15:45:19.152 AWL3Lxx8mRCYGZTwTN_7 appdata.metricbeat.process-2018.04.24 system process 86,000 3,172 2,452 0 w3wp.exe running 79,79,008 0.001 0 1,45,57,184 0 April 24th 2018, 13:52:36.782 metricsets
April 24th 2018, 15:45:19.152 AWL3KTqARybLbSr_Pke9 appdata.metricbeat.process-2018.04.24 system process 86,000 3,172 2,452 0 w3wp.exe running 79,79,008 0.001 0 1,45,57,184 0 April 24th 2018, 13:52:36.782 metricsets
April 24th 2018, 15:45:19.152 AWL3LUBTxZ0mns6ZrBGm appdata.metricbeat.process-2018.04.24 system process 86,000 3,172 2,452 0 w3wp.exe running 79,79,008 0.001 0 1,45,57,184 0 April 24th 2018, 13:52:36.782 metricsets
April 24th 2018, 15:45:19.152 AWL3LNEr8VmxIl-7KYe3 appdata.metricbeat.process-2018.04.24 system process 86,000 3,172 2,452 0 w3wp.exe running 79,79,008 0.001 0 1,45,57,184 0 April 24th 2018, 13:52:36.782 metricsets
April 24th 2018, 15:45:19.152 AWL3K_dMxZ0mns6ZqMyo appdata.metricbeat.process-2018.04.24 system process 86,000 3,172 2,452 0 w3wp.exe running 79,79,008 0.001 0 1,45,57,184 0 April 24th 2018, 13:52:36.782 metricsets
April 24th 2018, 15:45:19.152 AWL3LE3rmRCYGZTwRVVj appdata.metricbeat.process-2018.04.24 system process 86,000 3,172 2,452 0 w3wp.exe running 79,79,008 0.001 0 1,45,57,184 0 April 24th 2018, 13:52:36.782 metricsets
April 24th 2018, 15:45:19.152 AWL3LiisRybLbSr_StEq appdata.metricbeat.process-2018.04.24 system process 86,000 3,172 2,452 0 w3wp.exe running 79,79,008 0.001 0 1,45,57,184 0 April 24th 2018, 13:52:36.782 metricsets

(ruflin) #7

Could you share the raw events in json. Then we would see the full details. Perhaps you can paste it into a gist as discuss will probably not take all of it.

That should also make it visible if it's related to the above error message / pid.

Two important infos I gathered:

  • This is on windows
  • The data goes through kafka (I missed that part before)

What does your full setup look like? It could well be that the data got duplicated somewhere else.


(Hussain Kothari) #8

Raw JSON Events

{
  "took": 35,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 14,
    "max_score": 16.051601,
    "hits": [
      {
        "_index": "dummyindex.metricbeat.process-2018.04.24",
        "_type": "metricsets",
        "_id": "AWL3KjqT8VmxIl-7IyEZ",
        "_version": 1,
        "_score": 16.051601,
        "_source": {
          "@timestamp": "2018-04-24T10:15:19.152Z",
          "system": {
            "process": {
              "cmdline": """C:\Windows\SysWOW64\inetsrv\w3wp.exe -ap "dummysite.com" -v "v2.0" -l "webengine4.dll" -a \\.\pipe\iisipm0c31c3d0-c542-4143-97ba-fc009da5e4f2 -h "D:\inetpub\temp\appPools\dummysite.com\dummysite.com.config" -w "" -m 0 -t 60""",
              "memory": {
                "share": 0,
                "rss": {
                  "pct": 0.0005,
                  "bytes": 14557184
                },
                "size": 7979008
              },
              "pgid": 0,
              "name": "w3wp.exe",
              "cpu": {
                "start_time": "2018-04-24T08:22:36.782Z",
                "total": {
                  "pct": 0
                }
              },
              "pid": 3172,
              "state": "running",
              "ppid": 2452,
              "username": "dummyusername"
            }
          },
          "@version": "1",
          "beat": {
            "name": "dummyserver",
            "hostname": "dummyserver",
            "version": "5.2.0"
          },
          "metricset": {
            "name": "process",
            "rtt": 86000,
            "module": "system"
          },
          "fields": {
            "environment": "dummy",
            "platform": "dummy"
          },
          "type": "metricsets"
        }
      },
      {
        "_index": "dummyindex.metricbeat.process-2018.04.24",
        "_type": "metricsets",
        "_id": "AWL3Lxx8mRCYGZTwTN_7",
        "_version": 1,
        "_score": 16.051601,
        "_source": {
          "@timestamp": "2018-04-24T10:15:19.152Z",
          "system": {
            "process": {
              "cmdline": """C:\Windows\SysWOW64\inetsrv\w3wp.exe -ap "dummysite.com" -v "v2.0" -l "webengine4.dll" -a \\.\pipe\iisipm0c31c3d0-c542-4143-97ba-fc009da5e4f2 -h "D:\inetpub\temp\appPools\dummysite.com\dummysite.com.config" -w "" -m 0 -t 60""",
              "memory": {
                "share": 0,
                "rss": {
                  "pct": 0.0005,
                  "bytes": 14557184
                },
                "size": 7979008
              },
              "pgid": 0,
              "name": "w3wp.exe",
              "cpu": {
                "start_time": "2018-04-24T08:22:36.782Z",
                "total": {
                  "pct": 0
                }
              },
              "pid": 3172,
              "state": "running",
              "ppid": 2452,
              "username": "dummyusername"
            }
          },
          "@version": "1",
          "beat": {
            "name": "dummyserver",
            "hostname": "dummyserver",
            "version": "5.2.0"
          },
          "metricset": {
            "name": "process",
            "rtt": 86000,
            "module": "system"
          },
          "fields": {
            "environment": "dummy",
            "platform": "dummy"
          },
          "type": "metricsets"
        }
      },
      {
        "_index": "dummyindex.metricbeat.process-2018.04.24",
        "_type": "metricsets",
        "_id": "AWL3KKahRybLbSr_PL_-",
        "_version": 1,
        "_score": 16.051601,
        "_source": {
          "@timestamp": "2018-04-24T10:15:19.152Z",
          "system": {
            "process": {
              "cmdline": """C:\Windows\SysWOW64\inetsrv\w3wp.exe -ap "dummysite.com" -v "v2.0" -l "webengine4.dll" -a \\.\pipe\iisipm0c31c3d0-c542-4143-97ba-fc009da5e4f2 -h "D:\inetpub\temp\appPools\dummysite.com\dummysite.com.config" -w "" -m 0 -t 60""",
              "memory": {
                "share": 0,
                "rss": {
                  "pct": 0.0005,
                  "bytes": 14557184
                },
                "size": 7979008
              },
              "pgid": 0,
              "name": "w3wp.exe",
              "cpu": {
                "start_time": "2018-04-24T08:22:36.782Z",
                "total": {
                  "pct": 0
                }
              },
              "pid": 3172,
              "state": "running",
              "ppid": 2452,
              "username": "dummyusername"
            }
          },
          "@version": "1",
          "beat": {
            "name": "dummyserver",
            "hostname": "dummyserver",
            "version": "5.2.0"
          },
          "metricset": {
            "name": "process",
            "rtt": 86000,
            "module": "system"
          },
          "fields": {
            "environment": "dummy",
            "platform": "dummy"
          },
          "type": "metricsets"
        }
      },
      {
        "_index": "dummyindex.metricbeat.process-2018.04.24",
        "_type": "metricsets",
        "_id": "AWL3KbN48VmxIl-7IeDU",
        "_version": 1,
        "_score": 16.051601,
        "_source": {
          "@timestamp": "2018-04-24T10:15:19.152Z",
          "system": {
            "process": {
              "cmdline": """C:\Windows\SysWOW64\inetsrv\w3wp.exe -ap "dummysite.com" -v "v2.0" -l "webengine4.dll" -a \\.\pipe\iisipm0c31c3d0-c542-4143-97ba-fc009da5e4f2 -h "D:\inetpub\temp\appPools\dummysite.com\dummysite.com.config" -w "" -m 0 -t 60""",
              "memory": {
                "share": 0,
                "rss": {
                  "pct": 0.0005,
                  "bytes": 14557184
                },
                "size": 7979008
              },
              "pgid": 0,
              "name": "w3wp.exe",
              "cpu": {
                "start_time": "2018-04-24T08:22:36.782Z",
                "total": {
                  "pct": 0
                }
              },
              "pid": 3172,
              "state": "running",
              "ppid": 2452,
              "username": "dummyusername"
            }
          },
          "@version": "1",
          "beat": {
            "name": "dummyserver",
            "hostname": "dummyserver",
            "version": "5.2.0"
          },
          "metricset": {
            "name": "process",
            "rtt": 86000,
            "module": "system"
          },
          "fields": {
            "environment": "dummy",
            "platform": "dummy"
          },
          "type": "metricsets"
        }
      },

(Hussain Kothari) #9

And this is our setup, We have metricbeats on Windows 2012 R2 servers which sends data for queing in Kafka, then logstash reads data and does some basic filtering (see logstash configs) then data is send to ES.
Its not possible to check in Kafka if there is duplicate data, hence I can't figure out if beats or Logstash is creating duplicates.

Logstash Config

input{
   kafka{
        bootstrap_servers => "dummykafkaserver"
        topics => ["AppData.MetricSets"]
        group_id => "AppData.MetricSets.Logstash"
        consumer_threads => 8
        session_timeout_ms => "30000"
        max_poll_records => "100"
        }
}
filter{
        # parse the message field of the kafka message as JSON
        json{
                source => "message"
        }
}
------------------------------------------------------------------------
filter {
    if [type] == "metricsets" {
    mutate {
            lowercase => ["computer_name", "[beat][hostname]", "[beat][name]", "host"]
            remove_field => "message"
        }
    }
}
output {
    if [type] == "metricsets" {
        if [metricset][name] == "process" {
                        elasticsearch {
                            index => "metricbeat.process-%{+YYYY.MM.dd}"
                                flush_size => 1000000
                                hosts => ["dummyESnode:50000"]
                                user => "dummy"
                                password=>"dummy"
            }
        }
        else if [metricset][name] == "filesystem" {
                    elasticsearch {
                            index => "appdata.metricbeat.filesystem-%{+YYYY.MM.dd}"
                                flush_size => 1000000
                                hosts => ["dummyESnode:50000"]
                                user => "dummy"
                                password=>"dummy"
                    }
            }
        else if [metricset][name] == "cpu" {
                    elasticsearch {
                            index => "appdata.metricbeat.cpu-%{+YYYY.MM.dd}"
                                flush_size => 1000000
                                hosts => ["dummyESnode:50000"]
                                user => "dummy"
                                password=>"dummy"
                    }
            }
        else if [metricset][name] == "memory" {
                    elasticsearch {
                            index => "appdata.metricbeat.memory-%{+YYYY.MM.dd}"
                                flush_size => 1000000
                                hosts => ["dummyESnode:50000"]
                                user => "dummy"
                                password=>"dummy"
                    }
            }
            else if [metricset][name] == "fsstat" {
                    elasticsearch {
                            index => "appdata.metricbeat.fsstat-%{+YYYY.MM.dd}"
                                hosts => ["dummyESnode:50000"]
                                flush_size => 1000000
                                user => "dummy"
                                password=>"dummy"
                    }
            }
    }
}

(ruflin) #10

Based on the logs you shared I assume no data duplication happened between Metricbeat and Kafka but I'm no Kafka expert so I don't know the conditions that are required that duplicated messages would end up in Kafka. I see in your config you have required_acks: 1 and -1 would be needed to make sure it's commited in all replicas. But I think the risk of using "only" 1 is that you could lose data not that it's duplicated.

How many Logstash instance do you have? Did you see anything interesting in the Kafka, Logstash or Elasticsearch logs around this time?

Did you see the issue only once or several times since the first occurence?


(Hussain Kothari) #11

We have only one logstash instance, and yes we saw this issue for over two weeks now.
And the reason we dont have required_acks: -1 is we have huge amount of data around 300 million per day, which might be make performance slow if it waits for ack from all the replicas. (let me know if you think otherwise)

So finally I solved this by adding a filter plugin in logstash. This takes care of not adding the duplicates in ES even if there are multiple retires attempts from either beats or kafka.

Does this solution sounds good to you? Its working for now, but i feel the ideal way should be beats not repeat data even if it retires.

https://www.elastic.co/guide/en/logstash/5.4/plugins-filters-fingerprint.html#plugins-filters-fingerprint-target


(ruflin) #12

Glad you found a solution. I agree on the part that Beats should not duplicate the data. The part I don't understand yet if Metricbeat actually creates the same data multiple times (rather unlikely), resends data or something happens in ingesting in Kafka or pulling the data from Kafka.

Is the duplicated data always from similar servers for example from servers which are under heavy load or a network under heavy load?


(Hussain Kothari) #13

I agree with you, I still haven't found exact root cause for this, as this was in prod i had to take corrective steps soon.
Also Kafka is a blackbox you cant really see the messages in it, so it become difficult to pinpoint the source of duplication.

No, duplicate data a was not limited to any similar servers, we found it in almost all different servers without any high network loads on them. Strangest thing is this issue started out of nowhere, without any major changes to either the servers or any ELK configs.

Anyhow, i will try reproducing the same issue in sandbox and see if i can get any concrete answers

Thanks for all your help :slight_smile:


(ruflin) #14

Great. Ping me when you get find something interesting. I would also look at the logs in beats if there are any errors or the stats report any non acked events. Same on the Logstash side if there were errors pulling data. Unfortunately don't know much about Kafka logs.


(system) #15

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.