Metricbeat aws cpu total pct missing

Hi @Vinicios_Grein :slightly_smiling_face:

That really looks like a network issue on AWS; maybe they are in a different network group. Anyway, I'm summoning @Kaiyan_Sheng, who may have more info about this.

@Vinicios_Grein Could you show us your metricbeat config for the aws module please? I don't think we have any special config there, hmm. Did you check the AWS CloudWatch metrics portal to see if there are CPU metrics for the same instances at a similar timestamp?

Hi @Kaiyan_Sheng ,
my cfg file is basically the default, with my credentials:

- module: aws
  period: 300s
  metricsets:
    - ec2
  access_key_id: '${AWS_ACCESS_KEY_ID:"my_ak"}'
  secret_access_key: '${AWS_SECRET_ACCESS_KEY:"my_ak"}'
  session_token: '${AWS_SESSION_TOKEN:""}'
  default_region: '${AWS_REGION:sa-east-1}'

From this line to the bottom, I have commented everything out with "#".

I'll check AWS CloudWatch for issues too.

Yeah ok, thanks! Definitely check the AWS CloudWatch portal to see if AWS is reporting the missing CPUUtilization metrics.

Thanks @Kaiyan_Sheng

AWS checked. The metrics are there for all instances.

I've copied this from kibana logs:

The left side is the instance where the CPU metric is missing, but it has some fields that the right side, where the CPU metric exists, does not have. Where are these fields set?

Hmm very interesting, there might be a bug in the code then. I will try to reproduce it on my side! Thanks for verifying.

In the meantime, maybe you can give the cloudwatch metricset a try to collect ec2 metrics:

- module: aws
  period: 300s
  metricsets:
    - cloudwatch
  metrics:
    - namespace: AWS/EC2
      tags.resource_type_filter: ec2:instance
      statistic: ["Average", "Maximum"]

Hi @Kaiyan_Sheng,
I have activated cloudwatch in aws.yml and deactivated the EC2 metrics.
In kibana I've filtered 3 instances. The first one, the one with the problem (the same instance as with ec2), doesn't appear in the log, and the other ones do.

Hi @Vinicios_Grein, sorry, I just got back from Thanksgiving break. I will continue here and try to reproduce the issue.

With the cloudwatch metricset, do you mean you are seeing the same issue, with metrics missing from one specific instance?

Hi @Kaiyan_Sheng,
that's ok :slight_smile:

I was trying another thing, but no success.

When I tried the ec2 metricset, 3 instances out of 20 had no "ec2.cpu.total.pct" field in the log.
I've activated the cloudwatch metrics as you suggested, but the field is missing for these same 3 instances, in this case "aws.metrics.CPUUtilization".

I tried to configure a new user on AWS with these settings:
image

Hmm, I'm sorry, I can't reproduce it on my AWS account. I have 12 EC2 instances spread across several different regions and all of their cloudwatch metrics get collected.

Are these 3 instances in the same region in your account? Are they in the running state?

Yes, they are "running" and the region is "sa-east-1".
I've tried reinstalling metricbeat on my instance but the result was the same.
Now I've tried installing it on another instance. One of these 3 instances started to show the metrics correctly, but the other ones did not. All the other instances continue to appear correctly.
Is there any limit on the number of instances per metricbeat?
How can I configure specific instance IDs for ec2 or cloudwatch?

@Kaiyan_Sheng

I tried another thing. I started cloudwatch in metricbeat with all the problem instances and some others that are normal. This time all the fields appeared. I think there is some kind of limit on monitoring. I've counted and there are 36 instances in my account; I told you 20, sorry.
Can you confirm whether this limit exists and, if so, how many instances metricbeat can support?

@Vinicios_Grein Thank you so much for investigating this!! Sounds like we hit a limit hmmm

Maybe the collection period is shorter than the time it takes to collect from all the instances. What period do you have set for metricbeat right now? If that's the case, running several metricbeats in parallel, each collecting from specific regions, should help. Or maybe just specify different regions for separate module sessions in aws.yml. For example:

- module: aws
  period: 300s
  metricsets:
    - ec2
  credential_profile_name: test-mb
  regions:
    - us-east-1
    - us-east-2
- module: aws
  period: 300s
  metricsets:
    - ec2
  credential_profile_name: test-mb
  regions:
    - us-west-1
    - us-west-2
- module: aws
  period: 300s
  metricsets:
    - ec2
  credential_profile_name: test-mb
  regions:
    - sa-east-1
    - ap-southeast-1

Note that this config does not run metricbeats in parallel. If you want to try running multiple metricbeats, you can download 3 metricbeat binaries and split the config above into 3 separate aws.yml files, one per metricbeat. I think that will solve the problem unless the limit is on the AWS side. This is not an ideal solution; I'm looking into a better way to solve it. Thanks!!

The period is set to 300s; I also tried 60s and 600s.
All the instances are in "sa-east-1".

image

I'll configure all the necessary instances through cloudwatch to see the results.

Thank you so much for trying!! I just created a github issue to track this problem: https://github.com/elastic/beats/issues/14926

I'm currently trying to reproduce it in my aws test account. I have 53 instances created but still haven't seen this issue. Will create more in different regions and see if that changes things.

Sorry @Vinicios_Grein just want to make sure I didn't miss this.

I'm seeing that one of the instances from the previous collection period shows an empty aws.ec2.cpu.total.pct value, but it reports other values like aws.ec2.cpu.credit_balance just fine.

When you see an empty aws.ec2.cpu.total.pct, do you see values for other metrics for the same instance?

One more question: are any of the missing EC2 instances in your environment ECS related? Not sure if it matters but just checking. Thanks!

Thanks so much for everything so far, @Kaiyan_Sheng, you are the best!

None of them are ECS related, just EC2.

Here is the full log from an affected instance:

{
"_index": "metricbeat-7.4.2-2019.12.03-000002",
"_type": "_doc",
"_id": "i1R-zm4BBQOQ-PUcrajf",
"_version": 1,
"_score": null,
"_source": {
"@timestamp": "2019-12-04T01:20:23.957Z",
"service": {
"type": "aws"
},
"ecs": {
"version": "1.1.0"
},
"host": {
"name": "ip-",
"hostname": "ip-",
"architecture": "x86_64",
"os": {
"platform": "ubuntu",
"version": "16.04.5 LTS (Xenial Xerus)",
"family": "debian",
"name": "Ubuntu",
"kernel": "4.4.0-1098-aws",
"codename": "xenial"
},
"id": "ec2e96c3921ea11e248df16a6b71f076",
"containerized": false
},
"agent": {
"type": "metricbeat",
"ephemeral_id": "0c552575-a6c0-407b-bd09-3295eb10d7b3",
"hostname": "ip-",
"id": "2c7abdac-9a7d-4562-b0b8-76637a1b0dc9",
"version": "7.4.2"
},
"cloud": {
"instance": {
"id": "i-02bd802bbd4002504"
},
"machine": {
"type": "c5.xlarge"
},
"availability_zone": "sa-east-1a",
"provider": "aws",
"region": "sa-east-1"
},
"event": {
"dataset": "aws.ec2",
"module": "aws",
"duration": 14851523222
},
"metricset": {
"name": "ec2",
"period": 600000
},
"aws": {
"ec2": {
"status": {
"check_failed_system": 0,
"check_failed_instance": 0
},
"cpu": {
"total": {}
},
"instance": {
"state": {
"name": "running",
"code": 16
},
"monitoring": {
"state": "disabled"
},
"core": {
"count": 2
},
"threads_per_core": 2,
"public": {
"ip": "",
"dns_name": "ec2-.sa-east-1.compute.amazonaws.com"
},
"private": {
"dns_name": "ip-.sa-east-1.compute.internal",
"ip": ""
},
"image": {
"id": "ami-10186f7c"
}
},
"diskio": {
"read": {},
"write": {}
},
"network": {
"in": {
"bytes": 4803463.2,
"packets": 18453.5,
"bytes_per_sec": 16011.544,
"packets_per_sec": 61.51166666666666
},
"out": {
"packets": 20906.9,
"bytes": 6373195.7,
"bytes_per_sec": 21243.985666666667,
"packets_per_sec": 69.68966666666667
}
}
},
"tags": {
"fin_tipo": "PRD",
"fin_aplicacao": "App",
"Name": "Cloud Ubuntu",
"monitoramento": "sim",
"Aplicacao": "Sis"
}
}
},
"fields": {
"@timestamp": [
"2019-12-04T01:20:23.957Z"
]
},
"highlight": {
"cloud.instance.id": [
"@kibana-highlighted-field@i-02bd802bbd4002504@/kibana-highlighted-field@"
]
},
"sort": [
1575422423957
]
}

And here is one that's fine:

{
"_index": "metricbeat-7.4.2-2019.12.03-000002",
"_type": "_doc",
"_id": "kFWHzm4BBQOQ-PUc2STq",
"_version": 1,
"_score": null,
"_source": {
"@timestamp": "2019-12-04T01:30:23.957Z",
"cloud": {
"region": "sa-east-1",
"instance": {
"id": "i-08708303e4916fc15"
},
"machine": {
"type": "m5.2xlarge"
},
"availability_zone": "sa-east-1c",
"provider": "aws"
},
"metricset": {
"period": 600000,
"name": "ec2"
},
"event": {
"module": "aws",
"duration": 15019251360,
"dataset": "aws.ec2"
},
"aws": {
"tags": {
"Aplicacao": "",
"Name": "Cloud6",
"fin_tipo": "PRD",
"fin_aplicacao": "App2"
},
"ec2": {
"cpu": {
"total": {
"pct": 4.6
}
},
"instance": {
"public": {
"ip": "",
"dns_name": "ec2-.sa-east-1.compute.amazonaws.com"
},
"private": {
"ip": "",
"dns_name": "ip-.sa-east-1.compute.internal"
},
"image": {
"id": "ami-0e4e25c13f561aca0"
},
"state": {
"name": "running",
"code": 16
},
"monitoring": {
"state": "disabled"
},
"core": {
"count": 4
},
"threads_per_core": 2
},
"diskio": {
"write": {},
"read": {}
},
"network": {
"in": {
"packets_per_sec": 22.048333333333332,
"packets": 6614.5,
"bytes": 1583963.4,
"bytes_per_sec": 5279.878
},
"out": {
"packets": 5710.8,
"packets_per_sec": 19.036
}
},
"status": {
"check_failed_system": 0,
"check_failed": 0,
"check_failed_instance": 0
}
}
},
"service": {
"type": "aws"
},
"host": {
"name": "ip-",
"os": {
"kernel": "4.4.0-1098-aws",
"codename": "xenial",
"platform": "ubuntu",
"version": "16.04.5 LTS (Xenial Xerus)",
"family": "debian",
"name": "Ubuntu"
},
"id": "ec2e96c3921ea11e248df16a6b71f076",
"containerized": false,
"hostname": "ip-",
"architecture": "x86_64"
},
"agent": {
"ephemeral_id": "0c552575-a6c0-407b-bd09-3295eb10d7b3",
"hostname": "ip-",
"id": "2c7abdac-9a7d-4562-b0b8-76637a1b0dc9",
"version": "7.4.2",
"type": "metricbeat"
},
"ecs": {
"version": "1.1.0"
}
},
"fields": {
"@timestamp": [
"2019-12-04T01:30:23.957Z"
]
},
"highlight": {
"cloud.instance.id": [
"@kibana-highlighted-field@i-08708303e4916fc15@/kibana-highlighted-field@"
]
},
"sort": [
1575423023957
]
}
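
The difference between the two documents above comes down to whether `aws.ec2.cpu.total` contains a `pct` key. As a rough sketch for checking exported documents, something like the following could be used. `has_cpu_pct` is a hypothetical helper name, and the two dictionaries are trimmed-down copies of the `_source` objects from the logs above, not full documents.

```python
def has_cpu_pct(source: dict) -> bool:
    """Return True if the document carries aws.ec2.cpu.total.pct."""
    total = source.get("aws", {}).get("ec2", {}).get("cpu", {}).get("total", {})
    return "pct" in total


# Trimmed copies of the two `_source` objects shown in the thread.
missing = {
    "cloud": {"instance": {"id": "i-02bd802bbd4002504"}},
    "aws": {"ec2": {"cpu": {"total": {}}}},
}
fine = {
    "cloud": {"instance": {"id": "i-08708303e4916fc15"}},
    "aws": {"ec2": {"cpu": {"total": {"pct": 4.6}}}},
}

for doc in (missing, fine):
    print(doc["cloud"]["instance"]["id"], has_cpu_pct(doc))
```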

Hi @Kaiyan_Sheng
Thank you so much for everything.

Using the cloudwatch metricset with "dimensions" is working for me.
I've put all the instances that I needed into aws.yml, like this:

- module: aws
  period: 60s
  metricsets:
    - cloudwatch
  access_key_id: "my_keyID"
  secret_access_key: "my_AK"
  default_region: '${AWS_REGION:sa-east-1}'
  metrics:
    - namespace: AWS/EC2
      name: ["CPUUtilization"]
      tags.resource_type_filter: ec2:instance
      statistic: ["Average", "Maximum"]
      dimensions:
        - name: InstanceId
          value: i-018716881a96f13e4 
    - namespace: AWS/EC2
      name: ["CPUUtilization"]
      tags.resource_type_filter: ec2:instance
      statistic: ["Average", "Maximum"]
      dimensions:
        - name: InstanceId
          value: i-08751c22a26621e72
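
With 36 instances, writing one block per instance by hand gets repetitive. As a minimal sketch, assuming the same layout as the config above, the `metrics:` section could be generated from a list of instance IDs. `cloudwatch_metrics_yaml` is a hypothetical helper, not part of metricbeat, and the output should be checked against your own aws.yml before use.

```python
def cloudwatch_metrics_yaml(instance_ids):
    """Build the `metrics:` section with one CPUUtilization entry per instance ID."""
    lines = ["  metrics:"]
    for instance_id in instance_ids:
        lines += [
            "    - namespace: AWS/EC2",
            '      name: ["CPUUtilization"]',
            "      tags.resource_type_filter: ec2:instance",
            '      statistic: ["Average", "Maximum"]',
            "      dimensions:",
            "        - name: InstanceId",
            f"          value: {instance_id}",
        ]
    return "\n".join(lines)


# The two instance IDs from the config above.
print(cloudwatch_metrics_yaml(["i-018716881a96f13e4", "i-08751c22a26621e72"]))
```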

@Vinicios_Grein Thank you!! I was finally able to reproduce it and found the bug. I missed pagination in one of the AWS API calls. I will push the fix up for review shortly. Thanks!
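
To illustrate why a missed pagination step produces exactly this symptom (some instances silently lacking a metric once the account grows past a certain size), here is a generic sketch. It is not the actual metricbeat fix, and the thread does not say which API was affected. `fake_list_metrics` is a made-up stand-in for a paginated call that returns a continuation token, and the instance IDs are placeholders.

```python
def fake_list_metrics(next_token=None, page_size=2):
    """Stand-in for a paginated API: returns one page plus a continuation token."""
    instance_ids = ["i-aaa", "i-bbb", "i-ccc", "i-ddd", "i-eee"]
    start = int(next_token) if next_token else 0
    end = start + page_size
    # A token is only returned while more results remain.
    token = str(end) if end < len(instance_ids) else None
    return {"Metrics": instance_ids[start:end], "NextToken": token}


def collect_first_page_only():
    """The buggy pattern: the continuation token is ignored."""
    return fake_list_metrics()["Metrics"]


def collect_all_pages():
    """Keep requesting pages until no continuation token is returned."""
    results, token = [], None
    while True:
        response = fake_list_metrics(next_token=token)
        results.extend(response["Metrics"])
        token = response.get("NextToken")
        if not token:
            return results


print(len(collect_first_page_only()), "of", len(collect_all_pages()), "instances seen without pagination")
```

With few instances everything fits on the first page, which would explain why the problem did not show up in smaller test accounts.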