nvidiagpubeat: Monitor NVIDIA GPUs using this beat

nvidiagpubeat can be used to monitor NVIDIA GPUs.

nvidiagpubeat.yml has two configuration options:

  1. query
    By default, the beat queries "utilization.gpu,utilization.memory,memory.total,memory.free,memory.used,temperature.gpu,pstate" using the nvidia-smi utility installed on NVIDIA GPU devices. It can be configured to query any other metric that nvidia-smi supports.

  2. env: either test or cluster
    nvidiagpubeat can be run locally to test beat functionality. Set env: test (the default) to run nvidiagpubeat locally and verify that it works. In test mode, nvidiagpubeat invokes a dummy localnvidiasmi executable that generates mock metrics for the default query.
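
To make the query option more concrete, here is a minimal Go sketch of how a query string could be turned into nvidia-smi arguments and how the CSV output could be mapped back to field names. This is not the code from the nvidiagpubeat repository: the function names smiArgs and parseSMI are hypothetical, and the sample output in main is made up for illustration. The only things taken from real tooling are the nvidia-smi flags --query-gpu and --format=csv,noheader,nounits.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultQuery is the default query string quoted in the post above.
const defaultQuery = "utilization.gpu,utilization.memory,memory.total,memory.free,memory.used,temperature.gpu,pstate"

// smiArgs (hypothetical helper) builds nvidia-smi arguments for a query.
// With csv,noheader,nounits the tool prints one comma-separated line per GPU.
func smiArgs(query string) []string {
	return []string{"--query-gpu=" + query, "--format=csv,noheader,nounits"}
}

// parseSMI (hypothetical helper) maps every output line to field -> value,
// using the order of the fields in the query string.
func parseSMI(query, output string) ([]map[string]string, error) {
	fields := strings.Split(query, ",")
	var gpus []map[string]string
	for _, line := range strings.Split(strings.TrimSpace(output), "\n") {
		if strings.TrimSpace(line) == "" {
			continue // skip blank lines
		}
		values := strings.Split(line, ",")
		if len(values) != len(fields) {
			return nil, fmt.Errorf("expected %d values, got %d in %q", len(fields), len(values), line)
		}
		gpu := make(map[string]string, len(fields))
		for i, field := range fields {
			gpu[field] = strings.TrimSpace(values[i])
		}
		gpus = append(gpus, gpu)
	}
	return gpus, nil
}

func main() {
	fmt.Println("nvidia-smi", strings.Join(smiArgs(defaultQuery), " "))

	// Made-up sample output for two GPUs; real values come from nvidia-smi.
	sample := "45, 12, 16280, 15000, 1280, 61, P0\n0, 0, 16280, 16280, 0, 35, P8\n"
	gpus, err := parseSMI(defaultQuery, sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for i, gpu := range gpus {
		fmt.Printf("gpu %d: utilization.gpu=%s pstate=%s\n", i, gpu["utilization.gpu"], gpu["pstate"])
	}
}
```

Keeping the parser driven by the query string means that adding another supported nvidia-smi metric to the query needs no change to the parsing code.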

I followed https://www.elastic.co/guide/en/beats/libbeat/current/new-beat.html and https://www.elastic.co/guide/en/beats/libbeat/current/community-beats.html while writing this beat.

I would appreciate feedback and suggestions to improve it.

PR : https://github.com/elastic/beats/pull/3794
nvidiagpubeat: https://github.com/deepujain/nvidiagpubeat/

Local Execution Instructions:

git clone https://github.com/deepujain/nvidiagpubeat.git
cd nvidiagpubeat
export PATH=$PATH:.:
./nvidiagpubeat -e -d "*"

So cool!

I understand that you want to monitor these stats in GPU computing scenarios?

Cheers

Yes. Anyone with NVIDIA GPUs in her Kubernetes (or any other) cluster can monitor them by indexing their metrics into an ES cluster with this beat.

Kibana visualization is possible once the metrics are ingested into ES.