Nvidiagpubeat: Monitor Nvidia GPUs using this beat


(Deepak Jain) #1

nvidiagpubeat can be used to monitor NVIDIA GPUs.

nvidiagpubeat.yml has two configuration options:

  1. query
    By default, the beat queries "utilization.gpu,utilization.memory,memory.total,memory.free,memory.used,temperature.gpu,pstate" using the nvidia-smi utility installed on machines with NVIDIA GPUs. It can be configured to collect any other metric that nvidia-smi supports.

  2. env: possible values are test and cluster
    Set env:test (the default) to run nvidiagpubeat locally and verify that the beat works. In test mode, nvidiagpubeat invokes a dummy localnvidiasmi executable that generates mock metrics for the default configuration.
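
To illustrate how the query setting relates to nvidia-smi, here is a rough Python sketch (not the beat's actual Go code) that turns a query string into an nvidia-smi command line and parses CSV output into one record per GPU. The function names, the gpu_index field, and the sample values are made up for illustration; the only things taken from nvidia-smi itself are the --query-gpu and --format=csv,noheader,nounits options.

```python
import csv
import io

# Default metric list quoted in the post above.
DEFAULT_QUERY = (
    "utilization.gpu,utilization.memory,memory.total,"
    "memory.free,memory.used,temperature.gpu,pstate"
)


def build_command(query=DEFAULT_QUERY):
    """Return the nvidia-smi argument list for a comma-separated query."""
    return ["nvidia-smi", "--query-gpu=" + query, "--format=csv,noheader,nounits"]


def parse_output(query, output):
    """Parse CSV output (one line per GPU) into a list of dicts.

    Keys are the queried field names; purely numeric values become ints,
    everything else (for example pstate) stays a string.
    """
    fields = [name.strip() for name in query.split(",")]
    events = []
    reader = csv.reader(io.StringIO(output), skipinitialspace=True)
    for row in reader:
        if not row:  # skip blank lines
            continue
        # gpu_index is a hypothetical field: the position of the GPU in the output.
        event = {"gpu_index": len(events)}
        for name, raw in zip(fields, row):
            raw = raw.strip()
            event[name] = int(raw) if raw.isdigit() else raw
        events.append(event)
    return events


if __name__ == "__main__":
    # Hypothetical sample in the shape nvidia-smi prints with noheader,nounits.
    sample = "35, 12, 16280, 14000, 2280, 61, P0\n0, 0, 16280, 16280, 0, 34, P8\n"
    print(build_command())
    for event in parse_output(DEFAULT_QUERY, sample):
        print(event)
```

On a machine with a GPU, the sample string could be replaced by the output of subprocess.run(build_command(), capture_output=True, text=True).stdout; in test mode the beat's mock executable plays that role instead.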

I followed https://www.elastic.co/guide/en/beats/libbeat/current/new-beat.html and https://www.elastic.co/guide/en/beats/libbeat/current/community-beats.html while writing this beat.

I would appreciate feedback and suggestions for improving it.

PR : https://github.com/elastic/beats/pull/3794
nvidiagpubeat: https://github.com/deepujain/nvidiagpubeat/


(Deepak Jain) #2

Local Execution Instructions:

git clone https://github.com/deepujain/nvidiagpubeat.git
cd nvidiagpubeat
export PATH=$PATH:.:
./nvidiagpubeat -e -d "*"


(Carlos Pérez Aradros) #3

So cool!

I understand that you want to monitor these stats in GPU computing scenarios?

Cheers


(Deepak Jain) #4

Yes. Anyone with NVIDIA GPUs in her Kubernetes (or any other) cluster can monitor them by using this beat to index GPU metrics into an ES cluster.

Once the metrics are ingested into ES, they can be visualized in Kibana.

