Get correct disk usage stats: ES API vs. X-Pack monitoring


#1

Hello all,

I'm trying to improve my Elasticsearch monitoring and need a reliable way to query the cluster's total free disk space (that is, the space available to Elasticsearch).

In Kibana > Monitoring I see

[image: Kibana Monitoring screenshot, not preserved]

Using the Elasticsearch /_cluster/stats API endpoint I see

$ curl -s -XGET 'http://es_server:9200/_cluster/stats?human&pretty' | jq .nodes.fs
{
  "total": "3.4tb",
  "total_in_bytes": 3838908841984,
  "free": "1.7tb",
  "free_in_bytes": 1878063632384,
  "available": "1.7tb",
  "available_in_bytes": 1878063632384
}

This shows about 50% free.

I have 20 ES nodes running on 4 machines. On each machine there is one 894GB SSD drive for each ES node. (On each machine there is one more 894GB drive for OS, etc. which I hope to disregard in my calculations...)

This means ES has 20 x 894 GB ≈ 17.5 TB of total disk space it can use, which does not match anything I see in Kibana monitoring or from the API.

df -B 1 for one of the SSD drives shows 959727210496 bytes.
This means that the total_in_bytes reported by the ES API is exactly 4 times the df size of one SSD, not 20 times.

SSD: 959727210496
20 x SSD: 19194544209920
total_in_bytes: 3838908841984
total_in_bytes/SSD: 4
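
The arithmetic above can be double-checked with a few lines of bash. This is only a sketch: the numbers are copied from the df and /_cluster/stats output in this post, and the variable names are made up for illustration.

```shell
#!/bin/bash
# Values copied from the df and /_cluster/stats output in this post.
SSD_BYTES=959727210496          # size of one SSD according to df
NODES=20                        # ES nodes, one SSD each
API_TOTAL_BYTES=3838908841984   # nodes.fs.total_in_bytes from the API

# What the total should be if every node's disk were counted.
EXPECTED_TOTAL=$(( SSD_BYTES * NODES ))

# How many SSDs the API total actually corresponds to.
DISKS_COUNTED=$(( API_TOTAL_BYTES / SSD_BYTES ))
REMAINDER=$(( API_TOTAL_BYTES % SSD_BYTES ))

echo "expected total: $EXPECTED_TOTAL bytes"
echo "disks counted by the API: $DISKS_COUNTED (remainder $REMAINDER bytes)"
```

A remainder of 0 means the API total is an exact multiple of one SSD, so it looks like whole disks are missing from the sum rather than the sizes being rounded differently.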

I have default replica and sharding settings, so 5 shards and 1 replica.

How do you guys do it?

Cheers,
AB


#2

Playing around a bit I came to this solution for now...

#!/bin/bash

set -o nounset
set -o errexit

WARN=70
CRIT=80

while getopts w:c: option; do
  case $option in
    w) WARN=$OPTARG;;
    c) CRIT=$OPTARG;;
  esac
done

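ES_HOST=$(hostname -s)
# One disk.used_percent value per node, one value per line
DISK_USE=$(curl -s -XGET "http://$ES_HOST:9200/_cat/nodes?h=disk.used_percent")

USED_PERCENT=0
NODE_COUNT=0

# Sum the per-node percentages and count the nodes
for i in $DISK_USE; do
  USED_PERCENT=$(echo "$USED_PERCENT + $i" | bc)
  NODE_COUNT=$(( NODE_COUNT + 1 ))
done

# Avoid dividing by zero if the API returned nothing
if [ "$NODE_COUNT" -eq 0 ]; then
  echo "No disk usage data returned"
  exit 3
fi

# bc truncates to an integer here (scale defaults to 0)
DISK_USAGE=$(echo "$USED_PERCENT / $NODE_COUNT" | bc)

echo "Disk space used is $DISK_USAGE percent"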

if [ "$DISK_USAGE" -gt "$CRIT" ]; then
  echo "Very bad"
  exit 2
elif [ "$DISK_USAGE" -gt "$WARN" ]; then
  echo "Pretty bad"
  exit 1
else
  echo "All good"
  exit 0
fi

As all nodes have the same disk size this should work, right?

$ curl -s -XGET "http://$ES_HOST:9200/_cat/nodes?h=disk.used_percent"
37.41
71.41
75.18
26.79
66.21
30.23
65.56
37.23
24.99
71.71
42.00
77.94
37.48
69.25
24.89
64.06
36.70
66.85
74.68
20.76
$ ./es_disk_available.sh
Disk space used is 51 percent
All good

The question is whether I can trust this number, since Kibana is reporting something completely different...
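
One way to cross-check the plain average is to weight by bytes instead of averaging percentages, so that the result stays correct even if the disks are not all the same size. The sketch below assumes that the disk.used and disk.total columns of _cat/nodes and the bytes=b parameter are available in the ES version in use; the function name weighted_used_percent is made up for illustration.

```shell
#!/bin/bash
# Sketch: weighted disk usage from per-node byte counts.
# Reads lines of "<used_bytes> <total_bytes>" on stdin, one line per node.
# Assumed source of the input (columns must exist in your ES version):
#   curl -s "http://$ES_HOST:9200/_cat/nodes?h=disk.used,disk.total&bytes=b"
weighted_used_percent() {
  awk '
    # Only use lines that carry both values.
    NF == 2 { used += $1; total += $2 }
    END {
      # Refuse to divide by zero when no usable line was read.
      if (total == 0) { print "no disk data" > "/dev/stderr"; exit 1 }
      printf "%.2f\n", 100 * used / total
    }'
}
```

Usage would be to pipe the curl command from the comment into weighted_used_percent. With equal disk sizes the result should land close to the plain average from the script above; a large gap would point to differing disk sizes or missing nodes.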

-AB

