Alerting on watermark threshold - Available API or field inside elasticsearch?

chouben · July 27, 2023, 2:30pm

Hi

I'm looking for a way to monitor the disk watermark thresholds. We would like to alert as soon as the first (low) threshold is exceeded.
I do not want to hard-code sizes in the alerts. I would rather rely on the check inside Elasticsearch that already knows the watermark state, because otherwise the alerts break whenever we reconfigure the watermark thresholds.

Our setup consists of Elasticsearch plus Grafana for alerting. Grafana simply queries indices to evaluate the alerts.

When a watermark threshold is exceeded, we see the message below in the logs:

2023-07-27 15:44:05 {"@timestamp":"2023-07-27T13:44:05.882Z", "log.level": "INFO", "message":"low disk watermark [240gb] exceeded on [YficartHQVa8ym5DxQ_p3Q][elasticsearch-8-7-1][/usr/share/elasticsearch/data] free: 234.1gb[93.2%], replicas will not be assigned to this node", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-8-7-1][management][T#2]","log.logger":"org.elasticsearch.cluster.routing.allocation.DiskThresholdMonitor","elasticsearch.cluster.uuid":"gcrubXYdSbO15GHVNFKxoQ","elasticsearch.node.id":"YficartHQVa8ym5DxQ_p3Q","elasticsearch.node.name":"elasticsearch-8-7-1","elasticsearch.cluster.name":"docker-cluster"}

However, I can't seem to find any API that indicates a watermark has been exceeded. Is there an API that returns this information?

The closest APIs I found:

GET /_health_report/disk
-> It still returns green status when the low threshold is exceeded
-> It does return red status for the flood stage, however
GET _cat/allocation?v=true
-> It returns the disk sizes, but gives no indication of the watermark thresholds
Metricbeat indices contain: system.filesystem.free
-> We would have to duplicate the configured watermark thresholds in an alert, which is undesired
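
One possible way to avoid duplicating the thresholds, given here only as a minimal sketch: read the currently effective watermark setting from Elasticsearch and compare it with per-node disk numbers. The sketch assumes the settings dictionary comes from GET _cluster/settings?include_defaults=true&flat_settings=true and that the disk numbers come from something like GET _nodes/stats/fs. The helper names (effective_setting, parse_watermark, watermark_exceeded) are hypothetical, and which "free" figure Elasticsearch itself compares against is an assumption to verify for your version.

```python
import re

# Setting name for the low watermark; the high and flood_stage variants follow the same pattern.
LOW_KEY = "cluster.routing.allocation.disk.watermark.low"

# Elasticsearch byte-size suffixes are powers of 1024.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4, "pb": 1024**5}


def effective_setting(settings, key):
    """Pick a flat setting, preferring transient over persistent over defaults."""
    for scope in ("transient", "persistent", "defaults"):
        if key in settings.get(scope, {}):
            return settings[scope][key]
    return None


def parse_watermark(value):
    """Return ("max_used_ratio", r) or ("min_free_bytes", n) for a watermark value."""
    text = str(value).strip().lower()
    if text.endswith("%"):
        return ("max_used_ratio", float(text[:-1]) / 100.0)
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(pb|tb|gb|mb|kb|b)", text)
    if match:
        return ("min_free_bytes", int(float(match.group(1)) * _UNITS[match.group(2)]))
    return ("max_used_ratio", float(text))  # bare ratio such as "0.85"


def watermark_exceeded(value, total_bytes, free_bytes):
    """True if a node with the given disk numbers is past the watermark."""
    kind, threshold = parse_watermark(value)
    if kind == "min_free_bytes":
        # Absolute watermarks (e.g. "240gb") are a minimum amount of free space.
        return free_bytes < threshold
    # Percentage or ratio watermarks are a maximum share of used space.
    return (1.0 - free_bytes / total_bytes) > threshold
```

With the numbers from the log line above (watermark 240gb, 234.1gb free), watermark_exceeded reports the threshold as passed. A small script or scheduled job could index that boolean per node so that Grafana alerts on it without knowing the configured value.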

If such an API exists, I could use Heartbeat to call it and parse the response. Any field that Elasticsearch stores itself, e.g. via monitoring, could be used too.
Once the data is in an index, I can query it via Elasticsearch and Grafana can alert on it.
Reminder: I don't want to use fixed values in the alerting, because the thresholds could change in the future. The alerting should be resilient to such changes.

Thanks!
Christof

chouben · July 28, 2023, 12:05pm

It seems my only option is:

Feed the log file into Elasticsearch via Filebeat
Filter the logs via a processor to only ingest WARN & ERROR logs
Query the index & alert
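
The filtering step above can be sketched outside Filebeat as a small parser, which makes it easier to see which fields an alert would rely on. This is a hedged illustration, not a Filebeat configuration: the function name watermark_event is hypothetical, and the "high" and "flood stage" message prefixes are assumptions modelled on the "low" message quoted earlier in the thread. Note that the quoted sample is logged at INFO, so a filter that keeps only WARN and ERROR would drop it; matching on the message text or on the log.logger field may be the safer criterion.

```python
import json
import re

# Matches messages such as "low disk watermark [240gb] exceeded on ...".
# Only the "low" form appears in the thread; the other two stages are assumed.
_WATERMARK_RE = re.compile(r"^(low|high|flood stage) disk watermark \[([^\]]+)\] exceeded on")


def watermark_event(raw_line):
    """Extract watermark details from one JSON log line, or return None."""
    start = raw_line.find("{")  # skip any prefix in front of the JSON, e.g. a Docker timestamp
    if start < 0:
        return None
    try:
        record = json.loads(raw_line[start:])
    except json.JSONDecodeError:
        return None
    match = _WATERMARK_RE.match(record.get("message", ""))
    if not match:
        return None
    return {
        "stage": match.group(1),
        "threshold": match.group(2),
        "level": record.get("log.level"),
        "node": record.get("elasticsearch.node.name"),
        "timestamp": record.get("@timestamp"),
    }
```

Because the configured threshold is part of the message itself, an alert built on these events keeps working when the watermark settings change.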
