Kibana Error Reporting

I’m interested in other people’s views of Kibana error reporting, particularly in reference to using Dev Tools. During some recent reorganization of indexes I was presented with a 502 error. To cut a long story short, the issue was that disk utilization had gone above the high watermark.

What surprised me was how meaningless the Kibana error messages were. I believe that when Kibana queried ES, a useful error message probably came back, but Kibana translated it into a generic error message for the user.

I am just running a single-node system on Kibana + ES v9.3.2, and this got me thinking: if one were to run ES + Kibana across multiple nodes, with all the additional issues then possible, what would troubleshooting be like when things went wrong?

My questions are therefore the following:

Is using Kibana in a multi-node environment a bit of a nightmare when determining why something is not working?

Do admins get around this by setting up a lot of alerting to pick up faults early?

Kibana Dev Tools is useful, but it's not a substitute for upstream alerting. The 502 error you described is a good illustration of why.

A few examples of what we monitor:

  1. Heap utilization and disk usage above 75% : alert before the high watermark kicks in, not after. We use a Watcher hitting `_cat/allocation` on a scheduled interval.
  2. Cluster status yellow/red : the simplest case, covered by a Watcher polling `_cluster/health`. Can also be handled via Kibana Stack Monitoring alerts if you have that set up.
  3. Unassigned shards persisting beyond a short grace period : Watcher on `_cluster/health?level=shards`, or a Kibana Alerting rule against `.monitoring-es-*` indices if Metricbeat is already feeding them.
  4. Node departures : Metricbeat tracking `elasticsearch.cluster.stats.nodes.count`, with a threshold alert in Kibana Alerting. Watcher works here too, but Metricbeat gives you the historical trend.
  5. Elevated JVM GC frequency : Metricbeat only, really. This one needs time-series data to be meaningful; a one-shot Watcher query doesn't give you enough context.
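For item 2, a minimal sketch of the `_cluster/health` watch, runnable from Dev Tools (the watch id, interval, and `localhost` host are illustrative; I've used a `logging` action as a placeholder for whatever notification channel you actually use):

```
PUT _watcher/watch/cluster_health_watch
{
  "trigger": { "schedule": { "interval": "1m" } },
  "input": {
    "http": {
      "request": { "host": "localhost", "port": 9200, "path": "/_cluster/health" }
    }
  },
  "condition": {
    "compare": { "ctx.payload.status": { "not_eq": "green" } }
  },
  "actions": {
    "notify": {
      "logging": { "text": "Cluster status is {{ctx.payload.status}}" }
    }
  }
}
```

In production you would swap the `logging` action for `email`, `slack`, or a `webhook` action, and probably add a throttle period so a lingering yellow state doesn't spam you.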
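For item 1, a sketch of the disk-usage watch. This assumes the documented Watcher behaviour that an HTTP input returning a JSON array is wrapped under `ctx.payload._value`; the 75% threshold matches the item above, and the watch id and host are illustrative. Note `_cat/allocation?format=json` returns `disk.percent` as a string, hence the `Integer.parseInt`:

```
PUT _watcher/watch/disk_usage_watch
{
  "trigger": { "schedule": { "interval": "5m" } },
  "input": {
    "http": {
      "request": {
        "host": "localhost",
        "port": 9200,
        "path": "/_cat/allocation",
        "params": { "format": "json" }
      }
    }
  },
  "condition": {
    "script": {
      "source": "for (node in ctx.payload._value) { if (node['disk.percent'] != null && Integer.parseInt(node['disk.percent']) >= 75) { return true } } return false"
    }
  },
  "actions": {
    "log_warning": {
      "logging": { "text": "At least one node is above 75% disk usage" }
    }
  }
}
```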
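For items 3–5, the Metricbeat side is just the `elasticsearch` module shipping into the `.monitoring-es-*` indices. A sketch of the module config, assuming a default Metricbeat install (the host URL is illustrative; add credentials or TLS settings to match your cluster):

```yaml
# modules.d/elasticsearch-xpack.yml
- module: elasticsearch
  # Ship in Stack Monitoring format so Kibana Alerting rules
  # and the Stack Monitoring UI can consume the data.
  xpack.enabled: true
  period: 10s
  hosts: ["http://localhost:9200"]
```

Once this is feeding data, threshold rules on fields like `elasticsearch.cluster.stats.nodes.count` can be built in Kibana Alerting, and you get the historical trend that a one-shot Watcher query can't give you.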