Simple Observability of Metrics from Apache Spark

We're trying to collect basic metrics from Apache Spark. We realize ES-Hadoop provides Hadoop metrics, but it seems heavy-handed for what we want: we aren't trying to wire up full-featured interaction between Hadoop and Elasticsearch, track Spark's RDDs, or use anything else on ES-Hadoop's feature list.

Apache Spark's documentation for Monitoring and Instrumentation describes how the internal metrics system is based on Dropwizard (there is an actual Dropwizard Metricbeat module, but it doesn't seem to apply here). The documentation describes a number of "sinks", including one for Graphite, but none specifically for Elasticsearch.
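For reference, enabling the Graphite sink only takes a few lines in conf/metrics.properties per Spark's docs; the host below is a placeholder for wherever a Graphite/Carbon listener would live:

```
# Send metrics from all instances (master, worker, driver, executor)
# to a Graphite Carbon listener every 10 seconds.
# "graphite.example.com" is a placeholder host.
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
```

But that presumes we want to run Graphite just to relay metrics into Elasticsearch, which is exactly the kind of extra moving part we're trying to avoid.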

There is an Elasticsearch sink named Spelk out in the wild, but it hasn't been maintained in over 4 years, doesn't necessarily format events ideally for Elasticsearch (or ECS), and isn't ILM-aware.

Spark's MetricsServlet exposes basic HTTP endpoints that serve metrics as JSON. That sounds promising when combined with Metricbeat's HTTP module, which formats events well and is ILM-aware, but the endpoints seem fairly dynamic and not something easily expressed in a static metricbeat.yml.
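To illustrate, a single known endpoint is easy enough; here's a minimal sketch using the HTTP module's json metricset, assuming a driver whose UI is reachable at a placeholder host "spark-driver" on the default port 4040, with the MetricsServlet at its default /metrics/json path:

```yaml
metricbeat.modules:
  - module: http
    metricsets: ["json"]
    period: 10s
    # "spark-driver" is a placeholder; the driver UI defaults to port 4040
    hosts: ["http://spark-driver:4040"]
    # Default MetricsServlet path on the driver
    path: "/metrics/json"
    # Events are indexed under this key
    namespace: "spark"
```

The trouble is that each such entry pins one static host and path, while Spark applications (and their drivers and executors) come and go with dynamic addresses, so a hand-maintained list doesn't scale.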

What is a good way to go about this? Is there any hope of a simple Metricbeat module that can do this? How could the basic HTTP module be reliably used in this situation?
