Access es-hadoop stats from Spark

Hi there,

I realize this question is maybe better suited for the Spark user group, but I figured I'd try here first.

It's awesome that elasticsearch-hadoop collects so many metrics in the process of reading from and writing to ES. Is it possible to access any of these counters via Spark? I've done a little Googling on accessing MR counters in Spark, but it doesn't seem to be possible.

Seeing some of these metrics would help greatly with tuning some of our production jobs.

Thanks in advance,
Mike


ES-Hadoop exposes the metrics in MR since that's easy to do. Internally, however, these stats do not depend on MR and could easily be exposed to other runtimes.
In the case of Spark, while it does seem to support some metrics, the mechanism appears to be internal, not really meant to be pluggable - there are no docs around it and there doesn't seem to be a proper UI where one can get access to it (even reading it in the console is not an option, as far as I know).

In other words, if there were a proper, public API for extending the metrics in Spark, ES-Hadoop could expose these stats there as well.

Thanks @costin! I'll leave a note on the Spark mailing list and link to this thread to see if there are future plans to support a metrics framework for Spark.

Great. Let me know how that goes - maybe there's already something in place that can be used.

Cheers,

It sounds like work may already have started on an Elasticsearch Sink for Spark, according to this thread.

It's a bit annoying that creating a new sink requires your code to live in the org.apache.spark package, though.
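For illustration, here's a minimal sketch of that constraint. The class name and the body are mine (a hypothetical Elasticsearch sink, not the one being worked on); the (Properties, MetricRegistry, SecurityManager) constructor shape is what Spark's metrics system instantiates sinks with via reflection:

```scala
// Spark's Sink trait is private[spark], so a custom sink only compiles if it
// is declared inside the org.apache.spark package tree.
package org.apache.spark.metrics.sink

import java.util.Properties

import com.codahale.metrics.MetricRegistry
import org.apache.spark.SecurityManager

// Hypothetical Elasticsearch sink; the bodies are elided, only the shape matters.
class ElasticsearchSink(
    val property: Properties,
    val registry: MetricRegistry,
    securityMgr: SecurityManager)
  extends Sink {

  override def start(): Unit  = { /* schedule a reporter that ships metrics to ES */ }
  override def stop(): Unit   = { /* flush and shut the reporter down */ }
  override def report(): Unit = { /* push the current registry contents to ES */ }
}
```

Enabling it would then presumably just be a line in metrics.properties, e.g. `*.sink.elasticsearch.class=org.apache.spark.metrics.sink.ElasticsearchSink`.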

I'd be happy to help out here, but I'm not exactly sure how or where you'd plug in a new Source.

From what I can tell, the issue is that these Sinks are for exposing Spark's stats to other storage systems (Spark provides some out of the box). In ES-Hadoop's case, from what I understood in this thread, it's more about exposing the ES-Hadoop stats to Spark itself.
That doesn't seem to be supported. In other words, Spark can only expose its own stats; unlike Hadoop, it doesn't let other apps extend them.
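For contrast, this is the Hadoop model I'm referring to - any task can register and increment its own counters through the task context and the framework aggregates them. A sketch, with made-up group/counter names:

```scala
import org.apache.hadoop.mapreduce.TaskInputOutputContext

// Any Hadoop task can extend the job's stats with a one-liner against its
// context; "es-hadoop" / "docs.written" are placeholder names, not the
// actual counters ES-Hadoop registers.
def recordWrite(context: TaskInputOutputContext[_, _, _, _]): Unit =
  context.getCounter("es-hadoop", "docs.written").increment(1)
```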

I think your understanding is correct there. But I realize I misspoke in my earlier comment: what we'd actually be looking to do is create a new Source (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ExecutorSource.scala) for Spark that offers es-hadoop metrics, which could then be written to whatever Sink is configured.
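For reference, a Source boils down to a name plus a Codahale MetricRegistry. A minimal sketch of what such a source could look like - the class name, gauge, and its wiring are all hypothetical, and since the Source trait is private[spark] this would also have to live under org.apache.spark:

```scala
package org.apache.spark.metrics.source

import com.codahale.metrics.{Gauge, MetricRegistry}

// Hypothetical source exposing es-hadoop stats to Spark's metrics system.
class ESHadoopSource extends Source {
  override val sourceName: String = "es-hadoop"
  override val metricRegistry: MetricRegistry = new MetricRegistry

  // Placeholder gauge; a real implementation would read the live
  // es-hadoop stats for the current executor instead of returning 0.
  metricRegistry.register(
    MetricRegistry.name("docsWritten"),
    new Gauge[Long] { override def getValue: Long = 0L })
}
```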

Looking at the existing sources, though, I can't see an easy way to instantiate something like an ESHadoopSource without modifying some part of core Spark. I suppose the custom RDDs you've created in the project could be modified directly to support a new source, but users calling saveAsNewAPIHadoopFile would need modifications to core Spark (I think).
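To be concrete, by "users calling saveAsNewAPIHadoopFile" I mean the MR-style write path, which looks roughly like this (configuration trimmed to the essentials; "index/type" is a placeholder resource):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.rdd.RDD
import org.elasticsearch.hadoop.mr.EsOutputFormat

// Writing through the MR OutputFormat: the counters es-hadoop fills in here
// stay inside the Hadoop machinery, with no way to surface them in Spark.
def writeViaOutputFormat(rdd: RDD[(Object, Object)]): Unit = {
  val conf = new Configuration()
  conf.set("es.resource", "index/type") // placeholder index/type
  rdd.saveAsNewAPIHadoopFile(
    "-",                // the path is not used by EsOutputFormat
    classOf[Object],
    classOf[Object],
    classOf[EsOutputFormat],
    conf)
}
```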

From the Spark user group:

I misunderstood what you were trying to do. I thought you were just looking to create custom metrics rather than looking for the existing Hadoop OutputFormat counters.

I’m not familiar enough with the Hadoop APIs, but I think it would require a change to the SparkHadoopWriter class, since it generates the JobContext that is required to read the counters. It could then publish the counters to the Spark metrics system.

I would suggest going ahead and submitting a JIRA request if there isn’t one already.
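For reference, once something does hand you the completed Hadoop Job, reading the counters off it is straightforward. A sketch (the printed names depend on whatever the job registered):

```scala
import org.apache.hadoop.mapreduce.Job
import scala.collection.JavaConverters._

// Dump every counter group and counter a finished job accumulated.
def dumpCounters(job: Job): Unit =
  for {
    group   <- job.getCounters.asScala
    counter <- group.asScala
  } println(s"${group.getName}.${counter.getName} = ${counter.getValue}")
```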

JIRA submitted, which will hopefully cover autodiscovery of Counters. Not sure if you'll also have to consider adding direct instrumentation to the custom RDD classes, @costin, or if that issue (once solved) would cover things.

Thanks for raising the issue.

In ES-Spark, in particular the RDD/Spark SQL integration, there's no problem calling Spark code to publish our counters. Registering a Source might work, though at first glance this is currently tied to Spark internals - the metrics registry is an active component that seems to be bound to the SparkContext lifecycle, which is far from ideal.

However, I'll poke around to see what can be done - likely this will be based on the Spark 2.0 code to reduce the amount of breakage inside the code base.
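To show what I mean about the lifecycle coupling: the only hook I can see is the MetricsSystem hanging off SparkEnv, whose type is private[spark], so even the registration call would have to be compiled inside the org.apache.spark package tree (ESHadoopSource being the hypothetical source sketched earlier in this thread):

```scala
import org.apache.spark.SparkEnv
import org.apache.spark.metrics.source.ESHadoopSource

// Only valid while the SparkContext (and thus SparkEnv) is alive, and only
// visible to code living in the org.apache.spark package tree.
def registerEsSource(): Unit =
  SparkEnv.get.metricsSystem.registerSource(new ESHadoopSource)
```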

Cheers,


Any workaround or progress on extracting these metrics?
I created a question on Stack Overflow. I don't think anyone will find another way or a workaround, but asking is free: http://stackoverflow.com/questions/43186685/retrieve-metrics-from-elasticsearch-spark