Access es-hadoop stats from Spark


(Mike Sukmanowsky) #1

Hi there,

I realize this question is maybe better suited for the Spark user group, but I figured I'd try here first.

It's awesome that elasticsearch-hadoop collects so many metrics in the process of reading/writing from/to ES. Is it possible to access any of these counters via Spark? I've done a little Googling on accessing MR counters in Spark, but it doesn't seem to be possible.

Seeing these metrics would help greatly with tuning some of our production jobs.
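For context, this is roughly how I'd get at those counters after a plain MapReduce job today - it's this kind of visibility I'm after from Spark. The group filter below is just an assumption about how es-hadoop names its counter group (it registers them through its Counter enum), so treat it as a sketch:

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.mapreduce.Job

// `job` is the org.apache.hadoop.mapreduce.Job that wrote through EsOutputFormat
// and has already completed (e.g. after job.waitForCompletion(true)).
def printEsCounters(job: Job): Unit = {
  val counters = job.getCounters
  counters.getGroupNames.asScala
    .filter(_.toLowerCase.contains("elasticsearch")) // assumption: es-hadoop's counter group name
    .foreach { groupName =>
      counters.getGroup(groupName).asScala.foreach { c =>
        println(s"$groupName / ${c.getName} = ${c.getValue}")
      }
    }
}
```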

Thanks in advance,
Mike


(Costin Leau) #2

ES-Hadoop exposes the metrics in MR since that's easy to do. Internally, however, these stats do not depend on MR and can easily be exposed to other runtimes.
In the case of Spark, while it does seem to support some metrics, they appear to be internal and not really meant to be pluggable - there are no docs around them and there doesn't seem to be a proper UI for accessing them (even reading them in the console isn't an option, as far as I know).

In other words, if there were a proper, public API for extending metrics in Spark, ES-Hadoop could expose its stats there as well.


(Mike Sukmanowsky) #3

Thanks @costin! I'll leave a note on the Spark mailing list and link to this thread to see if there are plans to support a pluggable metrics API in Spark.


(Costin Leau) #4

Great. Let me know how that goes - maybe there's already something in place that can be used.

Cheers,


(Mike Sukmanowsky) #5

It sounds like work may have already started on an Elasticsearch Sink for Spark, according to this thread.

It's a bit annoying that creating a new sink requires you to place it in the org.apache.spark package, though.

I'd be happy to help out here, but I'm not exactly sure how or where you'd plug in a new Source.
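Just to make the package issue concrete, here's roughly what a custom sink ends up looking like. The (Properties, MetricRegistry, SecurityManager) constructor is my assumption based on the built-in sinks (it's what MetricsSystem appears to instantiate reflectively), and EsDebugSink is just a made-up name:

```scala
package org.apache.spark.metrics.sink   // has to live here: the Sink trait is private[spark]

import java.util.Properties
import java.util.concurrent.TimeUnit
import com.codahale.metrics.{ConsoleReporter, MetricRegistry}
import org.apache.spark.SecurityManager

// Minimal sketch of a custom sink that just dumps the registry to the console.
class EsDebugSink(val property: Properties,
                  val registry: MetricRegistry,
                  securityMgr: SecurityManager) extends Sink {

  private val reporter = ConsoleReporter.forRegistry(registry).build()

  override def start(): Unit = reporter.start(10, TimeUnit.SECONDS)
  override def stop(): Unit = reporter.stop()
  override def report(): Unit = reporter.report()
}
```

It would then be wired in through metrics.properties, e.g. `*.sink.esdebug.class=org.apache.spark.metrics.sink.EsDebugSink`.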


(Costin Leau) #6

From what I can tell, the issue is that these Sinks are for exporting Spark's own stats to other storage backends (Spark provides a few out of the box). In ES-Hadoop's case, and from what I understood in this thread, it's more about exposing the ES-Hadoop stats to Spark itself, which doesn't seem to be supported. In other words, Spark can only export its own stats; it doesn't let other apps extend them the way Hadoop does.


(Mike Sukmanowsky) #7

I think your understanding is correct there, but I realize that I misspoke in my earlier comment. What we'd actually be looking to do is create a new Source (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ExecutorSource.scala) for Spark that offers es-hadoop metrics, which could then be written to whichever Sink is configured.

Looking at the existing sources, though, I can't see an easy way to instantiate something like an ESHadoopSource without modifying some part of core Spark. I suppose the custom RDDs you've created in the project could be modified directly to support a new source, but users going through saveAsNewAPIHadoopFile would still need modifications to core Spark (I think).
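Something like this is what I'm picturing - ESHadoopSource is hypothetical, and the registration goes through SparkEnv/MetricsSystem which are private[spark] (hence the package), so this is only a sketch of the idea, not a working integration:

```scala
package org.apache.spark.metrics.source   // Source is private[spark], so the class must live here

import com.codahale.metrics.MetricRegistry

// Hypothetical source exposing es-hadoop stats, modelled on ExecutorSource.
// How the actual es-hadoop counters get wired into this registry is the open question.
class ESHadoopSource extends Source {
  override val sourceName: String = "es-hadoop"
  override val metricRegistry: MetricRegistry = new MetricRegistry

  // es-hadoop would increment these as it reads from / writes to ES
  val docsSent  = metricRegistry.counter(MetricRegistry.name("docs", "sent"))
  val bytesSent = metricRegistry.counter(MetricRegistry.name("bytes", "sent"))
}

// Registration would presumably be something like:
//   SparkEnv.get.metricsSystem.registerSource(new ESHadoopSource)
// which is exactly the part that seems to require touching core Spark.
```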


(Mike Sukmanowsky) #8

From the Spark user group:

I misunderstood what you were trying to do. I thought you were just looking to create custom metrics vs looking for the existing Hadoop Output Format counters.

I'm not familiar enough with the Hadoop APIs, but I think it would require a change to the SparkHadoopWriter class, since it generates the JobContext that is required to read the counters. It could then publish the counters to the Spark metrics system.

I would suggest going ahead and submitting a JIRA request if there isn’t one already.

JIRA submitted, which will hopefully cover auto-discovery of Counters. Not sure if you'll also have to consider adding direct instrumentation to the custom RDD classes, @costin, or if that issue (once solved) would cover things.


(Costin Leau) #9

Thanks for raising the issue.

In ES-Spark, in particular the RDD/Spark SQL integration, there's no problem calling Spark code to publish our counters. Registering a Source might work, though at first glance this is currently tied to Spark internals - the metrics registry is an active component that seems to be bound to the SparkContext lifecycle, which is far from ideal.

However, I'll poke around to see what can be done - most likely it will be based on the Spark 2.0 code to reduce the amount of breakage inside the code base.

Cheers,


(Angel Cervera Claudio) #10

Any workaround or progress to extract metrics?
I created a question on Stack Overflow. I don't think other people have found another way or a workaround, but trying is free: http://stackoverflow.com/questions/43186685/retrieve-metrics-from-elasticsearch-spark
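The only thing I've come up with so far is measuring from the Elasticsearch side instead - diffing the index's own doc count (via the _stats API) before and after the job. It's not the es-hadoop counters, and the host/index below are just placeholders, but it gives at least a rough number:

```scala
import scala.io.Source

// Rough workaround: read the index doc count from Elasticsearch's _stats API
// before and after the Spark job, and diff the two.
def docCount(host: String, index: String): Long = {
  val json = Source.fromURL(s"http://$host:9200/$index/_stats/docs").mkString
  // naive extraction of the first "count" field; a real JSON parser would be better
  "\"count\"\\s*:\\s*(\\d+)".r.findFirstMatchIn(json).map(_.group(1).toLong).getOrElse(0L)
}

val before = docCount("localhost", "myindex")
// ... run the Spark job that writes to ES ...
val after = docCount("localhost", "myindex")
println(s"docs indexed by the job (approx): ${after - before}")
```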

