I realize this is maybe a question better suited for the Spark user group, but I figured I'd try here first.
It's awesome that elasticsearch-hadoop collects so many metrics in the process of reading/writing from/to ES. Is it possible to access any of these counters via Spark? I've done a little Googling on accessing MR counters in Spark, but it doesn't seem to be possible.
Seeing some of these metrics would help greatly with tuning some of our production jobs.
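For context, what I'm after is the kind of thing that's straightforward in a plain MapReduce job once it finishes; a rough sketch (the counter group/name below are placeholders, not necessarily what ES-Hadoop registers):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

val job = Job.getInstance(new Configuration(), "es-write")
// ... configure the job with ES-Hadoop's output format, then:
if (job.waitForCompletion(true)) {
  // placeholder group/counter names -- not necessarily the exact ones ES-Hadoop uses
  val docsWritten = job.getCounters.findCounter("Elasticsearch Hadoop", "Docs Written").getValue
  println(s"docs written: $docsWritten")
}
```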
ES-Hadoop exposes the metrics in MR since that's easy to do. Internally, however, these stats do not depend on MR and can easily be exposed to other runtimes.
In the case of Spark, while it does seem to support some metrics, they appear to be internal and not really meant to be pluggable: there are no docs around them and there doesn't seem to be any proper UI for accessing them (even reading them in the console is not an option, as far as I know).
In other words, if there were a proper, public API for extending the metrics in Spark, ES-Hadoop could expose these stats there as well.
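For reference, the hook Spark does have internally is a small Dropwizard-based Source trait, roughly the shape sketched below. If something along these lines were public, ES-Hadoop could register its stats against it (EsHadoopSource and the metric names here are made up for illustration):

```scala
import com.codahale.metrics.{Gauge, MetricRegistry}

// Roughly the shape of Spark's internal metrics Source trait (not a public API)
trait Source {
  def sourceName: String
  def metricRegistry: MetricRegistry
}

// Hypothetical ES-Hadoop source -- names are illustrative only
class EsHadoopSource extends Source {
  override val sourceName = "es.hadoop"
  override val metricRegistry = new MetricRegistry

  // a gauge would be wired to whatever stat ES-Hadoop tracks internally
  metricRegistry.register(MetricRegistry.name("docs", "written"), new Gauge[Long] {
    override def getValue: Long = 0L // placeholder; would read the real ES-Hadoop stat
  })
}
```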
Thanks @costin! I'll leave a note on the Spark mailing list and link to this thread to see if there are future plans to support a metrics framework for Spark.
From what I can tell, the issue is that these Sinks are for exposing Spark's stats to other storage systems (Spark provides some out of the box). In ES-Hadoop's case, from what I understood in this thread, it's more about exposing the ES-Hadoop stats to Spark itself.
That doesn't seem to be supported. In other words, Spark can expose its own stats but doesn't allow other apps to extend them, the way Hadoop does with counters.
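To make the distinction concrete: the part that is pluggable and documented is the sink side, configured in conf/metrics.properties, which only controls where Spark ships its own metrics, e.g.:

```properties
# conf/metrics.properties -- routes Spark's own metrics to a sink,
# but doesn't let an external library contribute new metrics
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds
```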
From the existing sources though, I can't see an easy way to instantiate something like an ESHadoopSource without modifying some part of core Spark. I suppose the custom RDDs you've created in the project could be modified directly to support a new source, but users using saveAsNewAPIHadoopFile would need modifications to core Spark (I think).
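To be concrete, the saveAsNewAPIHadoopFile path I'm thinking of looks roughly like this (a sketch; the exact key/value types and the "es.resource" target are illustrative), and nothing in it gives a library a place to surface its counters to Spark:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{MapWritable, NullWritable}
import org.apache.spark.rdd.RDD
import org.elasticsearch.hadoop.mr.EsOutputFormat

// Rough sketch: write a pair RDD to ES through the Hadoop output format.
// Any counters EsOutputFormat collects stay inside the Hadoop layer.
def writeToEs(docs: RDD[(NullWritable, MapWritable)]): Unit = {
  val conf = new Configuration()
  conf.set("es.resource", "radio/artists") // illustrative index/type
  docs.saveAsNewAPIHadoopFile(
    "-",                        // path is not used by EsOutputFormat
    classOf[NullWritable],
    classOf[MapWritable],
    classOf[EsOutputFormat],
    conf)
}
```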
I misunderstood what you were trying to do. I thought you were just looking to create custom metrics vs looking for the existing Hadoop Output Format counters.
I’m not familiar enough with the Hadoop APIs, but I think it would require a change to the SparkHadoopWriter class, since it generates the JobContext that is required to read the counters. Then it could publish the counters to the Spark metrics system.
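Roughly the idea: once the output commits, walk the Hadoop Counters and register them as gauges in the Dropwizard registry that Spark's metrics system is built on. This is just a sketch of the concept, not actual Spark code:

```scala
import scala.collection.JavaConverters._

import com.codahale.metrics.{Gauge, MetricRegistry}
import org.apache.hadoop.mapreduce.Counters

// Sketch: publish Hadoop counters into a Dropwizard registry.
// In Spark, the registry would have to come from the (internal) metrics system.
def publishCounters(counters: Counters, registry: MetricRegistry): Unit = {
  for (group <- counters.asScala; counter <- group.asScala) {
    val snapshot = counter.getValue // one-off snapshot; a real impl might read lazily
    registry.register(
      MetricRegistry.name(group.getName, counter.getName),
      new Gauge[Long] { override def getValue: Long = snapshot })
  }
}
```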
I would suggest going ahead and submitting a JIRA request if there isn’t one already.
JIRA submitted, which will hopefully cover autodiscovery of Counters. Not sure if you'll also have to consider adding direct instrumentation to the custom RDD classes, @costin, or if that issue (once solved) would cover things.
In ES-Spark, in particular the RDD/SparkSQL integration, there's no problem calling Spark code to publish our counters. Registering a Source might work, though at first glance this is currently tied to Spark internals: the metrics registry is an active component that appears to be bound to the SparkContext lifecycle, which is far from ideal.
However, I'll poke around to see what can be done; most likely this will be based on the Spark 2.0 code to reduce the amount of breakage inside the code base.
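In the meantime, a stop-gap that doesn't touch the metrics system at all would be to surface the connector's counts through plain named accumulators, which are a public API in Spark 2.0 and show up per stage in the UI. A quick sketch (the accumulator name and per-document counting are purely illustrative):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch: expose a connector-side count via a named accumulator (Spark 2.0 API).
def countWrites(sc: SparkContext, docs: RDD[String]): Unit = {
  val docsWritten = sc.longAccumulator("es.docs.written") // illustrative name
  docs.foreachPartition { partition =>
    partition.foreach { doc =>
      // ... the connector would write the doc to ES here ...
      docsWritten.add(1)
    }
  }
  println(s"docs written: ${docsWritten.value}")
}
```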