Hello,
I opened a pull request.
These metrics will be very useful for monitoring a Spark application that uses the Elasticsearch for Hadoop extension.
What do you think about it?
Thanks for the PR! That seems like a good idea, assuming you're not updating the accumulators inside of a transformation (I haven't looked at the code yet) since that could double-count things. The whole team has been incredibly busy the last few weeks, but we definitely plan on taking a look at this.
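For context, here is a minimal sketch of why accumulator updates inside a transformation can double-count (plain Spark, not the PR's code; the names and numbers are illustrative only):

```scala
import org.apache.spark.sql.SparkSession

object AccumulatorDoubleCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("accumulator-double-count")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val counter = sc.longAccumulator("records")
    val rdd = sc.parallelize(1 to 100)

    // Updated inside a transformation: Spark can re-run this map on task
    // retries, or whenever the (uncached) RDD is recomputed, so the
    // accumulator may be incremented more than once per record.
    val mapped = rdd.map { x => counter.add(1); x * 2 }

    mapped.count() // first action: counter is about 100
    mapped.count() // second action recomputes the map: counter is about 200

    println(s"records counted: ${counter.value}")
    spark.stop()
  }
}
```

Updates made inside an action (e.g. `foreach`) are applied exactly once per task by Spark, which is why the transformation/action distinction matters here.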
Happy to hear that this feature also seems useful to the core team!
Yes, because of Spark's lazy evaluation, Spark task attempts and the retries of the embedded Elasticsearch client mean the counters can be incremented several times for the same batch or Spark task. But this still makes sense for each metric, because what you are monitoring is the client/server interaction: communications can be redundant when the same portion of data is exported/imported more than once, or when the code is executed several times because the RDD/DataFrame/Dataset is not persisted in cache.
The export from Elasticsearch (reads) is a Spark transformation, and the import to Elasticsearch (writes) is a Spark action.
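As a rough sketch of that read/write split (assuming the elasticsearch-spark connector is on the classpath and a local cluster at localhost:9200; the index names are placeholders, not the PR's code):

```scala
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark._

object EsReadWriteLaziness {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("es-read-write")
      .master("local[*]")
      .config("es.nodes", "localhost:9200") // assumed local cluster
      .getOrCreate()
    val sc = spark.sparkContext

    // Reading is a transformation: esRDD is lazy, nothing hits the
    // cluster until an action runs.
    val source = sc.esRDD("source-index")

    // Caching prevents later actions from re-reading Elasticsearch
    // (and re-incrementing any read counters).
    val docs = source.cache()

    println(s"documents read: ${docs.count()}") // the read happens here

    // Writing is an action: saveToEs triggers a job immediately.
    docs.map { case (_, fields) => fields }.saveToEs("target-index")

    spark.stop()
  }
}
```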
I'm eagerly awaiting your code review and feedback so I can adapt and improve the code!