We have a test ES cluster on AWS (the managed service) and would now like to run a Spark job using filtered data from the cluster. I expect the DataFrame filtering to be done on the backend (the so-called pushdown), and I was looking for ways to review the actual queries that are run.
I found a way that requires a log4j.properties file, setting the httpclient.wire.content category to DEBUG, but it feels like the wrong approach since it yields much more logging than I need, and I also don't always see the POST content.
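For reference, this is roughly the line I added (the category comes from the commons-httpclient wire logging; the level is just what I tried):

```
log4j.category.httpclient.wire.content=DEBUG
```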
So my question is: what is the advised way to debug the queries sent to the backend?
org.elasticsearch.hadoop.rest is the category that you need (potentially log4j.category.org.elasticsearch.hadoop.rest.commonshttp to restrict it just to transport).
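For example, in the log4j.properties that your Spark driver/executors pick up, something along these lines (the TRACE level is my assumption; DEBUG also works, just less verbose):

```
# log the queries and HTTP calls made by elasticsearch-hadoop
log4j.category.org.elasticsearch.hadoop.rest=TRACE
# or restrict it to the transport layer only
# log4j.category.org.elasticsearch.hadoop.rest.commonshttp=TRACE
```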
There's no real separation between queries (as in a session) and transport, since the queries are for the most part just part of the HTTP calls. Spark is special due to pushdown, but if that is disabled there is no query generated, just typical HTTP calls.
By the way, in terms of the filtering itself, take a look at the org.elasticsearch.spark.sql package, whose logging indicates which Spark filters are being translated and into what.
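Continuing the same log4j.properties sketch, the category name is simply the package above (the level is again my assumption):

```
# show which Spark SQL filters get translated into ES query DSL
log4j.category.org.elasticsearch.spark.sql=TRACE
```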
Speaking of which (I do realize it is a bit off-topic): is the "pushdown" functionality something specific to ES, or does AWS DynamoDB also offer such performance enhancements?
Not sure what you are asking. Pushdown means that some of the operations executed by Spark SQL are executed directly by ES, resulting in faster execution. This depends on a variety of factors and is highly tied to the implementation (Spark SQL and ES in this case).
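As a rough Scala sketch of what this looks like from the Spark side (the endpoint, index name, and field are made up; pushdown is enabled by default in the elasticsearch-spark connector):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("es-pushdown-demo")
  // hypothetical AWS endpoint; adjust es.nodes/es.port to your cluster
  .config("es.nodes", "my-domain.eu-west-1.es.amazonaws.com")
  .config("es.port", "443")
  .config("es.nodes.wan.only", "true")
  .getOrCreate()

import spark.implicits._

// load an index through the elasticsearch-spark connector
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("pushdown", "true")   // the default, shown here for clarity
  .load("my-index")             // hypothetical index name

// this filter is the kind of operation the connector should translate
// into an ES query instead of filtering on the Spark side; the
// org.elasticsearch.hadoop.rest logging shows the resulting call
df.filter($"status" === "active").show()
```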