I see a big difference in performance of the same query expressed via Spark
SQL and curl.
In curl the query runs in less than a second, while in Spark SQL it takes 15
seconds.
The index/type which I am querying contains 1M documents.
Can you please explain why there is such a big difference in performance?
Are there any ways to tune the performance of Elasticsearch + Spark SQL?
Environment: (everything is running on the same box):
Elasticsearch 1.4.4
elasticsearch-hadoop 2.1.0.BUILD-SNAPSHOT
Spark 1.3.0.
The best way is to use a profiler to understand where the time is spent.
Spark, while significantly faster than Hadoop, cannot compete with curl.
The latter is a simple REST connection; the former spins up a JVM, Scala, Akka and Spark,
which trigger es-hadoop, which makes parallel calls against all the nodes, retrieves the data in JSON format,
converts it into Scala/Java objects and applies a schema on top for Spark SQL to run with.
If you turn on logging, you'll see that in fact there are multiple REST/curl calls made by es-hadoop.
With the JVM/Scala warmed up, you should see less than 15s; however, it depends on how much hardware you have available.
Note that the curl comparison is not really fair - adding a SQL layer on top of that is bound to cost you something.
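For example, to see the individual REST calls es-hadoop makes, you can raise the log level for its REST layer in Spark's log4j configuration. This is a sketch; the logger name below assumes es-hadoop's standard `org.elasticsearch.hadoop.rest` package and the default log4j setup shipped with Spark:

```properties
# log4j.properties fragment - surface the scroll/search requests
# es-hadoop issues against each Elasticsearch node (assumed logger name)
log4j.logger.org.elasticsearch.hadoop.rest=TRACE
```

With this enabled, each scroll round-trip shows up in the driver/executor logs, which makes it easy to see how `es.scroll.size` changes the number of calls.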
On 6/1/15 8:47 PM, Dmitriy Fingerman wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark.sql._ // brings esDF into scope

val sparkConf = new SparkConf().setAppName("Test1")
// increasing scroll size to 5000 from the default 50 improved performance by 2.5 times
sparkConf.set("es.scroll.size", "5000")
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)

// load the index/type as a DataFrame and register it under the name used in the query
val intv = sqlContext.esDF("summary/intervals")
intv.registerTempTable("intervals")
val intv2 = sqlContext.sql("select EventCount, Hour " +
  "from intervals " +
  "where User = 'Robert Greene' " +
  "and DataStore = 'PROD_HK_HR' " +
  "and EventAffectedCount = 56 ")
intv2.show(1000)
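One tuning option worth trying here: the query above pulls all 1M documents into Spark and filters them there. es-hadoop's `es.query` setting lets Elasticsearch apply the filter itself, so only matching documents cross the wire. The snippet below is a sketch, not a tested translation - the bool/match query mirrors the WHERE clause, but you should adjust it to your actual mapping:

```scala
// Sketch: push the filter down to Elasticsearch via es.query,
// so Spark SQL only sees the already-filtered documents.
// Field names are assumed to match the index mapping.
sparkConf.set("es.query", """{"query": {"bool": {"must": [
  {"match": {"User": "Robert Greene"}},
  {"match": {"DataStore": "PROD_HK_HR"}},
  {"match": {"EventAffectedCount": 56}}
]}}}""")
```

With the filter applied at the source, the scroll returns far fewer documents, which typically narrows the gap with the plain curl query considerably.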