Hello everyone,
I am a novice with Spark SQL and I am looking for performance advice. I have the following use case:
I need to join two indices in Elasticsearch (currently approx. 2M docs each) and the results should be sent to a MySQL table. Here is my Spark SQL:
```scala
val t1SQL =
  """
    |SELECT KEY, COUNT(DISTINCT COL_A) AS CNT_A
    |FROM T1
    |GROUP BY KEY
  """.stripMargin

val t2SQL =
  """
    |SELECT KEY, SUM(COL_B) AS SUM_B, MAX(COL_C) AS MAX_C
    |FROM T2
    |GROUP BY KEY
  """.stripMargin

// Aggregates are aliased so the JDBC writer produces valid MySQL column names.
val t1 = sqlContext.sql(t1SQL)
val t2 = sqlContext.sql(t2SQL)

// Inner join on KEY, then append the rows to the MySQL table over JDBC.
val result = t1.join(t2, "KEY")
result.write.mode("append").jdbc(dbUrl, mysqlTable, prop)
```
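For completeness, this is roughly how T1, T2, and the JDBC parameters are set up. This is only a sketch: the index names, ES nodes, URL, and credentials below are placeholders, and it assumes the elasticsearch-hadoop connector is on the classpath (Spark 1.x API):

```scala
import java.util.Properties

// Register the two Elasticsearch indices as temp tables.
// "t1_index/doc" and "t2_index/doc" are placeholder index/type names.
val t1Df = sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "localhost:9200") // placeholder ES nodes
  .load("t1_index/doc")
val t2Df = sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "localhost:9200")
  .load("t2_index/doc")

t1Df.registerTempTable("T1")
t2Df.registerTempTable("T2")

// JDBC target; URL and credentials are placeholders.
val dbUrl = "jdbc:mysql://localhost:3306/mydb"
val mysqlTable = "result_table"
val prop = new Properties()
prop.setProperty("user", "spark")
prop.setProperty("password", "***")
prop.setProperty("driver", "com.mysql.jdbc.Driver")
```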
Unfortunately, the execution is supposed to be quite fast, but currently the whole job takes about 26 seconds on a single node. Does anyone have advice on how I can improve performance? I don't see any good spot where caching might help.
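For illustration, the only candidates I can see for caching are the two aggregated frames before the join, but since each is consumed exactly once, I would not expect it to help:

```scala
// Caching the aggregated frames before the join. Since t1 and t2 are each
// read only once downstream, this likely just adds memory pressure.
val t1 = sqlContext.sql(t1SQL).cache()
val t2 = sqlContext.sql(t2SQL).cache()
val result = t1.join(t2, "KEY")
```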
Currently I am measuring on a single node with 10 partitions, but in production I expect many more docs (not just two million, but dozens of millions) and will be running on 3-5 nodes.
Thanks in advance
- Jan