Spark SQL advices for performance


#1

Hello everyone,

I am a novice in Spark SQL and I am looking for advice regarding performance. I have following use case:
I need to join two indices in Elasticsearch (currently both have approx. 2M docs) and results should be send to MySQL table. Here is my Spark SQL:

`val t1SQL=
"""
|SELECT KEY, COUNT(DISTINCT COL_A)
|FROM T1
|GROUP BY KEY
""".stripMargin
val t2SQL =
"""
|SELECT KEY, SUM(COL_B), MAX(COL_C)
|FROM T2
|GROUP BY KEY
""".stripMargin
val t1 = sqlContext.sql(t1SQL)
val t2 = sqlContext.sql(t2SQL)
val result = t1.join(t2, "KEY")

result.write.mode("append").jdbc(dbUrl, mysqlTable, prop)
`

Unfortunately the execution is suppose to be quite fast, currently I am loading the data on single node in 26 secs. Is there any advice how I can improve performance? I don't see any good spot where caching might help.

Currently I am measuring on single node with 10 partitions, but in production I expect mode of docs (not just two, but dozens) and working on 3-5 nodes.

Thanks in advance

  • Jan

(Costin Leau) #2
  1. Check out your plan, enable logging and see whether there's a way to improve the query.
    The issue with JOINs is that Spark has to do it manually as it does not push down this information and as such, one ends up streaming all the data from ES to Spark to do the joins.

  2. Make sure to use the latest ES-Hadoop and Spark.

  3. All the advices that apply to ES (good hardware, plenty of RAM for the OS, etc...) apply here as well.

  4. A single node with 10 partitions is the same as 1 node with one partition - parallelism makes sense when you have multiple nodes to take advantage of them, otherwise is for naught.

  5. Caching makes sense only if you keep re-reading the data. Otherwise is a waste of RAM.


#3

Thank you very much @costin. I really appreciate your work you're doing on ES-Spark.

  • Jan

(system) #4