Spark SQL advice for performance

kucera.jan.cz · March 25, 2016, 9:24pm

Hello everyone,

I am a novice in Spark SQL and I am looking for advice regarding performance. I have the following use case:
I need to join two indices in Elasticsearch (currently both have approx. 2M docs) and the results should be sent to a MySQL table. Here is my Spark SQL:

result.write.mode("append").jdbc(dbUrl, mysqlTable, prop)

Unfortunately, the execution is supposed to be quite fast, and currently loading the data on a single node takes 26 secs. Is there any advice on how I can improve performance? I don't see any good spot where caching might help.

Currently I am measuring on a single node with 10 partitions, but in production I expect more docs (not just two, but dozens) and to be running on 3-5 nodes.

Thanks in advance

Jan

costin · April 5, 2016, 2:45pm

Check out your plan, enable logging and see whether there's a way to improve the query.
The issue with JOINs is that Spark has to perform them itself, since joins are not pushed down to ES; as a result, one ends up streaming all the data from ES to Spark to do the joins.
Make sure to use the latest ES-Hadoop and Spark.
All the advice that applies to ES (good hardware, plenty of RAM for the OS, etc.) applies here as well.
A single node with 10 partitions is effectively the same as one node with one partition: parallelism makes sense only when you have multiple nodes to take advantage of it; otherwise it is for naught.
Caching makes sense only if you keep re-reading the data. Otherwise it is a waste of RAM.
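
For illustration only (this is not from the original reply): a minimal Scala sketch of checking the plan for a join across two ES indices. The index names (`orders`, `customers`), the fields (`customer_id`, `year`, `total`, `name`), and the ES address are all hypothetical. It assumes a Spark 2.x or later `SparkSession` and the ES-Hadoop Spark SQL data source on the classpath. Since the join itself runs in Spark, one thing that may help is narrowing each side with filters and column selection before joining, because those are operations the connector can push down to ES.

```scala
import org.apache.spark.sql.SparkSession

object EsJoinPlan {
  def main(args: Array[String]): Unit = {
    // Assumes the job is launched with spark-submit, which supplies the master.
    val spark = SparkSession.builder().appName("es-join-plan").getOrCreate()

    // Hypothetical ES address; replace with your own cluster.
    val esNodes = "localhost:9200"

    // Hypothetical index names, loaded through the ES-Hadoop data source.
    val orders = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", esNodes)
      .load("orders")
    val customers = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", esNodes)
      .load("customers")

    // Narrow each side first: a filter and a column selection on a single
    // index can be pushed down to ES, so less data is streamed to Spark.
    val recentOrders = orders
      .filter(orders("year") === 2016)
      .select("customer_id", "total")
    val customerNames = customers.select("customer_id", "name")

    // The join itself is executed by Spark, not by ES.
    val result = recentOrders.join(customerNames, "customer_id")

    // Print the logical and physical plans to see what each scan reads
    // and which join strategy Spark picked.
    result.explain(true)

    spark.stop()
  }
}
```

Whether this helps depends on how selective the filters are; if the query really needs every document from both indices, the full transfer from ES to Spark cannot be avoided.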

kucera.jan.cz · April 7, 2016, 1:00pm

Thank you very much @costin. I really appreciate the work you're doing on ES-Spark.

Jan
