Reading from Elasticsearch index using spark ( es-hadoop ) connectors

Hi Team,

I am trying to read an index from Elasticsearch with Spark using the es-hadoop connector, with the code below.
The index contains millions of records and the read is very slow.
Is there anything I can do to speed it up?

es_df = (spark.read
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "***")
    .load("index-name"))  # index name not shown in the original post

Hi @Khushboo_Kaul. The answer is almost certainly "yes". I'd start by checking the number of executors Spark is configured to use. You want as much parallelism as your cluster can handle, and you might have to experiment a little to find the best number. Aim for at least one executor per Elasticsearch data node (and you can probably handle a good bit more than that).
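As a rough sketch of the executor tuning above, assuming PySpark with the es-hadoop jar on the classpath (the instance/core counts here are placeholders to experiment with, not recommendations for your cluster):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("es-read")
    # At least one executor per ES data node is a reasonable floor;
    # these numbers are illustrative and should be tuned per cluster.
    .config("spark.executor.instances", "6")
    .config("spark.executor.cores", "4")
    .getOrCreate())
```

The same settings can also be passed on the command line via `spark-submit --conf spark.executor.instances=6`.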
Another thing I'd check is es.scroll.size. This is the number of documents pulled back with every request (Configuration | Elasticsearch for Apache Hadoop [8.0] | Elastic). You didn't mention which version you are using, but the default was 50 until 8.0. We changed it to 1000 because 50 is too low for a lot of use cases. Setting es.scroll.size higher will use more memory, but that's probably an acceptable tradeoff. If you are using an es-hadoop version earlier than 8.0, try setting it to 1000.
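Concretely, the scroll size is passed as a read option alongside es.nodes. A sketch based on the snippet from the question (the index name is a placeholder, since it wasn't shown in the original post):

```python
es_df = (spark.read
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "***")
    # Fetch 1000 documents per scroll request instead of the
    # pre-8.0 default of 50; higher values trade memory for speed.
    .option("es.scroll.size", "1000")
    .load("index-name"))  # placeholder index name
```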

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.