Reading from an Elasticsearch index using Spark (es-hadoop) connector

Hi Team,

I am trying to read an index from Elasticsearch with Spark using the es-hadoop connector, with the code below.
The index contains millions of records and the read is very slow.
Is there anything I can do to speed it up?

es_df = (
    spark.read.format("org.elasticsearch.spark.sql")  # lowercase "elasticsearch": the class name is case sensitive
    .option("es.nodes.wan.only", "true")
    .option("es.port", port)
    .option("es.net.ssl", "false")
    .option("es.nodes", "***")
    .load(f"{final_resource}")
    .select("col1", "col2")
)

Hi @Khushboo_Kaul. The answer is almost certainly "yes". I'd start by checking the number of executors Spark is configured to use. You want as much parallelism as your cluster can handle, and you may have to experiment a little to find the best number. Aim for at least one executor per Elasticsearch data node (and you can probably handle a good bit more than that); a sketch follows below.
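
For example, here is a minimal sketch of requesting more parallelism when building the session. The instance, core, and memory values are placeholders, not recommendations; tune them to your cluster:

from pyspark.sql import SparkSession

# Placeholder values: tune to your cluster. Aim for at least one
# executor per Elasticsearch data node, likely more.
spark = (
    SparkSession.builder
    .appName("es-read")
    .config("spark.executor.instances", "8")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)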
Another thing I'd check is es.scroll.size, the number of documents pulled back with each request (Configuration | Elasticsearch for Apache Hadoop [8.0] | Elastic). You didn't mention which version you are using, but the default was 50 until 8.0, when we changed it to 1000 because 50 is too low for a lot of use cases. Setting es.scroll.size higher will use more memory, but that's probably an acceptable tradeoff. If you are on an es-hadoop version earlier than 8.0, try setting it to 1000, for example:
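
es_df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .option("es.nodes.wan.only", "true")
    .option("es.port", port)
    .option("es.net.ssl", "false")
    .option("es.nodes", "***")
    .option("es.scroll.size", "1000")  # docs per scroll request; default was 50 before 8.0
    .load(f"{final_resource}")
    .select("col1", "col2")
)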
