Slow Performance of Elasticsearch with Spark

I am trying to read data from Elasticsearch into a DataFrame using the ES-Spark connector from Java. When I run a query on the DataFrame, for example count(), the performance is dismal: for 3 GB of data it takes around 6 minutes. If I instead save the data to Hadoop/HDFS and read it from there, the same query takes around 3 seconds. Can someone suggest a workaround? The code I am using is below.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Build the session directly; a separate JavaSparkContext and SQLContext are not needed.
SparkSession spark = SparkSession.builder()
        .appName("Simple App")
        .master("local[*]")
        .config("es.index.auto.create", "true")
        .getOrCreate();

// Load the whole index as a DataFrame through the ES-Spark connector.
Dataset<Row> ds = spark.read().format("org.elasticsearch.spark.sql").load("index_name");
long count = ds.count();   // this takes around 6 mins for 3GB data
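One thing that may be worth trying is to narrow what the connector has to scroll through. The sketch below only assembles a map of connector read settings (`es.query`, `es.read.field.include`, `es.scroll.size`); the helper class `EsReadOptions`, the field names `id` and `status`, and the values used are hypothetical and not from the original post. Whether this helps a plain count() on a full index is not guaranteed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EsReadOptions {
    // Hypothetical helper: collects ES-Spark connector read settings into a map.
    // The query, field list, and scroll size are illustrative values only.
    public static Map<String, String> build(String query, String fields, int scrollSize) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("es.query", query);                               // restrict the documents read
        opts.put("es.read.field.include", fields);                 // restrict the fields read
        opts.put("es.scroll.size", Integer.toString(scrollSize));  // documents per scroll request
        return opts;
    }

    public static void main(String[] args) {
        Map<String, String> opts = build("?q=status:active", "id,status", 5000);
        opts.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The resulting map could then be passed to the reader with spark.read().format("org.elasticsearch.spark.sql").options(opts).load("index_name").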

I am also having performance problems: the same Spark query run against the raw JSON files is 12-15 times faster than the same query via ES.
I asked about this in the GitHub issues, but they told me to ask here. In any case, we need a way to inspect the pushed-down query to understand what's going on.


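And even if you do this (count() itself returns a number, so the plan has to be requested on the aggregated DataFrame):

ds.groupBy().count().explain(extended=True)

the output does not show you the ES queries.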

I have another discussion open here about monitoring queries in the backend, which seems to be the only solution for now.

Please read this too: what is essentially happening is that the ES driver is not that smart and just does a full scroll of the data instead of leveraging ES capabilities.
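For a plain document count, one way to lean on Elasticsearch itself rather than scrolling everything into Spark is to ask its count endpoint (`GET /<index>/_count`). The sketch below only builds such a request with the JDK HTTP client and does not send it; the class name `EsCountRequest` and the address `http://localhost:9200` are assumptions, not details from this thread.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class EsCountRequest {
    // Hypothetical helper: builds a GET request against the index's _count endpoint.
    // baseUrl is an assumed cluster address; no network call is made here.
    public static HttpRequest build(String baseUrl, String index) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_count"))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = build("http://localhost:9200", "index_name");
        System.out.println(req.method() + " " + req.uri());
    }
}
```

Sending the request with java.net.http.HttpClient would return a small JSON body containing the count, so nothing has to be transferred per document.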

