I am trying to read data from Elasticsearch into a DataFrame using the Java ES-Spark connector. When I execute a query such as count() on the DataFrame, the performance is dismal: for ~3 GB of data it takes around 6 minutes. On the other hand, if I save the data to Hadoop/HDFS and then read it from there, the same count takes around 3 seconds. Can someone suggest a workaround for this? The code I am using is below.
SparkConf conf = new SparkConf().setAppName("Simple App").setMaster("local[*]");
conf.set("es.index.auto.create", "true");

// Build the session directly; the separate JavaSparkContext and SQLContext
// were unused, so they are dropped here.
SparkSession spark = SparkSession.builder().config(conf).getOrCreate();

Dataset<Row> ds = spark.read().format("org.elasticsearch.spark.sql").load("index_name");
long count = ds.count(); // this takes around 6 minutes for ~3 GB of data
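For reference, the HDFS round-trip I compared against looks roughly like this. This is only a sketch: the output path is a placeholder, and I chose Parquet as the on-disk format (any HDFS-backed columnar format should behave similarly, since counts can then be answered from local file metadata instead of scrolling every document out of Elasticsearch).

```java
// One-time export: every document is pulled from Elasticsearch here,
// so this step is as slow as the original count().
ds.write().mode("overwrite").parquet("hdfs:///tmp/index_name_copy");

// Subsequent queries hit the HDFS copy instead of Elasticsearch.
Dataset<Row> local = spark.read().parquet("hdfs:///tmp/index_name_copy");
long fastCount = local.count(); // around 3 seconds in my tests
```

The trade-off is that the HDFS copy is a snapshot and goes stale as the index changes, so this only helps for repeated queries over relatively static data.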