I am trying to read data from Elasticsearch into a DataFrame using the Java ES-Spark connector. When I execute a query such as count() on the DataFrame, performance is dismal: for about 3 GB of data it takes around 6 minutes. If I instead save the data to Hadoop/HDFS and read it from there, the same query takes around 3 seconds. Can someone suggest a workaround? The code I am using is below.
SparkConf conf = new SparkConf().setAppName("Simple App").setMaster("local[*]");
conf.set("es.index.auto.create", "true");
SparkSession spark = SparkSession.builder().config(conf).getOrCreate();

// Read the index through the ES-Spark connector
Dataset<Row> ds = spark.read().format("org.elasticsearch.spark.sql").load("index_name");
long count = ds.count(); // takes around 6 minutes for 3 GB of data
I am also having performance problems: the same Spark query run against the raw JSON files is 12-15 times faster than the same query via ES.
I asked about this on the GitHub issues, but they told me to ask here. In any case, we need to find a way to inspect the pushed-down query to understand what's going on. Please read this too, because what is essentially happening is that the ES driver is not that smart: it just does a full scroll of the data instead of leveraging ES's capabilities.
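A few connector settings are worth trying before concluding the scroll is the bottleneck. The sketch below is based on option names from the elasticsearch-hadoop configuration docs (`pushdown`, `es.read.field.include`, `es.scroll.size`); the field list and index name are placeholders, and the values are illustrative tuning guesses, not verified fixes:

```java
// Sketch: trim what the connector pulls from Elasticsearch.
// "pushdown" asks Spark SQL filters to be translated into ES query DSL
// (it defaults to true in recent es-hadoop versions);
// es.read.field.include limits which _source fields are fetched per document;
// es.scroll.size uses larger scroll batches, i.e. fewer round trips.
Dataset<Row> ds = spark.read()
    .format("org.elasticsearch.spark.sql")
    .option("pushdown", "true")
    .option("es.read.field.include", "id,price") // hypothetical field list
    .option("es.scroll.size", "10000")
    .load("index_name");

// explain() prints the physical plan; if pushdown is working you should
// see the filter appear as PushedFilters on the Elasticsearch relation.
ds.filter("price > 100").explain();
```

Comparing the plan with and without the filter should tell you whether the connector is building an ES query or falling back to a full scan.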