Hi
I have a simple application that reads IDs from ES and returns the number of distinct IDs using Spark.
My configuration is:
- ES 2.2 (cluster): 2 client, 3 master, and 4 data nodes (16 GB RAM, 8 GB heap each)
- elasticsearch-hadoop 2.2.0
- Spark 1.6.1 (cluster) with Scala 2.10.5: 1 master (2 CPUs @ 2.90 GHz, 4 GB RAM) and 3 workers (2 CPUs @ 2.40 GHz, 8 GB RAM each)
I am getting performance issues with Spark: the job takes about the same time (roughly 1 minute) whether I run it with 1, 2, or 3 workers. How can I improve the read performance from ES and make proper use of all the workers? (A sketch of what I plan to try is below, after the code.)
ES and Spark are on different servers. Would it be better to install Spark on the ES servers to limit data transfer?
Scala code:
import org.apache.spark._
import org.elasticsearch.spark._

object ElasticsearchTest {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf()
      .setAppName("elasticTest")
      .set("es.nodes", "172.48.0.182")
      .set("spark.executor.memory", "6G")
      .set("spark.executor.cores", "2")
      .set("spark.driver.cores", "2")
      .set("spark.cores.max", "6")
    val sc = new SparkContext(sparkConf)

    // Fetch only the obj_id field for documents dated 2015-01-01 to 2015-06-30
    // (June has 30 days).
    val query = "{\"fields\":[\"obj_id\"],\"query\":{\"bool\":{\"filter\":{\"bool\":{\"must\":[{\"range\":{\"date\":{\"gte\":\"20150101\",\"lte\":\"20150630\"}}}]}}}}}"
    val rdd = sc.esRDD("myindex/mytype", query)

    // Extract obj_id from each document, then count the distinct values.
    val countVal = rdd.map { case (key, value) => value("obj_id") }
    val docCount = countVal.distinct().count()
    println("Count of distinct objects with query: %s".format(docCount))

    sc.stop()
  }
}
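
For reference, here is the kind of tuning I am planning to try, assuming the read parallelism is capped by the index's shard count (esRDD creates one Spark partition per ES shard). The scroll size and partition count below are guesses on my part, not measured values:

    // Sketch only; values are guesses. Set before creating the SparkContext:
    // fetch more documents per scroll request so the Spark tasks make
    // fewer round trips to ES.
    sparkConf.set("es.scroll.size", "1000")

    // Then, after the read: esRDD gives one partition per ES shard, so with
    // few shards only a few tasks run; repartitioning spreads the
    // distinct/count work over all 6 cores (3 workers x 2 cores each).
    val repartitioned = sc.esRDD("myindex/mytype", query).repartition(6)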
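
If an exact figure is not strictly required, I am also wondering whether Spark's HyperLogLog-based countApproxDistinct would be cheaper than distinct().count(), since it avoids the full shuffle. This is just an idea I have not benchmarked:

    // Approximate distinct count via HyperLogLog; the argument is the
    // target relative standard deviation. Skips the shuffle that
    // distinct() triggers, at the cost of exactness.
    val approxCount = countVal.countApproxDistinct(0.05)
    println("Approximate count of distinct objects: %s".format(approxCount))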