Our Spark Streaming use case is to read streaming data from Kafka and join it with an index from Elasticsearch.
Another Spark Streaming job updates the Elasticsearch index at a fixed interval, which means the index data is not static.
Platform details are as follows:
Spark 1.6.2 standalone cluster with 15 nodes, 90 cores, and 500 GB of memory.
Elasticsearch 2.4 with 10 data nodes. The index holds about 1 billion documents (about 250 GB) in 20 shards.
Our interval is 1 hour, and the job takes 50 minutes to read the whole index from Elasticsearch.
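
To put these numbers in perspective, here is a rough back-of-the-envelope calculation of the read throughput they imply. It is only a sketch: it assumes "250G" means 250 GB (decimal) and that the read is spread evenly across the 20 shards and the 90 cores, which may not hold in practice. The function name `scan_throughput` is just illustrative.

```python
# Rough throughput implied by reading the whole index in 50 minutes.
# Assumptions: 250 GB is decimal (1 GB = 1000 MB), and the work is
# spread evenly over all shards and all Spark cores.

def scan_throughput(docs, size_gb, minutes, shards, cores):
    """Return average read rates for a full index scan."""
    seconds = minutes * 60
    docs_per_sec = docs / seconds
    return {
        "docs_per_sec": docs_per_sec,
        "mb_per_sec": size_gb * 1000 / seconds,
        "docs_per_sec_per_shard": docs_per_sec / shards,
        "docs_per_sec_per_core": docs_per_sec / cores,
    }

stats = scan_throughput(
    docs=1_000_000_000,  # about 1 billion documents
    size_gb=250,         # about 250 GB of index data
    minutes=50,          # time for one full read
    shards=20,
    cores=90,
)

for name, value in stats.items():
    print(f"{name}: {value:,.0f}")
```

Under those assumptions the job reads roughly 333,000 documents per second overall (about 83 MB/s), which is about 16,700 documents per second per shard.
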
Is there any way to improve the performance of the job?