RDD saveToEs performance

Hi there,

With ES 2.2, Spark 1.6, and Scala 2.10, saveToEs performance is around 20 documents/second on a MacBook Pro. Each document is less than 1 KB. Is this expected? I am using the latest 2.2 elasticsearch-spark connector.
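
For reference, the write itself is essentially just saveToEs on an RDD. Here is a minimal sketch of the kind of job being benchmarked (host, index name and document shape are illustrative, not the exact code):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs to RDDs

object SaveToEsBench {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("saveToEs-bench")
      .setMaster("local[*]")
      .set("es.nodes", "localhost:9200") // single local ES 2.2 node

    val sc = new SparkContext(conf)

    // 10k small documents (< 1 KB each) written to one index/type
    val docs = sc.parallelize(1 to 10000)
      .map(i => Map("id" -> i, "body" -> s"message $i"))
    docs.saveToEs("spark/docs")

    sc.stop()
  }
}
```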

Here are the logs from processing 10k messages:
16/02/27 17:11:30 INFO SparkContext: Starting job: runJob at EsSpark.scala:67
16/02/27 17:11:30 INFO DAGScheduler: Got job 0 (runJob at EsSpark.scala:67) with 1 output partitions
16/02/27 17:11:30 INFO DAGScheduler: Final stage: ResultStage 0 (runJob at EsSpark.scala:67)
16/02/27 17:14:28 INFO DAGScheduler: ResultStage 0 (runJob at EsSpark.scala:67) finished in 178.170 s
16/02/27 17:14:28 INFO DAGScheduler: Job 0 finished: runJob at EsSpark.scala:67, took 178.347390 s
16/02/27 17:14:29 INFO SparkContext: Starting job: runJob at EsSpark.scala:67
16/02/27 17:14:29 INFO DAGScheduler: Got job 1 (runJob at EsSpark.scala:67) with 1 output partitions
16/02/27 17:14:29 INFO DAGScheduler: Final stage: ResultStage 1 (runJob at EsSpark.scala:67)
16/02/27 17:17:24 INFO DAGScheduler: ResultStage 1 (runJob at EsSpark.scala:67) finished in 175.553 s
16/02/27 17:17:24 INFO DAGScheduler: Job 1 finished: runJob at EsSpark.scala:67, took 175.585507 s
16/02/27 17:17:24 INFO SparkContext: Starting job: runJob at EsSpark.scala:67
16/02/27 17:17:24 INFO DAGScheduler: Got job 2 (runJob at EsSpark.scala:67) with 1 output partitions
16/02/27 17:17:24 INFO DAGScheduler: Final stage: ResultStage 2 (runJob at EsSpark.scala:67)
16/02/27 17:20:18 INFO DAGScheduler: ResultStage 2 (runJob at EsSpark.scala:67) finished in 174.285 s
16/02/27 17:20:18 INFO DAGScheduler: Job 2 finished: runJob at EsSpark.scala:67, took 174.286503 s

Thanks,

Hello aaskey,

20 docs/sec is definitely not the expected throughput. I was recently prototyping bulk insertion of fairly large documents and was able to index 2k docs/sec on a single node.

Nevertheless, I am wondering whether someone here can suggest best practices (beyond the official docs) for increasing the insertion rate into ES.

I have read an interesting article about performance with spark-kafka which suggested client caching, and I am wondering how much overhead RestService.createWriter adds by creating a client for every call.
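
The gist of that pattern, as I understood it, is one expensive client per executor JVM, created lazily and reused across tasks instead of per record. A rough sketch of the idea (the client class here is a stand-in, not an ES-Hadoop or Kafka class):

```scala
object ClientCache {
  // Stand-in for an expensive-to-create client (e.g. a REST/bulk client).
  class ExpensiveClient {
    def send(doc: String): Unit = println(doc) // a real client would batch/bulk here
  }

  // A lazy val inside an object is initialized at most once per JVM,
  // i.e. once per Spark executor, and then reused by every task running on it.
  lazy val client: ExpensiveClient = new ExpensiveClient
}

// Usage inside an action: one lookup per partition rather than per record.
// rdd.foreachPartition { docs =>
//   val c = ClientCache.client
//   docs.foreach(c.send)
// }
```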

Hi Kucera,

Is there any example code you can share, or have you done anything beyond what the official documentation suggests?

On the Spark side I didn't do anything special. Elasticsearch itself was tweaked a little from the default configuration to support higher indexing throughput.
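
The two index-level settings that typically matter most for bulk loads are the refresh interval and the replica count during the load. A rough sketch of toggling them over the settings API (index name and values are just an example, not necessarily what I changed):

```scala
import java.net.{HttpURLConnection, URL}

object IndexSettings {
  // PUT a settings body to the index's _settings endpoint and return the HTTP status.
  def putSettings(index: String, json: String): Int = {
    val conn = new URL(s"http://localhost:9200/$index/_settings")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("PUT")
    conn.setDoOutput(true)
    val out = conn.getOutputStream
    out.write(json.getBytes("UTF-8"))
    out.close()
    conn.getResponseCode // 200 on success
  }

  def main(args: Array[String]): Unit = {
    // Before the load: refresh less often, skip replica syncing.
    putSettings("spark", """{"index": {"refresh_interval": "30s", "number_of_replicas": 0}}""")
    // ... run the Spark job ...
    // After the load: restore the defaults.
    putSettings("spark", """{"index": {"refresh_interval": "1s", "number_of_replicas": 1}}""")
  }
}
```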

Regarding sharing Scala code - when I was starting with spark-elastic, this GitHub repo was really useful.

-Jan

Thanks Jan. I found out that the low performance was caused by other operations. Without those operations, I am able to achieve similar (2k/s) write performance to ES. Thanks again.

@aaskey Adding some monitoring on the ES side (such as Marvel) helps in figuring out whether ES itself is overloaded or whether the OS is under pressure.
Always keep an eye on CPU/IO/Mem.

Glad to hear things are back to normal.

@kucera.jan.cz RestService.createWriter is used once per client, not once for every call - at least within ES-Hadoop itself. If users keep creating jobs on every call then yes, the partition discovery and execution will happen quite often; however, at that point Spark itself and the rest of the components will likely already be adding significant overhead.
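
If you do want to experiment on the connector side, the main knobs are the number of partitions (roughly one bulk writer per partition/task) and the es.batch.* bulk settings. A sketch, with example values only:

```scala
import org.apache.spark.rdd.RDD
import org.elasticsearch.spark._

def indexTuned(docs: RDD[Map[String, Any]]): Unit = {
  docs
    .repartition(4) // 4 partitions -> 4 concurrent writers against the cluster
    .saveToEs("spark/docs", Map(
      "es.batch.size.entries"  -> "5000",  // docs per bulk request (default 1000)
      "es.batch.size.bytes"    -> "5mb",   // bytes per bulk request (default 1mb)
      "es.batch.write.refresh" -> "false"  // skip the index refresh after each bulk write
    ))
}
```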