RDD saveToEs performance

Hi there,

With ES 2.2, Spark 1.6, and Scala 2.10, saveToEs performance is around 20 documents/second on a MacBook Pro. Each document is less than 1 KB. Is this expected? I am using the latest 2.2 elasticsearch-spark connector.
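
For reference, the write itself is essentially just saveToEs on an RDD. Here is a minimal sketch of the kind of job being benchmarked (host, index name and document shape are illustrative, not the exact code):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs to RDDs

object SaveToEsBench {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("saveToEs-bench")
      .setMaster("local[*]")
      .set("es.nodes", "localhost:9200") // single local ES 2.2 node

    val sc = new SparkContext(conf)

    // 10k small documents (< 1 KB each) written to one index/type
    val docs = sc.parallelize(1 to 10000)
      .map(i => Map("id" -> i, "body" -> s"message $i"))
    docs.saveToEs("spark/docs")

    sc.stop()
  }
}
```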

Here are the logs from processing 10k messages:
16/02/27 17:11:30 INFO SparkContext: Starting job: runJob at EsSpark.scala:67
16/02/27 17:11:30 INFO DAGScheduler: Got job 0 (runJob at EsSpark.scala:67) with 1 output partitions
16/02/27 17:11:30 INFO DAGScheduler: Final stage: ResultStage 0 (runJob at EsSpark.scala:67)
16/02/27 17:14:28 INFO DAGScheduler: ResultStage 0 (runJob at EsSpark.scala:67) finished in 178.170 s
16/02/27 17:14:28 INFO DAGScheduler: Job 0 finished: runJob at EsSpark.scala:67, took 178.347390 s
16/02/27 17:14:29 INFO SparkContext: Starting job: runJob at EsSpark.scala:67
16/02/27 17:14:29 INFO DAGScheduler: Got job 1 (runJob at EsSpark.scala:67) with 1 output partitions
16/02/27 17:14:29 INFO DAGScheduler: Final stage: ResultStage 1 (runJob at EsSpark.scala:67)
16/02/27 17:17:24 INFO DAGScheduler: ResultStage 1 (runJob at EsSpark.scala:67) finished in 175.553 s
16/02/27 17:17:24 INFO DAGScheduler: Job 1 finished: runJob at EsSpark.scala:67, took 175.585507 s
16/02/27 17:17:24 INFO SparkContext: Starting job: runJob at EsSpark.scala:67
16/02/27 17:17:24 INFO DAGScheduler: Got job 2 (runJob at EsSpark.scala:67) with 1 output partitions
16/02/27 17:17:24 INFO DAGScheduler: Final stage: ResultStage 2 (runJob at EsSpark.scala:67)
16/02/27 17:20:18 INFO DAGScheduler: ResultStage 2 (runJob at EsSpark.scala:67) finished in 174.285 s
16/02/27 17:20:18 INFO DAGScheduler: Job 2 finished: runJob at EsSpark.scala:67, took 174.286503 s

Thanks,

Hello aaskey,

20 docs/sec is definitely not the expected throughput. I was recently prototyping bulk insertion of fairly large documents and was able to index 2k docs/sec on a single node.

Nevertheless, I am wondering whether someone here can suggest best practices (beyond the official docs) for increasing the insertion rate into ES.

I have read an interesting article about performance with spark-kafka which suggested client caching, and I am wondering how much overhead RestService.createWriter adds by creating a client for every call.
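
The gist of that pattern, as I understood it, is one expensive client per executor JVM, created lazily and reused across tasks instead of per record. A rough sketch of the idea (the client class here is a stand-in, not an ES-Hadoop or Kafka class):

```scala
object ClientCache {
  // Stand-in for an expensive-to-create client (e.g. a REST/bulk client).
  class ExpensiveClient {
    def send(doc: String): Unit = println(doc) // a real client would batch/bulk here
  }

  // A lazy val inside an object is initialized at most once per JVM,
  // i.e. once per Spark executor, and then reused by every task running on it.
  lazy val client: ExpensiveClient = new ExpensiveClient
}

// Usage inside an action: one lookup per partition rather than per record.
// rdd.foreachPartition { docs =>
//   val c = ClientCache.client
//   docs.foreach(c.send)
// }
```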

Hi Kucera,

Is there any example code you can share, or have you done anything beyond what the official documentation suggests?

On the Spark side I didn't do anything special. Elasticsearch itself was tweaked a little from the default configuration to support higher indexing throughput.
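
The two index-level settings that typically matter most for bulk loads are the refresh interval and the replica count during the load. A rough sketch of toggling them over the settings API (index name and values are just an example, not necessarily what I changed):

```scala
import java.net.{HttpURLConnection, URL}

object IndexSettings {
  // PUT a settings body to the index's _settings endpoint and return the HTTP status.
  def putSettings(index: String, json: String): Int = {
    val conn = new URL(s"http://localhost:9200/$index/_settings")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("PUT")
    conn.setDoOutput(true)
    val out = conn.getOutputStream
    out.write(json.getBytes("UTF-8"))
    out.close()
    conn.getResponseCode // 200 on success
  }

  def main(args: Array[String]): Unit = {
    // Before the load: refresh less often, skip replica syncing.
    putSettings("spark", """{"index": {"refresh_interval": "30s", "number_of_replicas": 0}}""")
    // ... run the Spark job ...
    // After the load: restore the defaults.
    putSettings("spark", """{"index": {"refresh_interval": "1s", "number_of_replicas": 1}}""")
  }
}
```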

Regarding sharing Scala code - when I was starting with spark-elastic, this GitHub repo was really useful.

-Jan

Thanks Jan. I found out that the low performance was caused by other operations. Without those operations, I am able to achieve similar (2k/s) write performance to ES. Thanks again.

@aaskey Adding some monitoring on the ES side (such as Marvel) helps in figuring out whether ES itself is overloaded or whether the OS is under pressure.
Always keep an eye on CPU/IO/Mem.

Glad to hear things are back to normal.

@kucera.jan.cz RestService.createWriter is used once per client, not once for every call - at least within ES-Hadoop itself. If users keep creating jobs on every call then yes, the partition discovery and execution will happen quite often; however, at that point Spark itself and the rest of the components will likely already be adding significant overhead.
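
If you do want to experiment on the connector side, the main knobs are the number of partitions (roughly one bulk writer per partition/task) and the es.batch.* bulk settings. A sketch, with example values only:

```scala
import org.apache.spark.rdd.RDD
import org.elasticsearch.spark._

def indexTuned(docs: RDD[Map[String, Any]]): Unit = {
  docs
    .repartition(4) // 4 partitions -> 4 concurrent writers against the cluster
    .saveToEs("spark/docs", Map(
      "es.batch.size.entries"  -> "5000",  // docs per bulk request (default 1000)
      "es.batch.size.bytes"    -> "5mb",   // bytes per bulk request (default 1mb)
      "es.batch.write.refresh" -> "false"  // skip the index refresh after each bulk write
    ))
}
```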