How to increase writing speed to an index using Spark ES

ElasticQuestion_1234 · December 7, 2021, 11:09pm

Hi, I am trying to write a DataFrame with 10k rows and 31 columns into an Elasticsearch index using Spark. I am using the "JavaEsSparkSQL.saveToEs" function. There are 12 concurrent tasks. I used the default "es.batch.size.bytes" and "es.batch.size.entries".

The write takes several seconds. I want to get it down to less than 1s. Does the "JavaEsSparkSQL.saveToEs" function perform bulk inserts? Are there any good practices for increasing the indexing speed? Thanks in advance.

Christian_Dahlqvist · December 8, 2021, 6:16am

Which version of Elasticsearch are you using? What is the size and hardware specification of the cluster you are indexing into? How many indices and shards are you actively indexing into? Are you indexing immutable documents or also performing updates?

DineshNaik · December 8, 2021, 7:14am

Refer to this as well: Tune for indexing speed | Elasticsearch Guide [master] | Elastic

ElasticQuestion_1234 · December 8, 2021, 2:59pm

I am using version 7.9.0. I currently have a single-node cluster with an Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz, 2592 MHz, 6 Core(s), 12 Logical Processor(s), 32GB RAM. I am actively writing into one index with 5 primary shards and 5 replica shards. The rate will be one DataFrame with 10k rows and 31 columns every 2 seconds. That means the index is getting updated every 2 seconds.

Hope this information helps. I believe the data is not very large, so the write should be faster. I am just not familiar with how to tune it. Thanks for the help.

ElasticQuestion_1234 · December 8, 2021, 3:03pm

Thanks. I will look into this guide. However, I tried tuning "es.batch.size.bytes" and "es.batch.size.entries" for bulk inserts, which didn't make much difference. Do you know whether .saveToEs uses bulk requests?

Christian_Dahlqvist · December 8, 2021, 3:36pm

I would recommend upgrading Elasticsearch as I believe version 7.9.0 had a memory leak in Lucene. Elasticsearch is often limited by storage performance rather than CPU, especially when indexing. Are you using local SSDs?

Keith_Massey · December 8, 2021, 4:52pm

Yes, saveToEs uses bulk requests controlled by the properties you mention.
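
A rough sketch of why changing the batch properties may not have shown much effect in this case. The setting names are the es-hadoop configuration keys (note the spelling `es.batch.size.bytes`); the values, the host, and the helper function `bulk_requests_per_task` are illustrative assumptions, not tested recommendations. Batch sizes apply per task, so with 10k rows spread evenly over 12 tasks each task already stays under the default 1000-entry limit.

```python
import math

# es-hadoop write settings. Values are illustrative assumptions, not recommendations.
ES_WRITE_OPTIONS = {
    "es.nodes": "localhost",            # assumption: the single local node from the thread
    "es.port": "9200",
    "es.batch.size.entries": "5000",    # max documents per bulk request, per task (default 1000)
    "es.batch.size.bytes": "5mb",       # max payload per bulk request, per task (default 1mb)
    "es.batch.write.refresh": "false",  # do not request an index refresh after the write (default true)
}


def bulk_requests_per_task(total_rows, tasks, batch_entries):
    """Estimate rows and bulk requests per Spark task, ignoring the byte limit
    and assuming rows are spread evenly across tasks."""
    rows_per_task = math.ceil(total_rows / tasks)
    return rows_per_task, math.ceil(rows_per_task / batch_entries)


rows, requests = bulk_requests_per_task(10_000, 12, 1000)
print(rows, requests)  # 834 1
```

Under these assumptions each task sends a single bulk request either way, so raising the limits changes little. The same keys could be passed as the configuration map to `JavaEsSparkSQL.saveToEs`, or in PySpark via `df.write.format("org.elasticsearch.spark.sql").options(**ES_WRITE_OPTIONS)`.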

ElasticQuestion_1234 · December 8, 2021, 6:19pm

Thanks for the reply. I am just using a 1 TB local HDD. Does version 7.10.2 have the same problem as well? I will try 7.10.2 and see how it performs. So is indexing bound by the server's I/O performance?

Christian_Dahlqvist · December 8, 2021, 7:32pm

Elasticsearch 7.10 does not have that issue.

Yes, indexing is often bound by I/O performance, especially when slow storage is used.

ElasticQuestion_1234 · December 8, 2021, 7:35pm

Thanks, I will give 7.10 a try.

Christian_Dahlqvist · December 8, 2021, 7:38pm

Unless your total data volume dictates that you require 5 primary shards, it might be worthwhile to test indexing into an index with a single primary shard.
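
A minimal sketch of the request body for such a test index, built as a plain Python dict. The helper name `index_settings` and the index name in the closing comment are hypothetical. Setting replicas to 0 is an added assumption, not something suggested in the thread: on a single-node cluster replica shards cannot be allocated anyway.

```python
import json


def index_settings(primary_shards=1, replicas=0):
    """Build the JSON body for creating an index with explicit shard counts."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas,  # assumption: no replicas on a single node
            }
        }
    }


body = index_settings()
print(json.dumps(body, indent=2))
# Would be sent as the body of: PUT /my-test-index   (hypothetical index name)
```

The number of primary shards is fixed when the index is created, so comparing 1 shard against 5 means writing the same DataFrame into two separately created indices and timing each.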

ElasticQuestion_1234 · December 9, 2021, 5:43am

Thanks. I will try reducing primary shards as well.
