How to increase writing speed to an index using Spark ES

Hi, I am trying to write a DataFrame with 10k rows and 31 columns into an Elasticsearch index using Spark. I am using the "JavaEsSparkSQL.saveToEs" function with 12 concurrent tasks and the default values for "es.batch.size.bytes" and "es.batch.size.entries".

The write takes several seconds, and I want to get it down to less than 1s. Does "JavaEsSparkSQL.saveToEs" perform bulk inserts? Are there any good practices for increasing indexing speed? Thanks in advance.

Which version of Elasticsearch are you using? What is the size and hardware specification of the cluster you are indexing into? How many indices and shards are you actively indexing into? Are you indexing immutable documents or also performing updates?

See also the "Tune for indexing speed" page in the Elasticsearch Guide.


I am using version 7.9.0. I currently have a single-node cluster with an Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz, 2592 MHz, 6 cores, 12 logical processors, and 32GB RAM. I am actively writing into one index with 5 primary shards and 5 replica shards. The rate is one DataFrame with 10k rows and 31 columns every 2 seconds, which means the index is updated every 2 seconds.

Hope this information helps. The data is not very large, so I believe the write should be faster; I am just not familiar with how to tune it. Thanks for the help.

Thanks. I will look into this guide. However, I tried tuning "es.batch.size.bytes" and "es.batch.size.entries" for bulk insert, which didn't make much difference. Do you know whether .saveToEs uses bulk requests?

I would recommend upgrading Elasticsearch as I believe version 7.9.0 had a memory leak in Lucene. Elasticsearch is often limited by storage performance rather than CPU, especially when indexing. Are you using local SSDs?

Yes, saveToEs uses bulk requests controlled by the properties you mention.
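
For reference, here is a minimal sketch of how those properties might be set. The class and method names (`EsBulkSettings`, `build`, `docsPerTask`) and the index name are hypothetical, and the values are illustrative assumptions rather than recommendations; only the `es.*` keys are real es-hadoop settings. As far as I know, the batch limits apply per task, so if the 10k rows are spread evenly over 12 tasks, each task holds about 834 documents, which is already under the default of 1000 entries. That may be why changing the batch sizes made little difference here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; only the "es.*" keys are real es-hadoop settings.
final class EsBulkSettings {

    // Builds the write configuration map. Values are illustrative assumptions.
    static Map<String, String> build() {
        Map<String, String> cfg = new LinkedHashMap<>();
        cfg.put("es.batch.size.bytes", "4mb");      // default is 1mb, per task
        cfg.put("es.batch.size.entries", "5000");   // default is 1000, per task
        cfg.put("es.batch.write.refresh", "false"); // default is true (refresh after the write)
        return cfg;
    }

    // Documents each task handles if rows are spread evenly (rounded up).
    static int docsPerTask(int totalRows, int tasks) {
        return (totalRows + tasks - 1) / tasks;
    }

    public static void main(String[] args) {
        Map<String, String> cfg = build();
        System.out.println(cfg);
        System.out.println("docs per task: " + docsPerTask(10_000, 12));
        // With es-hadoop on the classpath, the map would be passed as:
        // JavaEsSparkSQL.saveToEs(df, "my-index", cfg);  // "my-index" is a placeholder
    }
}
```

Disabling `es.batch.write.refresh` only makes sense if the new documents do not need to be searchable immediately after each write.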


Thanks for the reply. I am just using a 1 TB local HDD. Does version 7.10.2 have the same problem? I will try 7.10.2 and check the performance. So indexing is bounded by the server's I/O performance?

Elasticsearch 7.10 does not have that issue.

That is often the case, especially when slow storage is used.

Thanks, I will give 7.10 a try.

Unless your total data volume dictates that you require 5 primary shards, it might be worthwhile to test indexing into an index with a single primary shard.
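
A rough sketch of setting up such a test index follows. The shard count cannot be changed on an existing index, so the test needs a new index. The class name, method names, index name, and URL are all hypothetical, and setting replicas to 0 is my own assumption for a single-node test, since replica shards cannot be allocated on the same node as their primary anyway. The code only builds the create-index request and does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper that prepares a create-index request for a test index.
final class SingleShardIndex {

    // JSON body for the create-index API with explicit shard and replica counts.
    static String settingsBody(int primaryShards, int replicas) {
        return "{\"settings\": {\"index\": {"
             + "\"number_of_shards\": " + primaryShards + ", "
             + "\"number_of_replicas\": " + replicas + "}}}";
    }

    // Builds (but does not send) PUT <baseUrl>/<indexName> with 1 primary, 0 replicas.
    static HttpRequest createIndexRequest(String baseUrl, String indexName) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + indexName))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(settingsBody(1, 0)))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder URL and index name; adjust to the actual cluster.
        HttpRequest req = createIndexRequest("http://localhost:9200", "my-test-index");
        System.out.println(req.method() + " " + req.uri());
        System.out.println(settingsBody(1, 0));
        // To send it (Java 11+):
        // HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString());
    }
}
```

Writing the same DataFrame to this index and to the 5-shard index should show whether the shard count matters for this workload.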

Thanks. I will try reducing primary shards as well.
