Hi, I am trying to write a DataFrame with 10k rows and 31 columns into an Elasticsearch index using Spark. I am using the "JavaEsSparkSQL.saveToEs" function. There are 12 concurrent tasks. I used the default "es.batch.size.bytes" and "es.batch.size.entries".
Each write takes several seconds. I want to get it down to less than 1s. Does the "JavaEsSparkSQL.saveToEs" function do a bulk insert? Are there any good practices for increasing the indexing speed? Thanks in advance.
Which version of Elasticsearch are you using? What is the size and hardware specification of the cluster you are indexing into? How many indices and shards are you actively indexing into? Are you indexing immutable documents or also performing updates?
I am using version 7.9.0. I currently have a single-node cluster with an Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz, 2592 MHz, 6 Core(s), 12 Logical Processor(s), and 32GB RAM. I am actively writing into one index with 5 primary shards and 5 replica shards. The rate will be one DataFrame with 10k rows and 31 columns every 2 seconds. That means the index is getting updated every 2 seconds.
Hope this information helps. I believe the data is not very large, so the speed should be faster. I am just not familiar with how to tune it. Thanks for the help.
Thanks. I will look into this guide. However, I tried tuning "es.batch.size.bytes" and "es.batch.size.entries" for bulk insert, which didn't make much difference. Do you know whether .saveToEs uses bulk requests?
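For reference, the connector documentation describes "es.batch.size.bytes" and "es.batch.size.entries" as the limits for batch writes made through the Elasticsearch bulk API, applied per task. The sketch below is only a rough illustration of why changing them may not matter at this volume: with 10k rows over 12 tasks, each task holds fewer documents than the default entry limit. It assumes rows are spread evenly across tasks and that the byte limit is not reached first. The class name, the helper methods, and the index name in the comment are hypothetical, and turning off "es.batch.write.refresh" is a suggestion to test, not something confirmed in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EsBulkSettings {

    // Connector settings as plain strings; the values are illustrative, not tuned.
    public static Map<String, String> writeSettings() {
        Map<String, String> cfg = new LinkedHashMap<>();
        cfg.put("es.batch.size.entries", "1000");   // documented default
        cfg.put("es.batch.size.bytes", "1mb");      // documented default
        // Default is true: the connector refreshes the index once a task's write completes.
        cfg.put("es.batch.write.refresh", "false");
        return cfg;
    }

    // Documents each task writes if rows are spread evenly (ceiling division).
    public static int docsPerTask(int rows, int tasks) {
        return (rows + tasks - 1) / tasks;
    }

    // Bulk requests per task, assuming only the entry limit triggers a flush.
    public static int bulksPerTask(int docsPerTask, int batchEntries) {
        return (docsPerTask + batchEntries - 1) / batchEntries;
    }

    public static void main(String[] args) {
        int perTask = docsPerTask(10_000, 12);
        System.out.println("docs per task: " + perTask);
        System.out.println("bulk requests per task: " + bulksPerTask(perTask, 1000));
        System.out.println(writeSettings());
        // With elasticsearch-hadoop on the classpath, the map would be passed as:
        // JavaEsSparkSQL.saveToEs(df, "my-index", writeSettings());
    }
}
```

Under these assumptions each task sends a single bulk request of about 834 documents, so raising the batch limits changes nothing, and the time is more likely spent on the Elasticsearch side.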
I would recommend upgrading Elasticsearch as I believe version 7.9.0 had a memory leak in Lucene. Elasticsearch is often limited by storage performance rather than CPU, especially when indexing. Are you using local SSDs?
Thanks for the reply. I am just using a 1 TB local HDD. Does version 7.10.2 have the same problem as well? I will try 7.10.2 and see how it performs. So indexing is bound by the server's I/O performance?
Unless your total data volume dictates that you require 5 primary shards, it might be worthwhile to test indexing into an index with a single primary shard.
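A test index like that could be created with a settings body along these lines. This is a minimal sketch, not a recommended production setup: the host "http://localhost:9200", the index name "test-single-shard", and the class and method names are assumptions for illustration. Setting "number_of_replicas" to 0 is also an assumption, based on the fact that a single-node cluster cannot assign replica shards anyway. The code only builds the request; sending it is left as a comment.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class SingleShardIndex {

    // Settings body for the create-index API: one primary shard, no replicas.
    public static String settingsBody() {
        return "{\"settings\":{\"index\":{\"number_of_shards\":1,\"number_of_replicas\":0}}}";
    }

    // Builds (but does not send) the PUT request that would create the index.
    public static HttpRequest createIndexRequest(String baseUrl, String index) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/" + index))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(settingsBody()))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical host and index name.
        HttpRequest request = createIndexRequest("http://localhost:9200", "test-single-shard");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(settingsBody());
        // To send it:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

Pointing the same saveToEs call at this index and comparing timings against the 5-shard index should show whether the shard count matters on this hardware.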