1. Problem background
I am using an Elasticsearch cluster to store the results of a Spark 2.3 job.
The same cluster also serves online queries.
My Spark task is a daily job that writes about 6 million records to the ES cluster.
The write currently takes about 10 minutes, at an indexing speed of roughly 10,000 docs per second. During those 10 minutes, some queries take more than 1 second, while the same queries at any other time of day return in about 10 milliseconds.
So I think the ES cluster is overloaded by the writes, and I want to lower the write speed (I can accept a longer write time) so that the online queries stay fast.
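For reference, the write itself is a plain JavaEsSpark.saveToEs call. Below is a simplified sketch of the job; the class name DailyEsWriter and the resource "daily-index/doc" are placeholders, and the real RDD holds the ~6 million computed records:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

import java.util.Collections;
import java.util.Map;

public class DailyEsWriter {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("daily-es-writer");
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            // Stand-in for the real daily dataset (~6 million documents).
            JavaRDD<Map<String, Object>> records = jsc.parallelize(
                    Collections.singletonList(
                            Collections.<String, Object>singletonMap("field", "value")));
            // Each document is bulk-indexed into the configured ES cluster.
            JavaEsSpark.saveToEs(records, "daily-index/doc");
        }
    }
}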
The ES-Hadoop Maven dependencies:
<dependency> <!-- Spark dependency -->
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.0</version>
</dependency>
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-spark-20_2.11</artifactId>
    <version>7.8.1</version>
</dependency>
2. What I have tried
- Use fewer Spark executors:
--executor-cores 1 --num-executors 1
- Reduce the write batch size and disable the refresh after bulk writes:
import org.elasticsearch.hadoop.cfg.ConfigurationOptions;

SparkConf sparkConf = new SparkConf()
        ...
        // "es.batch.size.entries": flush a bulk request after every 50 documents (default 1000)
        .set(ConfigurationOptions.ES_BATCH_SIZE_ENTRIES, "50")
        // "es.batch.write.refresh": skip the index refresh ES-Hadoop triggers after bulk writes
        .set(ConfigurationOptions.ES_BATCH_WRITE_REFRESH, "false");
But the write speed is still too high for my cluster.
Is there any other way to throttle the write speed with ES-Hadoop?
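One idea I had is to throttle upstream of saveToEs: since ES-Hadoop consumes each partition's iterator lazily, wrapping that iterator in a rate limiter should slow the bulk writes down. Here is a sketch of what I mean, assuming Guava is added as an extra dependency (RateLimiter is not part of ES-Hadoop, and the per-task cap value is just an example). I am not sure whether this is a reasonable approach:

import com.google.common.collect.Iterators;
import com.google.common.util.concurrent.RateLimiter;
import org.apache.spark.api.java.JavaRDD;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

import java.util.Map;

static void saveThrottled(JavaRDD<Map<String, Object>> records, double docsPerSecondPerTask) {
    JavaRDD<Map<String, Object>> throttled = records.mapPartitions(docs -> {
        // One limiter per task, so the cluster-wide rate is roughly
        // (number of concurrent tasks) * docsPerSecondPerTask.
        RateLimiter limiter = RateLimiter.create(docsPerSecondPerTask);
        return Iterators.transform(docs, doc -> {
            limiter.acquire(); // blocks until a permit is available
            return doc;
        });
    });
    // "daily-index/doc" is the same placeholder resource as above.
    JavaEsSpark.saveToEs(throttled, "daily-index/doc");
}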
Sorry to bother you, and thanks for any suggestions.