You should send requests to all data nodes. The master does nothing special for request processing. Have you tried increasing the level of parallelism?
I am sending data to only one IP, the master node (es_host).
Data is being written to all 3 nodes via the elasticsearch.yml config.
Beyond a certain level of parallelism, some Spark tasks fail, so the whole operation fails (I tried parallelism 8 and 16).
How do I send data to all data nodes? For example, I have 3 es_host IPs.
Unfortunately, as you've discovered, there's no single best configuration for writing from Spark to Elasticsearch. It's a careful balance of sending as much data as Elasticsearch can handle without overwhelming it.
How do I send data to all data nodes? For example, I have 3 es_host IPs.
You can put a comma-delimited list of nodes in the es.nodes setting.
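For example, a minimal PySpark sketch; the IPs, index name, and DataFrame `df` here are placeholders, not values from your setup:

```python
# Hypothetical write config: es.nodes takes a comma-delimited list of hosts,
# so the connector can spread bulk requests across all three data nodes.
df.write \
    .format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "10.0.0.1,10.0.0.2,10.0.0.3") \
    .option("es.port", "9200") \
    .mode("append") \
    .save("my-index")
```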
.option("es.write.operation.parallelism", "4")
I can't find any reference to that setting in the code. Did you mean to put something else here? I think you can probably remove this.
How many Spark executors are you writing from? I assume "Tried parallelism 8 and 16" means you've tried writing from 8 and 16 executors? Given that your whole cluster only has 3 shards, you're probably not going to get much benefit from using that many executors. Could you try writing to an index with more shards? How many failures are you getting?
My general advice would be to increase the number of shards somewhat. Then start writing from Spark to all Elasticsearch nodes with the number of executors equal to the number of shards you have. Then slowly increase the number of executors until Elasticsearch starts rejecting bulk requests. Then reduce the number of executors a little and make sure you're not getting any rejections.
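One way to sketch that starting point (assuming a DataFrame `df`; the shard count, hosts, and index name are placeholders you'd tune per the steps above):

```python
# Hypothetical tuning starting point: one Spark write task per shard.
# Increase num_partitions gradually until bulk rejections appear, then back off.
num_partitions = 6  # e.g. equal to the index's shard count after resharding

df.repartition(num_partitions) \
    .write \
    .format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "10.0.0.1,10.0.0.2,10.0.0.3") \
    .mode("append") \
    .save("my-index")
```

`repartition` controls how many concurrent tasks write to Elasticsearch, which is usually an easier knob to turn than changing the executor count itself.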