I'm running a PySpark job on EMR Serverless that reads data from S3 and writes it to Elastic Cloud.
Errors
When attempting to write a PySpark DataFrame to Elastic Cloud (v8.7.0):
Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
ERROR NetworkClient: Node [Radar_DB:XXXX:9243] failed (java.net.UnknownHostException: Radar_DB); no other nodes left - aborting...
I see this error when running my job locally or on EMR Serverless.
Notes:
The code works fine when writing data to a locally running ES instance (v8.7.0) from PySpark.
It also works if I collect the DataFrame and then use the Python Elasticsearch client with its bulk function.
When you're running locally, can you ping Radar_DB from your machine (or whichever machine the Spark tasks are running on)? The error message suggests that your machine can't resolve Radar_DB to an IP address.
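If ping isn't available (for example inside an EMR Serverless job), the same check can be done from Python. This is only a minimal sketch using the standard library; `can_resolve` is a name I made up for illustration, and it only tests DNS resolution, not whether port 9243 is reachable.

```python
import socket


def can_resolve(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Roughly the Python counterpart of java.net.UnknownHostException
        return False


if __name__ == "__main__":
    # "Radar_DB" is the host name taken from the error message above
    print("Radar_DB resolves:", can_resolve("Radar_DB"))
```

If this prints False on the machine where the Spark tasks run, the value passed in es.nodes is not a host name that machine can look up.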
That's actually a very good question with a complicated answer! I believe that in Elastic Cloud you can only access the ELB, which round-robins requests across all of the Elasticsearch nodes. That's a little unfortunate, because it means the whole cluster appears as a single node to es-hadoop, and if es-hadoop (or Spark) gets too many failures from a node, it will blacklist that node for the remainder of the job. Since there's only one "node", es-hadoop will bail out of your job after a small number of failures, even if in reality you have dozens of nodes behind the load balancer.
One way that people have worked around this is to create several aliases for the load balancer (in their own DNS or even locally in /etc/hosts), and to list the same load balancer multiple times in es.nodes, using the different aliases. I've written this up somewhere before, but I don't remember where. I'll link to it from here if I find it.
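A rough sketch of what that workaround could look like from PySpark. The alias host names, the IP address in the comment, and the helper name `es_write_options` are all hypothetical; the setting names (es.nodes, es.port, es.nodes.wan.only, es.net.ssl, es.net.http.auth.user, es.net.http.auth.pass) are standard es-hadoop options, and port 9243 is taken from the error message above.

```python
# Hypothetical /etc/hosts entries, all pointing at the same load balancer:
#   203.0.113.10  es-alias-1.example.internal
#   203.0.113.10  es-alias-2.example.internal
#   203.0.113.10  es-alias-3.example.internal


def es_write_options(aliases, port=9243, user=None, password=None):
    """Build es-hadoop options that list the same load balancer under several aliases."""
    options = {
        # Comma-separated list; es-hadoop treats each alias as a separate node
        "es.nodes": ",".join(aliases),
        "es.port": str(port),
        # Needed for cloud/WAN targets: only connect through the declared nodes
        "es.nodes.wan.only": "true",
        "es.net.ssl": "true",
    }
    if user is not None and password is not None:
        options["es.net.http.auth.user"] = user
        options["es.net.http.auth.pass"] = password
    return options


if __name__ == "__main__":
    opts = es_write_options(
        ["es-alias-1.example.internal", "es-alias-2.example.internal"]
    )
    print(opts)
    # With a DataFrame `df` and the es-hadoop connector on the classpath,
    # the write would look something like this ("my-index" is a placeholder):
    # df.write.format("org.elasticsearch.spark.sql") \
    #     .options(**opts).mode("append").save("my-index")
```

With several aliases in es.nodes, a failure that blacklists one alias still leaves the others available, which is the point of the trick.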