Writing PySpark dataframe to Elastic Cloud (Cannot detect ES version)

That's actually a very good question with a complicated answer! I believe you can only reach the load balancer in Elastic Cloud, and it round-robins requests to all of the Elasticsearch nodes. That's a little unfortunate, because it means the whole cluster appears as a single node to es-hadoop. If es-hadoop (or Spark) gets too many failures from a node, it blacklists that node for the remainder of the job. Since there's only one "node" from its point of view, es-hadoop will bail out of your job after a small number of failures, even if in reality you have dozens of nodes behind the load balancer.
One way people have worked around this is to create several aliases for the load balancer (in their own DNS, or even locally in /etc/hosts) and then list the same load balancer multiple times in es.nodes, once per alias, so es-hadoop treats them as separate nodes (see the sketch below). I've written this up somewhere before, but I don't remember where. I'll link to it from here if I find it.
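
Here's a minimal sketch of that workaround in PySpark. The hostnames (es-lb-1.internal, etc.), index name, and credentials are hypothetical stand-ins; the es.* options themselves are standard es-hadoop settings:

```python
# Hypothetical /etc/hosts entries, all resolving to the load balancer's IP:
#   203.0.113.10  es-lb-1.internal
#   203.0.113.10  es-lb-2.internal
#   203.0.113.10  es-lb-3.internal

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-write").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

(df.write
    .format("org.elasticsearch.spark.sql")
    # List the same load balancer under several aliases so es-hadoop sees
    # multiple "nodes" and won't abort the job after blacklisting one.
    .option("es.nodes", "es-lb-1.internal,es-lb-2.internal,es-lb-3.internal")
    .option("es.port", "9243")  # Elastic Cloud's HTTPS port
    # The data nodes behind a cloud load balancer aren't directly reachable,
    # so route all traffic through the addresses given in es.nodes.
    .option("es.nodes.wan.only", "true")
    .option("es.net.ssl", "true")
    .option("es.net.http.auth.user", "elastic")      # assumption: basic auth
    .option("es.net.http.auth.pass", "<password>")
    .save("my-index"))
```

With es.nodes.wan.only enabled, es-hadoop also skips node discovery, which avoids it trying (and failing) to connect to the cluster's internal addresses.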
