org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
This seems to be a common error posted in these forums, and the typical solutions (a configuration sketch follows the list) are:
Set "es.nodes.wan.only" to "true"
Ensure "es.nodes" does not point to a Cloud ID but to an https endpoint
Ensure "es.net.http.auth.user" and "es.net.http.auth.pass" values are correctly set
I've done all of those things, and in the past they were sufficient. In fact, the exact same code and configuration are able to write to an Elasticsearch cluster running version 8.11.3. The new cluster I need to write to runs version 8.13.2, and that's when I get the error, so I suspect something changed between those two versions.
I've also tried several elasticsearch-spark jar versions, but the error persists.
Actually, :443 works for all endpoints now; for Elasticsearch, :9243 is "legacy" but still supported. You DO need to add a port, though, otherwise most client libraries will fall back to the actual default of :9200, so an endpoint without an explicit port is the first thing I'd check.
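In connector terms, either of these forms makes the port explicit; a sketch, with a placeholder hostname:

    // Port embedded directly in the es.nodes URL
    val opts = Map(
      "es.nodes" -> "https://example.es.us-west-1.aws.found.io:443",
      "es.nodes.wan.only" -> "true"
    )

    // Equivalent: bare hostname plus an explicit es.port
    val optsAlt = Map(
      "es.nodes" -> "https://example.es.us-west-1.aws.found.io",
      "es.port" -> "443",
      "es.nodes.wan.only" -> "true"
    )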
I tried .option("es.nodes", "https://example.es.us-west-1.aws.found.io:443") and got the same error.
Is there any way I can turn on verbose logging? Or debug locally without Spark, using the jar directly, just to test that the connection works?
There ought to be more in the stack trace. Es-hadoop gives that Cannot detect ES version... error message for any exception it catches while trying to connect to the cluster. Sometimes it can be misleading. But the "caused by" portion of the stack trace ought to tell us more.
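If you want more detail in the meantime, raising the es-hadoop REST layer to TRACE logs the actual HTTP exchanges. A sketch, assuming a Spark distribution that bundles log4j 1.x (on newer log4j2-based Spark you'd set the same logger name in the logging config instead):

    import org.apache.log4j.{Level, Logger}

    // es-hadoop's REST traffic is logged under this package
    Logger.getLogger("org.elasticsearch.hadoop.rest").setLevel(Level.TRACE)

And to test the connection without Spark at all, you can hit the cluster's root endpoint, which returns exactly the version information es-hadoop tries to detect. A minimal JDK-only sketch; the endpoint and credentials are placeholders:

    import java.net.{HttpURLConnection, URL}
    import java.util.Base64
    import scala.io.Source

    val url  = new URL("https://example.es.us-west-1.aws.found.io:443/")
    val auth = Base64.getEncoder.encodeToString("elastic:<password>".getBytes("UTF-8"))

    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestProperty("Authorization", s"Basic $auth")
    println(s"HTTP ${conn.getResponseCode}")
    // Body is JSON containing "version": {"number": "..."}
    println(Source.fromInputStream(conn.getInputStream).mkString)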
Switching to the actual, garbled-looking endpoint, like https://sdfgsdfgf52872baf38dfb21236.us-west-1.aws.found.io:443, did the trick.
When I ran your original curl command, I actually didn't get anything back. I also didn't get an error, so I figured everything was fine and not worth mentioning. But when I use the actual endpoint in the curl command, I get back the expected cluster-info JSON, including the version number.
Passing the https://example.es.us-west-1.aws.found.io endpoint to curl when calling the 8.11.3 cluster gives a similar payload to when I use the actual endpoint with the 8.13.2 cluster. Updating my Spark code to point to the real endpoint works too, and I'm able to see data flow through and get added to the index.
How did you know to use the real endpoint and not the cleaner-looking one? Is that documented somewhere?
By the way, thank you so much for your fast responses and help. I spent 4 days working on the issue. I'm so relieved it's resolved.