Elastic - Spark connector failing to read data

Hi all,

I am trying to read data from Elasticsearch to Databricks (Spark) but I'm getting the following error:

org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'

which, according to your documentation, is typically symptomatic of a wrong driver version.

I'm running

  • Databricks runtime version 13.0 (includes Apache Spark 3.4.0, Scala 2.12)
  • Elasticsearch version 8.5.2
  • I thus installed org.elasticsearch:elasticsearch-spark-30_2.12:8.5.2 from Maven on the Databricks cluster

From a networking perspective, I'm able to telnet to the Elasticsearch host.
However, I'm not able to pull data from the Elasticsearch server using the following command:

df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("spark.es.nodes", hostname)
      .option("spark.es.port", port)
      .option("spark.es.nodes.wan.only", "true")
      .option("spark.es.net.ssl", "true")
      .option("spark.es.net.http.auth.user", username)
      .option("spark.es.net.http.auth.pass", password)
      .load(f"{index}")
     )
display(df)

Hello,

I have the same issue. Were you able to resolve this and connect to ES?

Hi Si,
No, not yet. Can anyone from the Elastic community help us?

Hi @ljSolaiman

I know nothing about that connector, but a common issue is if you are using self-signed certs.

Instead of telnet, from the client server can you try this and show the results:

curl -v -u elastic:password https://hostname:port

Hi @stephenb,

Here is what I get from the curl command (masked IP address and hostname).

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
*   Trying YYY.YY.YYY.YY:9200...
* Connected to XXXXX (YYY.YY.YYY.YY) port 9200 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.2 (IN), TLS header, Certificate Status (22):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.2 (IN), TLS header, Finished (20):
{ [5 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [32 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [1687 bytes data]
* TLSv1.2 (OUT), TLS header, Unknown (21):
} [5 bytes data]
* TLSv1.3 (OUT), TLS alert, unknown CA (560):
} [2 bytes data]
* SSL certificate problem: unable to get local issuer certificate
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Closing connection 0
curl: (60) SSL certificate problem: unable to get local issuer certificate

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

Ok now try

curl -k -v -u elastic:password https://hostname:port

You most likely have a self-signed certificate on Elasticsearch, so you will need to look at the connector's SSL/security settings.

I also see this configuration on this page

es.net.ssl.cert.allow.self.signed (default false)
Whether or not to allow self signed certificates

Perhaps set that to true
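
Something along these lines might work for your read (untested sketch; hostname, port, index, and credentials are placeholders for your own values):

# Sketch only: same read as before, but with SSL enabled and
# self-signed certificates allowed. All variable values are placeholders.
df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", hostname)
      .option("es.port", port)
      .option("es.nodes.wan.only", "true")
      .option("es.net.ssl", "true")
      .option("es.net.ssl.cert.allow.self.signed", "true")
      .option("es.net.http.auth.user", username)
      .option("es.net.http.auth.pass", password)
      .load(index))
display(df)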

I believe this is a network connectivity issue. Curl works, but the spark.read command does not. Any idea where this could come from? Thanks.
My latest trace is below:

Connecting to Elasticsearch from a Databricks notebook. The cluster runs in single-node mode.

The following curl command works:

curl -k -u user:pwd https...XXX.XX.XXX.XX

Pyspark code:

df = (spark.read.format("org.elasticsearch.spark.sql")
      .option("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .option("es.nodes", "XXX.XX.XXX.XX")
      .option("es.port", "9200")
      .option("es.nodes.wan.only", "true")
      .load("index")
     )

The above code throws this error:

Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[172.23.216.29:9200]]
  at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:160)
  at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:442)
  at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:438)
  at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:406)
  at org.elasticsearch.hadoop.rest.RestClient.mainInfo(RestClient.java:755)
  at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:393)
  at org.elasticsearch.spark.sql.ElasticsearchRelation.cfg$lzycompute(DefaultSource.scala:234)
  at org.elasticsearch.spark.sql.ElasticsearchRelation.cfg(DefaultSource.scala:231)
  at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema$lzycompute(DefaultSource.scala:238)
  at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema(DefaultSource.scala:238)
  at org.elasticsearch.spark.sql.ElasticsearchRelation.$anonfun$schema$1(DefaultSource.scala:242)
  at scala.Option.getOrElse(Option.scala:189)
  at org.elasticsearch.spark.sql.ElasticsearchRelation.schema(DefaultSource.scala:242)

Did you try setting the self-signed cert setting that I showed?

That was the point of my entire post.

The -k in the curl command allows for self-signed certs. That was the point of running it.

If you run curl without the -k, I suspect it will fail.

If curl works from the same box, it's unlikely, but not impossible, that it's a connectivity issue.

Try the curl without -k; I suspect it won't work.

Report back. Let's see what we can learn
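
If the curl without -k fails with the same "unable to get local issuer certificate" error, then the problem is the certificate, not the network. Besides allowing self-signed certs, another option is to point the connector at a truststore containing your cluster's CA. A rough sketch; the truststore path and password below are placeholders, assuming you have imported the CA into a JKS truststore reachable from the workers:

# Sketch only: trust the cluster's CA via a truststore instead of
# disabling certificate verification. Path and password are placeholders.
df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "XXX.XX.XXX.XX")
      .option("es.port", "9200")
      .option("es.nodes.wan.only", "true")
      .option("es.net.ssl", "true")
      .option("es.net.ssl.truststore.location", "file:///dbfs/certs/truststore.jks")
      .option("es.net.ssl.truststore.pass", "changeit")
      .load("index"))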

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.