Elastic - Spark connector failing to read data

ljSolaiman · May 24, 2023, 2:46pm

Hi all,

I am trying to read data from Elasticsearch to Databricks (Spark) but I'm getting the following error:

org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'

which, according to your documentation, is symptomatic of a wrong driver version.

I'm running

Databricks runtime version 13.0 (includes Apache Spark 3.4.0, Scala 2.12)
Elasticsearch version 8.5.2
I thus installed org.elasticsearch:elasticsearch-spark-30_2.12:8.5.2 from Maven on the Databricks cluster

From a networking perspective, I'm able to telnet to the Elastic server.
However, I'm not able to pull data from it using the following command:

df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("spark.es.nodes", hostname)
      .option("spark.es.port", port)
      .option("spark.es.nodes.wan.only", "true")
      .option("spark.es.net.ssl", "true")
      .option("spark.es.net.http.auth.user", username)
      .option("spark.es.net.http.auth.pass", password)
      .load(f"{index}")
     )
display(df)
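
One thing that may be worth checking in the snippet above: as far as the es-hadoop configuration documentation describes it, the `spark.` prefix is meant for properties set through SparkConf (for example `spark.es.nodes`), while options passed to the DataFrame reader use the plain `es.` names. The sketch below works under that assumption and simply strips the prefix. `to_reader_options` is a hypothetical helper, not part of the connector, and the values are placeholders.

```python
def to_reader_options(options):
    """Return a copy of `options` with a leading 'spark.' removed from 'spark.es.*' keys.

    Hypothetical helper for illustration; not part of elasticsearch-hadoop.
    """
    prefix = "spark."
    cleaned = {}
    for key, value in options.items():
        if key.startswith("spark.es."):
            key = key[len(prefix):]  # 'spark.es.nodes' -> 'es.nodes'
        cleaned[key] = str(value)    # reader options are passed as strings
    return cleaned


# Placeholder values; substitute your own host and port.
conf_style = {
    "spark.es.nodes": "hostname",
    "spark.es.port": 9200,
    "spark.es.nodes.wan.only": "true",
    "spark.es.net.ssl": "true",
}
reader_options = to_reader_options(conf_style)
print(reader_options)

# In a notebook, the result could then be passed to the reader:
# df = spark.read.format("org.elasticsearch.spark.sql").options(**reader_options).load(index)
```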

siuser · May 26, 2023, 8:59pm

Hello,

I have the same issue. Were you able to resolve this and connect to ES?

ljSolaiman · May 26, 2023, 9:15pm

Hi Si,
No, not yet. Can anyone from the Elastic community help us?

stephenb · May 27, 2023, 2:31pm

Hi @ljSolaiman

I know nothing about that connector, but a common issue is using a self-signed certificate.

Instead of telnet, can you try this from the client server and show the results?

curl -v -u elastic:password https://hostname:port

ljSolaiman · May 29, 2023, 5:59am

Hi @stephenb,

Here is what I get from the curl command (masked IP address and hostname).

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying YYY.YY.YYY.YY:9200...

Connected to XXXXX (YYY.YY.YYY.YY) port 9200 (#0)
ALPN, offering h2
ALPN, offering http/1.1
CAfile: /etc/ssl/certs/ca-certificates.crt
CApath: /etc/ssl/certs
TLSv1.0 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
TLSv1.2 (IN), TLS header, Certificate Status (22):
{ [5 bytes data]
TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
TLSv1.2 (IN), TLS header, Finished (20):
{ [5 bytes data]
TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [32 bytes data]
TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [1687 bytes data]
TLSv1.2 (OUT), TLS header, Unknown (21):
} [5 bytes data]
TLSv1.3 (OUT), TLS alert, unknown CA (560):
} [2 bytes data]
SSL certificate problem: unable to get local issuer certificate
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
Closing connection 0
curl: (60) SSL certificate problem: unable to get local issuer certificate

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

stephenb · May 29, 2023, 6:12am

Ok now try

curl -k -v -u elastic:password https://hostname:port

You most likely have a self signed certificate on Elasticsearch so you will need to look at

I also see this configuration on this page

es.net.ssl.cert.allow.self.signed (default false)
Whether or not to allow self signed certificates

Perhaps set that to true
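
A minimal sketch of how that setting might be passed from PySpark, assuming the reader accepts the `es.`-prefixed option names listed on the connector's configuration page. `build_es_options` and `read_index` are hypothetical helper names, and the host, port, and credentials are placeholders.

```python
def build_es_options(host, port, user, password, allow_self_signed=False):
    """Collect es-hadoop settings for an HTTPS cluster as a dict of strings."""
    options = {
        "es.nodes": host,
        "es.port": str(port),
        "es.nodes.wan.only": "true",
        "es.net.ssl": "true",
        "es.net.http.auth.user": user,
        "es.net.http.auth.pass": password,
    }
    if allow_self_signed:
        # The setting quoted above; it defaults to false.
        options["es.net.ssl.cert.allow.self.signed"] = "true"
    return options


def read_index(spark, index, options):
    """Load an index as a DataFrame; `spark` is an existing SparkSession."""
    return (spark.read
            .format("org.elasticsearch.spark.sql")
            .options(**options)
            .load(index))


opts = build_es_options("hostname", 9200, "elastic", "password", allow_self_signed=True)
```

In a Databricks notebook this would be called as `df = read_index(spark, index, opts)`. If the certificate is signed by a private CA rather than self-signed, the connector's `es.net.ssl.truststore.location` setting may be the more appropriate route.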

ljSolaiman · June 1, 2023, 11:39am

I believe this is a network connectivity issue. Curl works but not the spark.read command. Any idea where this can come from? Thanks
This is my latest trace below:

Connecting to Elasticsearch from a Databricks notebook. The cluster runs in single-node mode.

The following curl command works:

curl -k -u user:pwd https...XXX.XX.XXX.XX

PySpark code:

df = (spark.read.format("org.elasticsearch.spark.sql")
      .option("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .option("es.nodes", "XXX.XX.XXX.XX")
      .option("es.port", "9200")
      .option("es.nodes.wan.only", "true")
      .load("index")
     )

The above code throws this error:

Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[172.23.216.29:9200]]
    at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:160)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:442)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:438)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:406)
    at org.elasticsearch.hadoop.rest.RestClient.mainInfo(RestClient.java:755)
    at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:393)
    at org.elasticsearch.spark.sql.ElasticsearchRelation.cfg$lzycompute(DefaultSource.scala:234)
    at org.elasticsearch.spark.sql.ElasticsearchRelation.cfg(DefaultSource.scala:231)
    at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema$lzycompute(DefaultSource.scala:238)
    at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema(DefaultSource.scala:238)
    at org.elasticsearch.spark.sql.ElasticsearchRelation.$anonfun$schema$1(DefaultSource.scala:242)
    at scala.Option.getOrElse(Option.scala:189)
    at org.elasticsearch.spark.sql.ElasticsearchRelation.schema(DefaultSource.scala:242)

stephenb · June 1, 2023, 1:57pm

Did you try setting the self-sign cert setting that I showed?

That was the point of my entire post.

The -k in the curl command allows for self-signed certs. That was the point of running it.

If you run curl without the -k, I suspect it will fail.

If curl works from the same box, it's unlikely, but not impossible, that it's a connectivity issue.

Try the curl without -k; I suspect it won't work.

Report back. Let's see what we can learn
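
If curl is awkward to run from the notebook, a rough Python equivalent of the with/without `-k` comparison could look like the sketch below. It uses only the standard library; `make_context` and `probe` are hypothetical helper names, and the URL in the usage comment is a placeholder.

```python
import ssl
import urllib.error
import urllib.request


def make_context(verify):
    """Build an SSL context; verify=False mirrors curl's -k flag."""
    context = ssl.create_default_context()
    if not verify:
        context.check_hostname = False       # must be disabled before verify_mode
        context.verify_mode = ssl.CERT_NONE  # accept any certificate
    return context


def probe(url, verify, timeout=10):
    """Return a short description of what happened when opening `url`."""
    try:
        with urllib.request.urlopen(url, context=make_context(verify), timeout=timeout) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.HTTPError as err:
        # TLS succeeded; the server answered with an error status (e.g. 401 without credentials).
        return f"HTTP {err.code}"
    except OSError as err:
        # Covers certificate verification failures as well as plain connection errors.
        return f"failed: {err}"


# Usage from the same machine that runs the Spark driver:
# print(probe("https://hostname:9200", verify=True))   # like curl without -k
# print(probe("https://hostname:9200", verify=False))  # like curl -k
```

If the first call fails with a certificate error and the second returns an HTTP status, the certificate rather than the network is the likely culprit.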
