Hi everyone,
I created an Elastic Cloud on Kubernetes (ECK) deployment and I'm getting an error when writing a one-row DataFrame to a new index from PySpark.
For context, I'm running everything on Google Cloud Platform (GCP): PySpark 3.3 runs on Dataproc with the elasticsearch-hadoop connector preconfigured (elasticsearch-spark-30_2.12:8.12.0). For ECK, I created a new Google Kubernetes Engine (GKE) cluster, installed the operator and CRDs using the Helm chart (helm install elastic-operator elastic/eck-operator), and ran kubectl apply with the following manifest:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: test
spec:
  version: 8.12.2
  http:
    service:
      spec:
        type: LoadBalancer
  nodeSets:
  - name: masters
    count: 3
    config:
      node.roles: ["master"]
  - name: data
    count: 8
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        ...
    config:
      node.roles: ["data", "ingest"]
    podTemplate:
      spec:
        initContainers:
        - name: sysctl
          securityContext:
            privileged: true
            runAsUser: 0
          command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']
        containers:
        - name: elasticsearch
          resources:
            ...
The LoadBalancer is created successfully and I can curl the cluster:
curl -k -v -u elastic:$PASSWORD https://$IP:9200/
The output shows that it's using a self-signed certificate:
* SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway
I can also extract the public certificate:

kubectl get secret "test-es-http-certs-public" -o go-template='{{index .data "tls.crt" | base64decode }}' > tls.crt

and use it with curl (curl --cacert tls.crt ...), which also works fine.
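For reference, the equivalent check from Python would look something like this (just a sketch, not something I've run; it assumes the requests library is installed and tls.crt is the file extracted above):

import requests

# Sanity check against the LoadBalancer using the extracted CA cert.
# <loadbalancer_ip> and <password> are placeholders.
resp = requests.get(
    "https://<loadbalancer_ip>:9200/",
    auth=("elastic", "<password>"),
    verify="tls.crt",  # the certificate pulled out of the secret above
)
print(resp.status_code, resp.json())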
Now, when I try to run the following code from a Jupyter cell:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("yarn").appName("Foo").getOrCreate()

data = [{'name': 'Alice', 'age': 1}]
df = spark.createDataFrame(data)

options = {
    "es.index.auto.create": "true",
    "es.net.http.auth.user": "elastic",
    "es.net.http.auth.pass": "<password>",
    "es.nodes": "https://<loadbalancer_ip>:9200",
    "es.nodes.wan.only": "true",
    "es.nodes.discovery": "false",
    "es.net.ssl.cert.allow.self.signed": "true",
    "es.net.ssl": "true",
    "es.resource": "foo/",
}

df.write.mode("overwrite").format("org.elasticsearch.spark.sql").options(**options).save()
I get the following error:
ERROR NetworkClient: Node [<loadbalancer_ip>:9200] failed (javax.net.ssl.SSLHandshakeException: java.security.cert.CertPathValidatorException: Trust anchor for certification path not found.); no other nodes left - aborting...
Note: I have not created a tls.crt file on my PySpark (Dataproc) cluster.
Some things I've tried:
- Setting tls.selfSignedCertificate.disabled: true in the manifest - this only renders my cluster unreachable (LibreSSL/3.3.6: error:1404B42E:SSL routines:ST_CONNECT:tlsv1 alert protocol version).
- Setting es.net.ssl.cert.allow.self.signed to false - nothing happens.
I suspect I need to generate a certificate (or a truststore from the cluster's CA) and add it in one of these configs, but it's not clear to me whether that's actually needed, or how to do it; my best guess is sketched below.
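If that's the right direction, I'd guess the steps are roughly: import tls.crt into a JKS truststore with keytool (something like keytool -importcert -alias eck -file tls.crt -keystore truststore.jks -storepass changeit -noprompt), make that file available on every executor node, and point the connector at it via its es.net.ssl.truststore.* options. An untested sketch of the modified write (the truststore path and password are placeholders I made up):

# Untested sketch: same write as above, but trusting the ECK CA via a
# JKS truststore instead of allowing self-signed certificates.
options = {
    "es.index.auto.create": "true",
    "es.net.http.auth.user": "elastic",
    "es.net.http.auth.pass": "<password>",
    "es.nodes": "https://<loadbalancer_ip>:9200",
    "es.nodes.wan.only": "true",
    "es.nodes.discovery": "false",
    "es.net.ssl": "true",
    # Placeholder path: the truststore would have to exist at this
    # location on every executor (e.g. shipped via spark-submit --files).
    "es.net.ssl.truststore.location": "file:///path/to/truststore.jks",
    "es.net.ssl.truststore.pass": "changeit",
    "es.resource": "foo/",
}
df.write.mode("overwrite").format("org.elasticsearch.spark.sql").options(**options).save()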
I would appreciate any help or pointers.
Thanks,
Aldo