Apache Spark to query Elasticsearch (https and basic authentication)

Use case:
Query a secure Elasticsearch cluster (HTTPS and basic authentication enabled) using Apache Spark (pyspark and spark-submit).

What I tried:

Start pyspark as follows:
./bin/pyspark --jars ./jars/elasticsearch-hadoop-7.2.0.jar --files /opt/ssl/jkeystore/elastic --driver-class-path /opt/ssl/jkeystore/elastic --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=elastic" --conf "spark.execurot.extraJavaOptions=-Djavax.net.ssl.trustStorePassword=xxxxxx"

Query Elasticsearch as follows:
df = spark.read.format("org.elasticsearch.spark.sql").option("es.nodes","https://elasticsearch:9200").option("es.resource","index/_doc").option("es.read.field.as.array.include","tags").option("es.net.http.auth.user","user").option("es.net.http.auth.pass","password").option("es.net.ssl","true").load()

I'm getting the error below:
Caused by: org.elasticsearch.hadoop.rest.EsHadoopTransportException: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

It looks like Spark is not picking up the truststore settings.
Elasticsearch Hadoop doesn't seem to have any option for adding certificates to a keystore file through its secure settings.

How do I configure it correctly so that I can talk to Elasticsearch?

The secure settings in ES-Hadoop are just for storing password configurations so that they are not in the job configuration as plaintext. ES-Hadoop does support reading truststore and keystore files using these SSL Settings. It's also important to note that when you specify truststores or keystores, the files you reference must either be available on the classpath of the driver and worker processes (in which case they are looked up by name), or be available on every node's local filesystem in the same location (make sure to use the file:///full/path/to/keystore format instead of just the path).
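
As a rough sketch of what that could look like from pyspark: the snippet below passes the truststore through the ES-Hadoop SSL settings (es.net.ssl.truststore.location and es.net.ssl.truststore.pass) rather than through JVM system properties. The helper names es_ssl_options and read_index are made up for illustration, the host, index and credentials are the placeholders from the question, and it assumes the truststore sits at the same absolute path on every node.

```python
def es_ssl_options(truststore_path, truststore_password, user, password):
    """Build ES-Hadoop read options for an HTTPS cluster with basic auth.

    Hypothetical helper, not part of ES-Hadoop. Assumes truststore_path is an
    absolute path that exists at the same location on every node.
    """
    if not truststore_path.startswith("/"):
        raise ValueError("truststore_path must be an absolute path")
    return {
        # Placeholders taken from the question; replace with real values.
        "es.nodes": "https://elasticsearch:9200",
        "es.resource": "index/_doc",
        "es.net.http.auth.user": user,
        "es.net.http.auth.pass": password,
        "es.net.ssl": "true",
        # file:// plus an absolute path gives the file:///full/path form.
        "es.net.ssl.truststore.location": "file://" + truststore_path,
        "es.net.ssl.truststore.pass": truststore_password,
    }


def read_index(spark, options):
    """Apply the options to a DataFrameReader and load the index."""
    reader = spark.read.format("org.elasticsearch.spark.sql")
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader.load()


# Usage inside a pyspark shell started with the elasticsearch-hadoop jar:
# opts = es_ssl_options("/opt/ssl/jkeystore/elastic", "xxxxxx", "user", "password")
# df = read_index(spark, opts)
```

If the truststore is shipped on the classpath instead, the location would be the bare file name rather than a file:/// URL, as described above.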

So how do I communicate with an SSL-enabled Elasticsearch URL using pyspark?
The elasticsearch library for Python has an option to pass the certificate.
