PySpark-Elasticsearch connectivity and latest version compatibility

Hello All,

I'm trying to use Elasticsearch as a source in my PoC. I'm using Python 3.10.10 and Spark 3.3.0.

I'm using a Jupyter notebook.

Below is my code:

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.setAppName("My Jupyter Notebook with Elasticsearch")
# Raw string so the backslashes in the Windows path aren't treated as escapes
conf.set("spark.jars", r"C:\Big_Data_Setup\Spark\jars\elasticsearch-spark-20_2.11-8.6.1")

spark = SparkSession.builder.config(conf=conf).getOrCreate()

df = (
    spark.read
    .format("elasticsearch")
    .option("es.nodes", "https://xxx:9200")
    .option("es.query", '{"query": {"match_all": {}}}')
    .option("inferSchema", "true")
    .option("es.read.field.as.array.include", "tags")
    .option("es.net.http.auth.user", "user")
    .option("es.net.http.auth.pass", "pwd")
    .load("myindex")
)

After running this, I get the error below:
Py4JJavaError: An error occurred while calling o43.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.elasticsearch.spark.sql.DefaultSource15 could not be instantiated
at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:586)
at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:813)
at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:729)
at java.base/java.util.ServiceLoader$3.next(ServiceLoader.java:1403)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:29

As you can see in my code, I have already added the dependency, yet I'm still getting this error.

I'm new to this and would appreciate your help.

Thanks!

How are you starting pyspark? For example, here is how I do it:

/home/elastic/spark-3.2.1-bin-hadoop3.2/bin/pyspark --master yarn --deploy-mode client --jars /home/elastic/elasticsearch-spark-30_2.12-8.6.0.jar

Also, for Spark 3.3.0 you'll want to use either elasticsearch-spark-30_2.12-8.6.1.jar or elasticsearch-spark-30_2.13-8.6.1.jar. The 20 in your version refers to Spark 2.x, and the 2.11 refers to Scala 2.11 (Spark 3.3.0 uses Scala 2.12 or 2.13, I believe). That's not the cause of your current problem, but it will be the next one -- Spark and Scala artifacts generally aren't compatible across versions.
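For reference, here is a minimal sketch of wiring the matching connector jar into a notebook-created session; the jar path is hypothetical and should point at wherever the downloaded elasticsearch-spark-30_2.12-8.6.1.jar actually lives:

from pyspark.sql import SparkSession

# Hypothetical local path to the Spark 3 / Scala 2.12 connector jar
es_jar = "/home/elastic/elasticsearch-spark-30_2.12-8.6.1.jar"

spark = (
    SparkSession.builder
    .appName("es-spark-poc")
    .config("spark.jars", es_jar)  # ship the connector to driver and executors
    .getOrCreate()
)

Note that spark.jars only takes effect when the JVM is first launched; if a SparkSession already exists in the notebook, getOrCreate() returns it unchanged and the jar won't be added.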

Hello Keith,

Thanks for responding. As you can see in my post, I'm using a Jupyter notebook and creating the Spark session myself; I posted the code snippet above.

Is there any good documentation on the end-to-end process for connecting PySpark and Elasticsearch, especially on a local machine (single-node cluster)?

You didn't mention how you're starting Jupyter, but you probably just need to pass that --jars /home/elastic/elasticsearch-spark-30_2.12-8.6.0.jar piece to the Jupyter notebook when you're starting it up, probably via the PYSPARK_DRIVER_PYTHON_OPTS environment variable. The exact way to do that might be better answered by the Jupyter notebook community, though -- I've only used es-spark directly through pyspark personally.
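Another common approach that keeps everything inside the notebook is to put the same --jars flag into the PYSPARK_SUBMIT_ARGS environment variable. A minimal sketch, assuming the (hypothetical) jar path below and that no Spark session has been created yet in the notebook:

import os

# Must be set before any SparkSession/SparkContext is created, because the
# flags are consumed when the JVM is launched; the trailing "pyspark-shell"
# token is required by pyspark.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars /home/elastic/elasticsearch-spark-30_2.12-8.6.1.jar pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()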


Okay, I'll try that. I really appreciate your time and effort.

I have added the required dependencies; however, I'm now getting a different error.

Py4JJavaError: An error occurred while calling o43.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: org.elasticsearch.spark.sql.DefaultSource15 Unable to get public no-arg constructor
at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:586)
at java.base/java.util.ServiceLoader.getConstructor(ServiceLoader.java:679)
at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNextService(ServiceLoader.java:1240)
at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNext(ServiceLoader.java:1273)
at java.base/java.util.ServiceLoader$2.hasNext(ServiceLoader.java:1309)
at java.base/java.util.ServiceLoader$3.hasNext(ServiceLoader.java:1393)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
at scala.collection.Iterator.foreach(Iterator.scala:943)

Please guide me here if you have any leads.

I am guessing that your Spark and/or Scala versions don't match your es-spark jar version. For Spark 3.3.0 you want elasticsearch-spark-30_2.12-8.6.1 or elasticsearch-spark-30_2.13-8.6.1, depending on whether your Spark is using Scala 2.12 or 2.13. Check out Python, Elasticsearch and Apache Spark, Simple data reading - #2 by Keith_Massey.
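If you're unsure which Scala version your Spark runtime was built with, one quick check from the notebook is to ask the JVM directly through Spark's py4j gateway (an internal handle, so treat this as a convenience rather than a supported API):

# Prints something like "version 2.12.15"; pick the matching
# elasticsearch-spark-30_2.12 or _2.13 artifact accordingly.
print(spark.sparkContext._jvm.scala.util.Properties.versionString())

Alternatively, running spark-submit --version from a shell prints the Scala version in its startup banner.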
