After running this, I'm getting the error below:
Py4JJavaError: An error occurred while calling o43.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.elasticsearch.spark.sql.DefaultSource15 could not be instantiated
at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:586)
at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:813)
at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:729)
at java.base/java.util.ServiceLoader$3.next(ServiceLoader.java:1403)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:29
As you can see in my code, I have already added the dependency, but I'm still having this issue.
Also, for Spark 3.3.0 you'll want to use either elasticsearch-spark-30_2.12-8.6.1.jar or elasticsearch-spark-30_2.13-8.6.1.jar. The 20 in your version refers to Spark 2.x, and the 2.11 refers to Scala 2.11 (Spark 3.3.0 uses Scala 2.12 or 2.13, I believe). That's not the cause of your current problem, but it will be the next one -- Spark and Scala artifacts generally aren't compatible across versions.
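If you're not sure which Scala build your Spark is running, here's a quick check from PySpark. This is just a sketch -- the _jvm attribute is an internal PySpark handle, so treat it as a debugging trick rather than a supported API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    print(spark.version)  # e.g. 3.3.0 -> use an elasticsearch-spark-30 artifact
    # Scala version the JVM classes were built against, e.g. "version 2.12.15"
    print(spark.sparkContext._jvm.scala.util.Properties.versionString())

If that prints 2.12.x, pick the _2.12 jar; if 2.13.x, pick the _2.13 jar.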
Thanks for responding. As you can see in my post, I'm using a Jupyter notebook and creating the Spark session there; I posted the code snippet.
Is there any good end-to-end documentation on connecting PySpark and Elasticsearch, especially on a local machine (single-node cluster)?
You didn't mention how you're starting Jupyter, but you probably just need to pass that --jars /home/elastic/elasticsearch-spark-30_2.12-8.6.0.jar piece to the Jupyter notebook when starting it up, probably via the PYSPARK_DRIVER_PYTHON_OPTS environment variable. The exact way to do that might be better answered by the Jupyter notebook community, though -- I've only used es-spark directly through pyspark myself.
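As an alternative, a minimal sketch of attaching the jar from inside the notebook itself, before any session exists (the jar path is the one from your post -- adjust to wherever your jar actually lives):

    from pyspark.sql import SparkSession

    # spark.jars must be set before the session is created; it has no effect
    # on an already-running SparkSession.
    spark = (
        SparkSession.builder
        .appName("es-local")
        .config("spark.jars", "/home/elastic/elasticsearch-spark-30_2.12-8.6.0.jar")
        .getOrCreate()
    )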
I have added the required dependencies; however, now I'm getting a different error.
Py4JJavaError: An error occurred while calling o43.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: org.elasticsearch.spark.sql.DefaultSource15 Unable to get public no-arg constructor
at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:586)
at java.base/java.util.ServiceLoader.getConstructor(ServiceLoader.java:679)
at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNextService(ServiceLoader.java:1240)
at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNext(ServiceLoader.java:1273)
at java.base/java.util.ServiceLoader$2.hasNext(ServiceLoader.java:1309)
at java.base/java.util.ServiceLoader$3.hasNext(ServiceLoader.java:1393)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
at scala.collection.Iterator.foreach(Iterator.scala:943)
I am guessing that your Spark and/or Scala version doesn't match your es-spark jar version. For Spark 3.3.0 you want elasticsearch-spark-30_2.12-8.6.1 or elasticsearch-spark-30_2.13-8.6.1, depending on whether your Spark is using Scala 2.12 or 2.13. Check out Python, Elasticsearch and Apache Spark, Simple data reading - #2 by Keith_Massey.
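Once the versions line up, a read against a local single-node Elasticsearch looks roughly like this. A hedged sketch, not a definitive recipe: the index name my-index is an assumption, and you may need extra es.net.* options if security is enabled on your cluster:

    # Assumes the matching connector jar is already on the classpath.
    df = (
        spark.read.format("org.elasticsearch.spark.sql")
        .option("es.nodes", "localhost")
        .option("es.port", "9200")
        .option("es.nodes.wan.only", "true")  # often needed for a single local node
        .load("my-index")                     # hypothetical index name
    )
    df.show()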