Problem reading from Elasticsearch using Spark SQL

Hi,

I am trying to read from Elasticsearch using Spark SQL and am getting the
exception below.
My environment is CDH 5.3 with Spark 1.2.0 and Elasticsearch 1.4.4.
Since Spark SQL is not officially supported on CDH 5.3, I added the Hive
jars to the Spark classpath in compute-classpath.sh.
I also added elasticsearch-hadoop-2.1.0.Beta3.jar to the Spark classpath in
compute-classpath.sh.
I also tried adding the Hive, elasticsearch-hadoop, and elasticsearch-spark
jars to the SPARK_CLASSPATH environment variable before running spark-submit,
but got the same exception.

Exception in thread "main" java.lang.RuntimeException: Failed to load class for data source: org.elasticsearch.spark.sql
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.sources.CreateTableUsing.run(ddl.scala:99)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:67)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:67)
    at org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:75)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
    at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
    at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:108)
    at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:303)
    at com.informatica.sats.datamgtsrv.Percolator$.main(Percolator.scala:29)
    at com.informatica.sats.datamgtsrv.Percolator.main(Percolator.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The code I am trying to run:

import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._
import org.elasticsearch.spark.sql._

object MyTest {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("MyTest")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)

    // Register the Elasticsearch index/type as a temporary table
    sqlContext.sql("CREATE TEMPORARY TABLE INTERVALS " +
                   "USING org.elasticsearch.spark.sql " +
                   "OPTIONS (resource 'events/intervals')")

    val allRDD = sqlContext.sql("SELECT * FROM INTERVALS")

    // Print every column of every row (output appears on the executors)
    allRDD.foreach(row => row.foreach(elem => print(elem + "\n\n")))
  }
}
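
For reference, the connector's plain RDD API (pulled in by the
import org.elasticsearch.spark._ above) should be able to read the same index
without going through Spark SQL. This is only a sketch based on the es-hadoop
documentation, reusing the sc from the snippet above; I have not verified it
against this exact Beta3 build:

// Read "events/intervals" as an RDD of (document id, field map) pairs
val intervals = sc.esRDD("events/intervals")
intervals.take(10).foreach { case (id, doc) => println(id + " -> " + doc) }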

I checked the Spark source code (org/apache/spark/sql/sources/ddl.scala) and
saw that the run method of the CreateTableUsing class expects to find a
"DefaultSource" class inside the package named by the data source being
loaded.
However, there is no such class in the org.elasticsearch.spark.sql package in
the official Elasticsearch builds.
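
For context, the lookup in ddl.scala is roughly the following (paraphrased
from the Spark 1.2 source, not a verbatim copy; "provider" stands for the
name given in the USING clause):

// Paraphrase of the data-source lookup in CreateTableUsing.run (Spark 1.2)
val provider = "org.elasticsearch.spark.sql"  // the name from the USING clause
val loader = Thread.currentThread().getContextClassLoader
val clazz: Class[_] =
  try {
    loader.loadClass(provider)  // first try the name exactly as given
  } catch {
    case _: ClassNotFoundException =>
      try {
        // then fall back to <provider>.DefaultSource -- the class that is
        // missing from the Beta3 jars
        loader.loadClass(provider + ".DefaultSource")
      } catch {
        case _: ClassNotFoundException =>
          sys.error("Failed to load class for data source: " + provider)
      }
  }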
I checked the following jars:

elasticsearch-spark_2.10-2.1.0.Beta3.jar
elasticsearch-spark_2.10-2.1.0.Beta2.jar
elasticsearch-spark_2.10-2.1.0.Beta1.jar
elasticsearch-hadoop-2.1.0.Beta3.jar

Can you please advise why this problem happens and how to resolve it?

Thanks,
Dmitriy Fingerman


I found that to solve this problem I needed to use the BUILD-SNAPSHOT version
of elasticsearch-hadoop.

After adding the entries below to pom.xml, it started to work.

...
<repository>
  <id>sonatype-oss</id>
  <url>http://oss.sonatype.org/content/repositories/snapshots</url>
  <snapshots>
    <enabled>true</enabled>
  </snapshots>
</repository>
...
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop</artifactId>
  <version>2.1.0.BUILD-SNAPSHOT</version>
</dependency>