Use Spark to index data in HDFS


(Swapnil Narlawar) #1

Hi there,
We are looking for the simplest and fastest way to get data from HDFS into ES.
One method we have been trying, so far without luck, is ES-Hadoop with Spark.
The component versions we are using are:
ES - 1.7.3
Spark - 1.5.2
Scala - 2.10.4
JAVA - 1.7.0_67

-# We start the Spark shell with the following jar files.

./spark-shell --jars esjava/elasticsearch-spark_2.11-2.1.2.jar esjava/elasticsearch-hadoop-mr-2.1.2.jar elasticsearch-hadoop-2.1.2.jar

-# Below we import the following classes

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._
import org.apache.spark.SparkConf
import org.elasticsearch.spark.rdd.EsSpark
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.rdd.RDD
import org.elasticsearch.spark.sql._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext._

-# Below we define our Elasticsearch node

val conf = new SparkConf()
conf.set("es.nodes","hostname:9200")

-# Below we point to the JSON file and convert it to a DataFrame
val sqlContext = new SQLContext(sc)
val df = sqlContext.jsonFile("hdfs://namenode/tmp/2015-11-10.json")

-# Below we validate the schema
df.printSchema()

-# And below we save it to Elasticsearch
df.saveToEs("test/parquet")

-# Right after that we get the following error; we are not sure what we are doing wrong.

Error:
java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
at org.elasticsearch.spark.sql.EsSparkSQL$.saveToEs(EsSparkSQL.scala:42)
at org.elasticsearch.spark.sql.package$SparkDataFrameFunctions.saveToEs(package.scala:25)

Any help is appreciated.


(Costin Leau) #2

./spark-shell --jars esjava/elasticsearch-spark_2.11-2.1.2.jar esjava/elasticsearch-hadoop-mr-2.1.2.jar elasticsearch-hadoop-2.1.2.jar

You are setting the classpath incorrectly - you are pulling in three different jars that overlap in functionality and packages for no reason at all. Use only elasticsearch-spark, as indicated by the docs.

In addition, you are using the elasticsearch-spark jar compiled for Scala 2.11 while running Scala 2.10. The two are not binary compatible, which is what the NoSuchMethodError on scala.Predef$.ArrowAssoc is telling you.
These details indicate you are fairly new to Scala and Spark and are skipping simple yet critical details in the setup. Stop rushing, take a step back and start again, paying attention to details - it might seem that you are moving slowly, but getting stuck on bugs like these is likely going to burn more time.
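
As a rough illustration of the version check described above, here is a small shell sketch that compares the Scala suffix in the jar file name with the Scala version in use. The values are copied from the original post. The variable names are made up for this example, and the jar name in the last line is only an assumption based on the usual `_<scala-binary-version>` naming pattern, so check it against the file you actually downloaded.

```shell
#!/bin/sh
# Values copied from the thread; adjust them for your own setup.
SCALA_VERSION="2.10.4"
JAR="esjava/elasticsearch-spark_2.11-2.1.2.jar"

# Scala binary version = major.minor of the Scala in use (2.10.4 -> 2.10)
scala_binary="${SCALA_VERSION%.*}"

# Text between the last "_" and the following "-" in the jar name (-> 2.11)
jar_suffix="${JAR##*_}"
jar_suffix="${jar_suffix%%-*}"

if [ "$scala_binary" = "$jar_suffix" ]; then
  echo "OK: $JAR matches Scala $scala_binary"
else
  echo "MISMATCH: $JAR is built for Scala $jar_suffix, but Scala $scala_binary is in use"
fi

# Assumed corrected launch line: a single connector jar with the matching suffix.
echo "./spark-shell --jars esjava/elasticsearch-spark_${scala_binary}-2.1.2.jar"
```

With the values from the post, the check reports a mismatch (jar built for 2.11, Scala 2.10 in use). The last line only prints a suggested launch command with one jar; it does not start Spark.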