Environment:
ES: 5.0.0_alpha5
Spark: 1.6.2 ( vanilla-precompiled spark-1.6.2-bin-hadoop2.6.tgz)
Python: 2.6.6
ES-Hadoop: 5.0.0-alpha5
Symptom
PySpark successfully loads the ES index and prints the schema, but reading the DataFrame crashes the job.
Any advice?
Code
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
#
# Environment Variables
#
ESHOST="localhost"
ESINDEX="raw-part-2016.08.16/availability"
conf = SparkConf().setAppName("PythonES-Sample")
conf.set('es.nodes', ESHOST)
conf.set('es.resource', ESINDEX)
conf.set('es.query', "?q=machine_name:m_03")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.format("org.elasticsearch.spark.sql").load(ESINDEX)
df.printSchema() # This works.
ret = df.first() # Crashes here.
print ret
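For reference, the same read can also be expressed by passing the connector settings to the DataFrame reader itself instead of through SparkConf. This is only a sketch assuming the standard dotted ES-Hadoop option keys; it still requires the connector jar on the classpath and a reachable Elasticsearch cluster, and it hits the same crash on df.first():

```python
# Sketch: equivalent read with reader-level options instead of SparkConf.
# Assumes the same sqlContext, ESHOST, and ESINDEX as above.
df = (sqlContext.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", ESHOST)                  # ES host, as above
      .option("es.query", "?q=machine_name:m_03")  # same Lucene query string
      .load(ESINDEX))                              # "index/type" resource string
```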
Log
[m-kiuchi@estest ~]$ ~/spark/bin/spark-submit --driver-class-path=/home/m-kiuchi/es-hadoop/dist/elasticsearch-hadoop-5.0.0-alpha5.jar pyspark-es-4.py
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
16/08/21 01:51:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
root
|-- parts_name: string (nullable = true)
|-- machine_name: string (nullable = true)
|-- total_rot_360deg: long (nullable = true)
|-- unix_time: timestamp (nullable = true)
16/08/21 01:51:44 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.<init>(AbstractEsRDDIterator.scala:10)
at org.elasticsearch.spark.sql.ScalaEsRowRDDIterator.<init>(ScalaEsRowRDD.scala:31)
at org.elasticsearch.spark.sql.ScalaEsRowRDD.compute(ScalaEsRowRDD.scala:27)
at org.elasticsearch.spark.sql.ScalaEsRowRDD.compute(ScalaEsRowRDD.scala:20)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: scala.collection.GenTraversableOnce$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 18 more
(...snip...)
16/08/21 01:51:44 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job