Crash when reading DataFrame


(M Kiuchi) #1

Environment:
ES: 5.0.0-alpha5
Spark: 1.6.2 (vanilla pre-compiled spark-1.6.2-bin-hadoop2.6.tgz)
Python: 2.6.6
ES-Hadoop: 5.0.0-alpha5

Symptom
Spark with Python succeeds in loading the ES index (printSchema() works), but reading the DataFrame fails and the job crashes.

Any advice?

Code

#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
#
# Environment Variables
#
ESHOST="localhost"
ESINDEX="raw-part-2016.08.16/availability"
conf = SparkConf().setAppName("PythonES-Sample")
# elasticsearch-hadoop settings use dotted keys (es.nodes, es.resource, es.query)
conf.set('es.nodes', ESHOST)
conf.set('es.resource', ESINDEX)
conf.set('es.query', "?q=machine_name:m_03")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df = sqlContext.read.format("org.elasticsearch.spark.sql").load("raw-part-2016.08.16/availability")
df.printSchema() # It works.

ret = df.first() # crashes here
print ret

Log

[m-kiuchi@estest ~]$ ~/spark/bin/spark-submit --driver-class-path=/home/m-kiuchi/es-hadoop/dist/elasticsearch-hadoop-5.0.0-alpha5.jar pyspark-es-4.py
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
16/08/21 01:51:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
root
 |-- parts_name: string (nullable = true)
 |-- machine_name: string (nullable = true)
 |-- total_rot_360deg: long (nullable = true)
 |-- unix_time: timestamp (nullable = true)

16/08/21 01:51:44 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class
        at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.<init>(AbstractEsRDDIterator.scala:10)
        at org.elasticsearch.spark.sql.ScalaEsRowRDDIterator.<init>(ScalaEsRowRDD.scala:31)
        at org.elasticsearch.spark.sql.ScalaEsRowRDD.compute(ScalaEsRowRDD.scala:27)
        at org.elasticsearch.spark.sql.ScalaEsRowRDD.compute(ScalaEsRowRDD.scala:20)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: scala.collection.GenTraversableOnce$class
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 18 more
(...snip...)
16/08/21 01:51:44 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job

(James Baiera) #2

This error normally occurs when working with mixed Scala versions (2.10 vs. 2.11). What version of Scala are you currently using? Are your artifact and framework versions compatible with that version?
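
One quick way to spot such a mismatch is to look at the jar names on the classpath, since Scala-specific artifacts conventionally carry the Scala binary version as a `_2.10` / `_2.11` suffix. The following is only a minimal sketch of that check: the helper names and the `example-connector` jar names are hypothetical, and a jar without a suffix (such as the one passed to spark-submit in post #1) cannot be judged from its name alone.

```python
import re

# Scala-specific artifacts conventionally end their artifact id with the
# Scala binary version, e.g. "example-connector_2.11-1.0.0.jar".
_SCALA_SUFFIX = re.compile(r"_(\d+\.\d+)(?=[-.])")


def jar_scala_suffix(jar_name):
    """Return the Scala binary version encoded in a jar name, or None."""
    match = _SCALA_SUFFIX.search(jar_name)
    return match.group(1) if match else None


def check_jar(jar_name, runtime_scala):
    """Classify a jar as 'ok', 'mismatch', or 'unknown' for a Scala version."""
    suffix = jar_scala_suffix(jar_name)
    if suffix is None:
        # No suffix in the name: the file name alone tells us nothing.
        return "unknown"
    return "ok" if suffix == runtime_scala else "mismatch"


if __name__ == "__main__":
    # Hypothetical jar names, checked against a Scala 2.10 runtime.
    for name in ("example-connector_2.10-1.0.0.jar",
                 "example-connector_2.11-1.0.0.jar",
                 "elasticsearch-hadoop-5.0.0-alpha5.jar"):
        print("%-40s %s" % (name, check_jar(name, "2.10")))
```

A result of "unknown" means the jar has to be inspected another way, for example by checking the artifact's documentation for the Scala version it was built against.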


(M Kiuchi) #3

Thanks for the reply. I'm using the pre-compiled Spark 1.6.2 (for Hadoop 2.6), so I hadn't paid attention to which Scala version I was using...
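
For a pre-compiled distribution, the Scala version can be read from the running JVM instead of guessed. The sketch below is one possible way to do that from PySpark, not an official API: it relies on `sc._jvm`, a private PySpark attribute exposing the py4j view of the driver JVM, and the helper names are made up for illustration. The Spark 1.6.2 documentation states that this release uses Scala 2.10, so that is what a vanilla build should report.

```python
import re


def scala_binary_version(version_string):
    """Reduce a full Scala version such as 'version 2.10.5' to '2.10'."""
    match = re.search(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError("no Scala version found in %r" % (version_string,))
    return "%s.%s" % match.groups()


def spark_scala_version(sc):
    """Ask the driver JVM behind a SparkContext which Scala it runs on.

    Assumption: sc._jvm is a private attribute (the py4j JVM view), so this
    may change between PySpark releases.
    """
    raw = sc._jvm.scala.util.Properties.versionString()
    return scala_binary_version(raw)
```

In the script from post #1, calling `spark_scala_version(sc)` right after the SparkContext is created would show which Scala binary version the ES-Hadoop artifact needs to match.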

