Error load as a DataFrame

(Robert) #1

I am trying to load an index as a DataFrame with this command:
val sqlc = new SQLContext(sc)
val usersDF = sqlc.esDF("users-2016/person")

After that, I register it as a temp table: usersDF.registerTempTable("users")

Then I try to execute a simple query:
val result = sqlc.sql("SELECT count(*) FROM users")

but I always get the same error:
java.util.NoSuchElementException: key not found: name

usersDF.printSchema() returns this:

|-- @timestamp: timestamp (nullable = true)
|-- @version: string (nullable = true)
|-- dst_ip: string (nullable = true)
|-- name: string (nullable = true)
|-- request: string (nullable = true)
|-- last_name: string (nullable = true)

The index has 8 million documents.
How can I solve this problem?

(Costin Leau) #2

What version of ES-Hadoop are you using, and what version of Spark? Also, what does your mapping in ES look like?
And what is the full stack trace that you get?

(Robert) #3

First, thanks for answering. The Spark version is 1.4.2 and ES-Hadoop is 2.2.0-m1. I think the problem is that not all fields are present in the Elasticsearch index; for example, the field "name" is missing from 500k documents.
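If sparse fields turn out to be the culprit, one workaround (a sketch, not an official fix for this bug) is to read only the fields you need via the connector's `es.read.field.include` option, using the generic DataFrame reader instead of `esDF`. The index name here matches the one from the question; adjust the option value to your own fields:

```scala
import org.apache.spark.sql.SQLContext

val sqlc = new SQLContext(sc)

// Restrict the connector to fields known to exist in every document,
// so documents missing "name" do not trip up schema-based row building.
val usersDF = sqlc.read
  .format("org.elasticsearch.spark.sql")
  .option("es.read.field.include", "dst_ip,request,last_name")
  .load("users-2016/person")
```

Whether this sidesteps the `key not found` error depends on how the m1 release handles absent keys, so upgrading (as suggested below) is the first thing to try.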

(Costin Leau) #4

Please use ES-Hadoop version 2.2 GA. A lot of things have been fixed since m1 (which was a milestone release).

(Robert) #5


I've upgraded to 2.2.0 GA and now I get this error:

16/02/22 10:44:12 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.elasticsearch.hadoop.EsHadoopIllegalStateException: Field 'positions.start_date' not found; typically this occurs with arrays which are not mapped as single value
at org.elasticsearch.spark.sql.RowValueReader$class.rowColumns(RowValueReader.scala:34)
at org.elasticsearch.spark.sql.ScalaRowValueReader.rowColumns(ScalaEsRowValueReader.scala:14)
at org.elasticsearch.spark.sql.ScalaRowValueReader.createMap(ScalaEsRowValueReader.scala:42)
at org.elasticsearch.hadoop.serialization.ScrollReader.list(
at org.elasticsearch.hadoop.serialization.ScrollReader.readHitAsMap(
at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:43)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.executor.Executor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
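The exception message points at fields that hold arrays without the connector knowing about it. ES-Hadoop provides the `es.read.field.as.array.include` setting for exactly this case; a sketch, assuming `positions` is the array field in this mapping:

```scala
import org.apache.spark.sql.SQLContext

val sqlc = new SQLContext(sc)

// Tell the connector which fields contain arrays, since Elasticsearch
// mappings do not distinguish single values from arrays.
val usersDF = sqlc.read
  .format("org.elasticsearch.spark.sql")
  .option("es.read.field.as.array.include", "positions")
  .load("users-2016/person")
```

Nested paths like `positions.start_date` are covered by declaring the parent field; whether that applies here depends on the actual mapping, which is what the follow-up below asks about.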

(Costin Leau) #6

Repeating myself: what is your mapping in ES? What do the docs look like? Do they contain arrays (as the exception message indicates)?

(system) #7