Error loading an index as a DataFrame


(Robert) #1

I try to load an index as a DataFrame with these commands:

import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark.sql._  // brings in the esDF implicit

val sqlc = new SQLContext(sc)
val usersDF = sqlc.esDF("users-2016/person")

After that, I register it as a temp table: usersDF.registerTempTable("users")

and I try to execute a simple query:
val result = sqlc.sql("SELECT count(*) FROM users")
result.show()

but... I always get the same error:
java.util.NoSuchElementException: key not found: name

and usersDF.printSchema() returns this:

root
|-- @timestamp: timestamp (nullable = true)
|-- @version: string (nullable = true)
|-- dst_ip: string (nullable = true)
|-- name: string (nullable = true)
|-- request: string (nullable = true)
|-- last_name: string (nullable = true)

The index has 8 million documents.
How can I solve this problem?
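
For reference, esDF also accepts a configuration map for tuning how documents are read; a minimal sketch, with placeholder option values (not the settings from my job):

// hypothetical read options passed directly to esDF
val cfg = Map(
  "es.nodes" -> "localhost"  // ES host (placeholder, adjust to your cluster)
)
val usersDF2 = sqlc.esDF("users-2016/person", cfg)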


(Costin Leau) #2

What version of ES-Hadoop are you using, and what version of Spark? Also, what does your mapping in ES look like?
And what is the full stack trace that you get?


(Robert) #3

First, thanks for answering. The Spark version is 1.4.2 and ES-Hadoop is 2.2.0-m1. I think the problem is that not all fields are present in the Elasticsearch index; for example, the field "name" is missing from 500k documents.
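
If the sparse fields are the cause, one possible workaround (a sketch, not something verified on this index) is to limit the read to fields that exist in every document, using the connector's es.read.field.include setting:

// restrict the read to fields known to be present in all documents
// (field names taken from the printSchema output above)
val cfg = Map("es.read.field.include" -> "dst_ip,request,last_name")
val safeDF = sqlc.esDF("users-2016/person", cfg)
safeDF.registerTempTable("users_safe")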


(Costin Leau) #4

Please use ES-Hadoop version 2.2 GA. A lot of things have been fixed since m1 (which was a milestone release).
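
For anyone following along, upgrading just means swapping the artifact version; with sbt and the Scala 2.10 build that Spark 1.x uses by default, the dependency would look roughly like this:

// build.sbt -- ES-Hadoop Spark connector, 2.2.0 GA (artifact name assumes Scala 2.10)
libraryDependencies += "org.elasticsearch" % "elasticsearch-spark_2.10" % "2.2.0"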


(Robert) #5

Hi,

I've used 2.2.0 GA and now I get this error:

16/02/22 10:44:12 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.elasticsearch.hadoop.EsHadoopIllegalStateException: Field 'positions.start_date' not found; typically this occurs with arrays which are not mapped as single value
at org.elasticsearch.spark.sql.RowValueReader$class.rowColumns(RowValueReader.scala:34)
at org.elasticsearch.spark.sql.ScalaRowValueReader.rowColumns(ScalaEsRowValueReader.scala:14)
at org.elasticsearch.spark.sql.ScalaRowValueReader.createMap(ScalaEsRowValueReader.scala:42)
at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:766)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:692)
at org.elasticsearch.hadoop.serialization.ScrollReader.list(ScrollReader.java:734)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:687)
at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:794)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:692)
at org.elasticsearch.hadoop.serialization.ScrollReader.readHitAsMap(ScrollReader.java:457)
at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:382)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:277)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:250)
at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:456)
at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:86)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:43)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


(Costin Leau) #6

Repeating myself: what is your mapping in ES? What do the docs look like? Do they contain arrays (as the exception message indicates)?
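
For readers who hit the same "Field ... not found" exception: when a field such as positions holds arrays but the mapping cannot express that, the connector can be told explicitly via the es.read.field.as.array.include setting (documented for ES-Hadoop 2.2). A sketch, using the field name from the stack trace above:

// declare that 'positions' contains arrays so the Spark SQL schema matches the data
val cfg = Map("es.read.field.as.array.include" -> "positions")
val usersDF = sqlc.esDF("users-2016/person", cfg)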

