Hello, I have a question that has been bothering me. I have used the elasticsearch-spark connector to create a DataFrame from one of my indexes. The schema prints fine (see below). However, whenever I try to do anything with the DataFrame I get the exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 34.0 failed 1 times, most recent failure: Lost task 0.0 in stage 34.0 (TID 108, localhost, executor driver): org.elasticsearch.hadoop.EsHadoopIllegalStateException: Position for 'oedocumentrecordset.md5sum' not found in row; typically this is caused by a mapping inconsistency
The field named in the exception varies; if I just try a count I get a different field name. The index works perfectly fine in Kibana, via curl commands, and via other code. I found a bug reported on GitHub about this, but it is marked as fixed. I get the same issue if I use Spark SQL to read the DataFrame.
Here is my code snippet:
val esdf = spark.read.format("es").options(esParams).load("coalesce-oedocument")
esdf.createOrReplaceTempView("esdocs")
esdf.printSchema()
esdf.groupBy("oedocumentrecordset.documentsource").count().show()
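
One thing I am considering trying as a workaround (untested, and I am only assuming it is relevant here) is narrowing the fields the connector reads with the `es.read.field.include` option from the es-hadoop configuration, on the theory that the "position not found" error comes from the connector's projected schema drifting from the fields actually returned per document:

```scala
// Hypothetical workaround sketch (not verified against my cluster):
// restrict the read to only the fields the query needs, so the connector's
// row layout cannot disagree with fields missing from some documents.
val esdfNarrow = spark.read
  .format("es")
  .options(esParams)
  .option("es.read.field.include",
          "oedocumentrecordset.documentsource,oedocumentrecordset.md5sum")
  .load("coalesce-oedocument")

esdfNarrow.groupBy("oedocumentrecordset.documentsource").count().show()
```

I don't know yet whether this avoids the exception or just changes which field it complains about.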
Here is the schema and accompanying stack trace.
root
|-- coalesceentity: struct (nullable = true)
| |-- datecreated: timestamp (nullable = true)
| |-- entityid: string (nullable = true)
| |-- entityidtype: string (nullable = true)
| |-- lastmodified: timestamp (nullable = true)
| |-- name: string (nullable = true)
| |-- objectkey: string (nullable = true)
| |-- source: string (nullable = true)
| |-- status: string (nullable = true)
| |-- title: string (nullable = true)
| |-- version: string (nullable = true)
|-- oedocumentrecordset: struct (nullable = true)
| |-- categories: string (nullable = true)
| |-- content: string (nullable = true)
| |-- contentlength: long (nullable = true)
| |-- datasource: string (nullable = true)
| |-- dateingested: timestamp (nullable = true)
| |-- documentdate: timestamp (nullable = true)
| |-- documentlastmodifieddate: timestamp (nullable = true)
| |-- documentsource: string (nullable = true)
| |-- documenttitle: string (nullable = true)
| |-- documenttype: string (nullable = true)
| |-- issimulation: boolean (nullable = true)
| |-- md5sum: string (nullable = true)
| |-- ner_date: string (nullable = true)
| |-- ner_location: string (nullable = true)
| |-- ner_money: string (nullable = true)
| |-- ner_organization: string (nullable = true)
| |-- ner_percent: string (nullable = true)
| |-- ner_person: string (nullable = true)
| |-- ner_time: string (nullable = true)
| |-- ontologyreference: string (nullable = true)
| |-- pmesiipteconomic: float (nullable = true)
| |-- pmesiiptinformation: float (nullable = true)
| |-- pmesiiptinfrastructure: float (nullable = true)
| |-- pmesiiptmilitary: float (nullable = true)
| |-- pmesiiptphysicalenvironment: float (nullable = true)
| |-- pmesiiptpolitical: float (nullable = true)
| |-- pmesiiptsocial: float (nullable = true)
| |-- pmesiipttime: float (nullable = true)
| |-- sourceuri: string (nullable = true)
| |-- tags: string (nullable = true)
| |-- wordcount: long (nullable = true)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 34.0 failed 1 times, most recent failure: Lost task 0.0 in stage 34.0 (TID 108, localhost, executor driver): org.elasticsearch.hadoop.EsHadoopIllegalStateException: Position for 'oedocumentrecordset.md5sum' not found in row; typically this is caused by a mapping inconsistency
at org.elasticsearch.spark.sql.RowValueReader$class.addToBuffer(RowValueReader.scala:60)
at org.elasticsearch.spark.sql.ScalaRowValueReader.addToBuffer(ScalaEsRowValueReader.scala:32)
at org.elasticsearch.spark.sql.ScalaRowValueReader.addToMap(ScalaEsRowValueReader.scala:118)
at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:810)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:700)
at org.elasticsearch.hadoop.serialization.ScrollReader.readHitAsMap(ScrollReader.java:466)
at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:391)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:286)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:259)
at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:365)
at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:92)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:61)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)