Elasticsearch-spark - EsHadoopIllegalStateException - field position

Hello, I have a question that has been bothering me. I have used the Elasticsearch Spark connector to create a DataFrame for one of my indexes. The schema prints fine (see below). However, whenever I try to do anything with the DataFrame I get this exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 34.0 failed 1 times, most recent failure: Lost task 0.0 in stage 34.0 (TID 108, localhost, executor driver): org.elasticsearch.hadoop.EsHadoopIllegalStateException: Position for 'oedocumentrecordset.md5sum' not found in row; typically this is caused by a mapping inconsistency

The field mentioned varies; if I just try to do a count I get a different field name. The index works perfectly fine in Kibana, via curl commands, and via other code. I found a bug reported on GitHub about this, but it is marked as fixed. I get the same issue if I use Spark SQL to query the DataFrame (see the sketch after the snippet below).

Here is my code snippet:

val esdf = spark.read.format("es").options(esParams).load("coalesce-oedocument")
esdf.createOrReplaceTempView("esdocs")

esdf.printSchema()

esdf.groupBy("oedocumentrecordset.documentsource").count().show()
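For completeness, the Spark SQL path mentioned above fails the same way. It was roughly the following (the exact query I ran may have differed slightly):

// Equivalent aggregation through the esdocs temp view registered above
spark.sql("SELECT oedocumentrecordset.documentsource, COUNT(*) AS cnt FROM esdocs GROUP BY oedocumentrecordset.documentsource").show()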

Here is the schema and accompanying stack trace.

root
|-- coalesceentity: struct (nullable = true)
| |-- datecreated: timestamp (nullable = true)
| |-- entityid: string (nullable = true)
| |-- entityidtype: string (nullable = true)
| |-- lastmodified: timestamp (nullable = true)
| |-- name: string (nullable = true)
| |-- objectkey: string (nullable = true)
| |-- source: string (nullable = true)
| |-- status: string (nullable = true)
| |-- title: string (nullable = true)
| |-- version: string (nullable = true)
|-- oedocumentrecordset: struct (nullable = true)
| |-- categories: string (nullable = true)
| |-- content: string (nullable = true)
| |-- contentlength: long (nullable = true)
| |-- datasource: string (nullable = true)
| |-- dateingested: timestamp (nullable = true)
| |-- documentdate: timestamp (nullable = true)
| |-- documentlastmodifieddate: timestamp (nullable = true)
| |-- documentsource: string (nullable = true)
| |-- documenttitle: string (nullable = true)
| |-- documenttype: string (nullable = true)
| |-- issimulation: boolean (nullable = true)
| |-- md5sum: string (nullable = true)
| |-- ner_date: string (nullable = true)
| |-- ner_location: string (nullable = true)
| |-- ner_money: string (nullable = true)
| |-- ner_organization: string (nullable = true)
| |-- ner_percent: string (nullable = true)
| |-- ner_person: string (nullable = true)
| |-- ner_time: string (nullable = true)
| |-- ontologyreference: string (nullable = true)
| |-- pmesiipteconomic: float (nullable = true)
| |-- pmesiiptinformation: float (nullable = true)
| |-- pmesiiptinfrastructure: float (nullable = true)
| |-- pmesiiptmilitary: float (nullable = true)
| |-- pmesiiptphysicalenvironment: float (nullable = true)
| |-- pmesiiptpolitical: float (nullable = true)
| |-- pmesiiptsocial: float (nullable = true)
| |-- pmesiipttime: float (nullable = true)
| |-- sourceuri: string (nullable = true)
| |-- tags: string (nullable = true)
| |-- wordcount: long (nullable = true)

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 34.0 failed 1 times, most recent failure: Lost task 0.0 in stage 34.0 (TID 108, localhost, executor driver): org.elasticsearch.hadoop.EsHadoopIllegalStateException: Position for 'oedocumentrecordset.md5sum' not found in row; typically this is caused by a mapping inconsistency
at org.elasticsearch.spark.sql.RowValueReader$class.addToBuffer(RowValueReader.scala:60)
at org.elasticsearch.spark.sql.ScalaRowValueReader.addToBuffer(ScalaEsRowValueReader.scala:32)
at org.elasticsearch.spark.sql.ScalaRowValueReader.addToMap(ScalaEsRowValueReader.scala:118)
at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:810)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:700)
at org.elasticsearch.hadoop.serialization.ScrollReader.readHitAsMap(ScrollReader.java:466)
at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:391)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:286)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:259)
at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:365)
at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:92)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:61)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)

All, I have since updated to the latest version: elasticsearch-spark-20_2.11-6.6.2.jar

Same issue. I have tried this with several indexes and get the same error.

Looking for any help here.

My guess is that not all documents in the index have all of the fields listed in the mapping. Based on the mapping, the DataFrame expects those fields to be present in every document it reads from the index. The error shows up every time you run an operation/action on the DataFrame; the read/load itself succeeds because it is lazy.
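If missing fields really are the cause, one possible workaround (untested here, and reusing the esParams map from your snippet) is to narrow the read to only the fields the job actually needs, using the connector's es.read.field.include option:

// Hypothetical workaround sketch: only parse the fields we actually use,
// so documents that lack other mapped fields don't trip the position lookup.
val narrowParams = esParams ++ Map(
  "es.read.field.include" -> "oedocumentrecordset.documentsource"
)
val narrowDf = spark.read.format("es").options(narrowParams).load("coalesce-oedocument")
narrowDf.groupBy("oedocumentrecordset.documentsource").count().show()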

@bytemedwb Are you able to reproduce this with a simpler mapping? Could you provide a test document and mapping (fake data), and some steps that would reproduce this error?

Additionally, are you using dots in your field names, or is the data using an object structure? Right now there are some issues with supporting fields with dots in their names, due to how the source document structure differs from the mappings.
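For example, a self-contained reproduction along these lines would be ideal (everything here is made up: the index name, field names, and whether a missing nested field actually triggers the error are all assumptions):

// Hypothetical reproduction sketch, written against a throwaway index.
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._

val spark = SparkSession.builder().appName("es-repro").getOrCreate()
import spark.implicits._

// Two fake documents, one of which is missing the nested md5sum field.
case class Inner(documentsource: String, md5sum: Option[String])
case class Doc(name: String, oedocumentrecordset: Inner)

val docs = Seq(
  Doc("doc-1", Inner("sourceA", Some("abc123"))),
  Doc("doc-2", Inner("sourceB", None)) // md5sum absent in this document
).toDF()

docs.saveToEs("es-repro-test/doc") // write to a throwaway test index

// Read the index back and run the aggregation that fails on the real data.
val readBack = spark.read.format("es").load("es-repro-test/doc")
readBack.groupBy("oedocumentrecordset.documentsource").count().show()

Steps like these, together with the exact connector and Spark versions, would let us try to reproduce the error.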
