Hello, I have a question that has been bothering me. I have used the elasticsearch-spark connector to create a DataFrame from one of my indexes. The schema prints fine (see below). However, whenever I try to do anything with the DataFrame I get the exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 34.0 failed 1 times, most recent failure: Lost task 0.0 in stage 34.0 (TID 108, localhost, executor driver): org.elasticsearch.hadoop.EsHadoopIllegalStateException: Position for 'oedocumentrecordset.md5sum' not found in row; typically this is caused by a mapping inconsistency
The field named in the exception varies; if I just try a count I get a different field name. The index works perfectly fine in Kibana, via curl commands, and via other code. I found a bug reported on GitHub about this, but it is marked as fixed. I get the same issue if I use Spark SQL to read the DataFrame.
Here is my code snippet:
val esdf = spark.read.format("es").options(esParams).load("coalesce-oedocument")
esdf.createOrReplaceTempView("esdocs")
esdf.printSchema()
esdf.groupBy("oedocumentrecordset.documentsource").count().show()
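
One thing I am considering trying as a workaround (untested, and I am only assuming it is relevant here) is narrowing the fields the connector reads with the `es.read.field.include` option from the es-hadoop configuration, on the theory that the "position not found" error comes from the connector's projected schema drifting from the fields actually returned per document:

```scala
// Hypothetical workaround sketch (not verified against my cluster):
// restrict the read to only the fields the query needs, so the connector's
// row layout cannot disagree with fields missing from some documents.
val esdfNarrow = spark.read
  .format("es")
  .options(esParams)
  .option("es.read.field.include",
          "oedocumentrecordset.documentsource,oedocumentrecordset.md5sum")
  .load("coalesce-oedocument")

esdfNarrow.groupBy("oedocumentrecordset.documentsource").count().show()
```

I don't know yet whether this avoids the exception or just changes which field it complains about.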
Here is the schema and accompanying stack trace.
root
|-- coalesceentity: struct (nullable = true)
| |-- datecreated: timestamp (nullable = true)
| |-- entityid: string (nullable = true)
| |-- entityidtype: string (nullable = true)
| |-- lastmodified: timestamp (nullable = true)
| |-- name: string (nullable = true)
| |-- objectkey: string (nullable = true)
| |-- source: string (nullable = true)
| |-- status: string (nullable = true)
| |-- title: string (nullable = true)
| |-- version: string (nullable = true)
|-- oedocumentrecordset: struct (nullable = true)
| |-- categories: string (nullable = true)
| |-- content: string (nullable = true)
| |-- contentlength: long (nullable = true)
| |-- datasource: string (nullable = true)
| |-- dateingested: timestamp (nullable = true)
| |-- documentdate: timestamp (nullable = true)
| |-- documentlastmodifieddate: timestamp (nullable = true)
| |-- documentsource: string (nullable = true)
| |-- documenttitle: string (nullable = true)
| |-- documenttype: string (nullable = true)
| |-- issimulation: boolean (nullable = true)
| |-- md5sum: string (nullable = true)
| |-- ner_date: string (nullable = true)
| |-- ner_location: string (nullable = true)
| |-- ner_money: string (nullable = true)
| |-- ner_organization: string (nullable = true)
| |-- ner_percent: string (nullable = true)
| |-- ner_person: string (nullable = true)
| |-- ner_time: string (nullable = true)
| |-- ontologyreference: string (nullable = true)
| |-- pmesiipteconomic: float (nullable = true)
| |-- pmesiiptinformation: float (nullable = true)
| |-- pmesiiptinfrastructure: float (nullable = true)
| |-- pmesiiptmilitary: float (nullable = true)
| |-- pmesiiptphysicalenvironment: float (nullable = true)
| |-- pmesiiptpolitical: float (nullable = true)
| |-- pmesiiptsocial: float (nullable = true)
| |-- pmesiipttime: float (nullable = true)
| |-- sourceuri: string (nullable = true)
| |-- tags: string (nullable = true)
| |-- wordcount: long (nullable = true)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 34.0 failed 1 times, most recent failure: Lost task 0.0 in stage 34.0 (TID 108, localhost, executor driver): org.elasticsearch.hadoop.EsHadoopIllegalStateException: Position for 'oedocumentrecordset.md5sum' not found in row; typically this is caused by a mapping inconsistency
at org.elasticsearch.spark.sql.RowValueReader$class.addToBuffer(RowValueReader.scala:60)
at org.elasticsearch.spark.sql.ScalaRowValueReader.addToBuffer(ScalaEsRowValueReader.scala:32)
at org.elasticsearch.spark.sql.ScalaRowValueReader.addToMap(ScalaEsRowValueReader.scala:118)
at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:810)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:700)
at org.elasticsearch.hadoop.serialization.ScrollReader.readHitAsMap(ScrollReader.java:466)
at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:391)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:286)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:259)
at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:365)
at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:92)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:61)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)