HIVE-Elasticsearch Error Handler for Malformed records


We are trying to ingest data from Hive into Elasticsearch. We had issues ingesting the data because a few of the records are malformed JSON.

We followed the elasticsearch-hadoop documentation on handling bad records and created the DDL script below to ingest the data.

We are using Elasticsearch version 6.8.0.

hive> select * from provider1;
Time taken: 0.179 seconds, Fetched: 14 row(s)

ADD JAR /home/smrafi/elasticsearch-hadoop-6.8.0/dist/elasticsearch-hadoop-6.8.0.jar;
CREATE EXTERNAL TABLE hive_es_with_handler (data STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource' = 'test_eshadoop/healthCareProvider',
  'es.nodes' = 'xyz',
  'es.input.json' = 'yes',
  '' = 'true',
  'es.write.operation' = 'upsert',
  'es.nodes.wan.only' = 'true',
  'es.port' = '443',
  '' = 'true',
  'es.batch.size.entries' = '1',
  '' = 'id',
  'es.batch.write.retry.count' = '-1',
  '' = 'es, ignoreBadRecords',
  '' = 'customLog',
  '' = '',
  '' = 'BulkErrors',
  '' = 'SerializationErrors',
  '' = 'com.verisys.elshandler.IgnoreBadRecordHandler',

insert into hive_es_with_handler10 select * from provider1;

Below is the exception trace. The job failed, complaining that the error.handler index is not present.

Caused by: org.elasticsearch.hadoop.serialization.EsHadoopSerializationException: org.codehaus.jackson.JsonParseException: Unexpected character (',' (code 44)): was expecting a colon to separate field name and value
 at [Source: [B@1e3f0aea; line: 1, column: 7]
	at org.elasticsearch.hadoop.serialization.json.JacksonJsonParser.nextToken(
	at org.elasticsearch.hadoop.serialization.ParsingUtils.doFind(
	at org.elasticsearch.hadoop.serialization.ParsingUtils.values(
	at org.elasticsearch.hadoop.serialization.field.JsonFieldExtractors.process(
	at org.elasticsearch.hadoop.serialization.bulk.JsonTemplatedBulk.preProcess(
	at org.elasticsearch.hadoop.serialization.bulk.TemplatedBulk.write(
	at org.elasticsearch.hadoop.hive.EsSerDe.serialize(
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(
	at org.apache.hadoop.hive.ql.exec.Operator.forward(
	at org.apache.hadoop.hive.ql.exec.SelectOperator.process(
	at org.apache.hadoop.hive.ql.exec.Operator.forward(
	at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(
	at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(
	at org.apache.hadoop.hive.ql.exec.MapOperator.process(
	... 9 more
Caused by: org.codehaus.jackson.JsonParseException: Unexpected character (',' (code 44)): was expecting a colon to separate field name and value
 at [Source: [B@1e3f0aea; line: 1, column: 7]
	at org.codehaus.jackson.JsonParser._constructError(
	at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(
	at org.codehaus.jackson.impl.JsonParserMinimalBase._reportUnexpectedChar(
	at org.codehaus.jackson.impl.Utf8StreamParser.nextToken(
	at org.elasticsearch.hadoop.serialization.json.JacksonJsonParser.nextToken(
	... 22 more

I tried to use a custom SerializationErrorHandler, but it has no effect: the handler never comes into play. The job stops completely instead of continuing with the good records, even though the handler returns the HANDLED constant by default.

It looks like you are using the AWS-provided Elasticsearch. You will need to ask them about this, as it's a fork that they run and we do not know what changes they have made to the product.

Thanks Mark. I have seen the documentation, which says the following:

Serialization Error Handlers

Before sending data to Elasticsearch, elasticsearch-hadoop must serialize each document into a JSON bulk entry. It is during this process that the bulk operation is determined, document metadata is extracted, and integration specific data structures are converted into JSON documents. During this process, inconsistencies with record structure can cause exceptions to be thrown during the serialization process. These errors often lead to failed tasks and halted processing.

Elasticsearch for Apache Hadoop provides an API to handle serialization errors at the record level. Error handlers for serialization are given:

  • The integration specific data structure that was unable to be serialized
  • Exception encountered during serialization

Serialization Error Handlers are not yet available for Hive. Elasticsearch for Apache Hadoop uses Hive’s SerDe constructs to convert data into bulk entries before being sent to the output format. SerDe objects do not have a cleanup method that is called when the object ends its lifecycle. Because of this, we do not support serialization error handlers in Hive as they cannot be closed at the end of the job execution.

It was working for wrong types inside a document; those errors were properly caught by the custom error handlers. I assumed a malformed JSON document would be treated the same way, but on checking, it goes through a pre-processing step (field extraction) first, which is where it fails.
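
Since the documentation quoted above says serialization error handlers are not available for Hive, one possible workaround is to keep malformed rows away from the ES-backed table in the first place. The following is only a minimal sketch, not something confirmed in this thread. It assumes that provider1 has a single string column named data (hypothetical, the thread only shows select *), that every valid document carries a top-level id field (also an assumption), and that the target is the hive_es_with_handler table created above. It relies on Hive's built-in get_json_object, which returns NULL when the input string is not valid JSON.

```sql
-- Sketch only: skip rows that Hive itself cannot parse as JSON, so they never
-- reach EsSerDe.serialize. Column name `data` and field `id` are assumptions.
INSERT INTO hive_es_with_handler
SELECT data
FROM provider1
WHERE get_json_object(data, '$.id') IS NOT NULL;

-- Optional: inspect the rows that would be skipped by the filter above.
SELECT data
FROM provider1
WHERE get_json_object(data, '$.id') IS NULL;
```

Note that this filter also drops well-formed documents that have no id field, and Hive's JSON parser and the one used by elasticsearch-hadoop may not agree on every edge case, so some records could still fail at write time.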
