Error when moving data from hive using elasticsearch-hadoop plugin to elasticsearch

Harbeer_Kadian · May 9, 2018, 8:19am

Hi all,

I raised the following issue in the past.

The solution advised there was to generate a unique id for each record to prevent duplicates from being inserted.
I am creating my primary key with the md5 function: I take all the columns that together make a record unique and use the md5 of those columns as the primary key.
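
As an aside for readers, here is a minimal Python sketch of one way an md5-based id can merge distinct records. It is an illustration under stated assumptions, not a diagnosis of this job: the helper names `naive_id` and `record_id`, the separator, and the NULL token are all hypothetical and do not come from the post. If key columns are concatenated without a separator, different rows can produce the same string and therefore the same id, and documents sharing an id overwrite each other in the index. NULL key columns are a second thing worth checking in the actual Hive query.

```python
import hashlib


def naive_id(columns):
    # Hypothetical: plain concatenation with no separator between columns.
    joined = "".join(str(c) for c in columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def record_id(columns, sep="\x1f", null_token="\\N"):
    # Hypothetical: join with a separator that should not occur in the data,
    # and map missing values to an explicit token so they stay distinct.
    parts = [null_token if c is None else str(c) for c in columns]
    return hashlib.md5(sep.join(parts).encode("utf-8")).hexdigest()


# Three rows, two of them distinct: ("ab", "c") and ("a", "bc").
rows = [("ab", "c"), ("a", "bc"), ("ab", "c")]

naive_ids = {naive_id(r) for r in rows}
safe_ids = {record_id(r) for r in rows}

# Distinct rows, distinct naive ids, distinct separator-based ids.
print(len(set(rows)), len(naive_ids), len(safe_ids))  # prints: 2 1 2
```

Comparing the number of distinct ids with the number of distinct key-column combinations on the source table is a quick way to tell whether the id scheme itself is collapsing records.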

Before this primary key fix, the record count in Elasticsearch was higher than in Hive because of duplicate insertions.
Now, after the fix, the record count in Elasticsearch is lower than expected.
I am not able to find the reason.
Here are my ES table properties.

TBLPROPERTIES(
  'es.mapping.id' = 'id',
  'es.nodes' = '%ES_NODES%',
  'es.port' = '%ES_PORT%',
  'es.index.auto.create' = 'false',
  'es.batch.size.bytes' = '1mb',
  'es.batch.size.entries' = '500',
  'es.batch.write.retry.count' = '100',
  'es.batch.write.retry.wait' = '60s',
  'es.batch.write.refresh' = 'false',
  'es.nodes.discovery' = 'false',
  'es.nodes.client.only' = 'false',
  'es.resource' = '%ES_RESOURCE%',
  'es.query' = '?q=*',
  'es.nodes.wan.only' = 'true')

Am I missing a property?

Harbeer

Harbeer_Kadian · May 12, 2018, 10:48am

Also, here is the error I see in the logs.

[HISTORY][DAG:dag_1525939456125_0003_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1525939456125_0003_1_00_000002_0, creationTime=1525941957539, allocationTime=1525941958909, startTime=1525941963683, finishTime=1525942544544, timeTaken=580861, status=FAILED, taskFailureType=NON_FATAL, errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Error while running task ( failure ) : attempt_1525939456125_0003_1_00_000002_0:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row

at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

james.baiera · May 30, 2018, 7:33pm

Unfortunately, that error message does not help too much. Can you check the job task logs to see if there's anything else that might highlight a problem?

system · June 27, 2018, 7:37pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
