Duplicate documents get inserted when moving data from hive using elasticsearch-hadoop plugin to elasticsearch

Harbeer_Kadian · December 1, 2017, 8:48am

I need to insert records from Hive into Elasticsearch, which had been working fine. For the past few days, however, we have observed that a few of the records get duplicated when inserted into Elasticsearch.

I read up on this problem and found that one possible cause is speculative execution in Hadoop, so I set the following flags to false to disable it.

Changes in mapred-site.xml:
mapred.reduce.tasks.speculative.execution = false
mapred.map.tasks.speculative.execution = false

Changes in hive-site.xml:
hive.mapred.reduce.tasks.speculative.execution = false

But even after doing this, I am still getting duplicate documents. Also, my Elasticsearch insert query is very straightforward:

insert into select * from

As I understand it, only mappers are involved in this query. In short, Elasticsearch ends up with more documents than there are records in the Hive table.

I am using an AWS EMR cluster for Hive and the AWS Elasticsearch Service for Elasticsearch.

james.baiera · December 13, 2017, 7:35pm

Do you see any failures on the executors for writing to Elasticsearch? In the event of an executor failing, all data from that task is retried, which can lead to duplicates. If the data you are ingesting is sensitive to duplication, you could specify an ID for each record to ensure that it does not duplicate data if tasks must be retried.
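As a minimal sketch of what specifying an ID could look like in Hive: elasticsearch-hadoop has an `es.mapping.id` setting that names the field to use as the document `_id`. The table names, column names, and index below (`es_events`, `hive_events`, `event_id`, `payload`, `events/event`) are hypothetical, since the original query does not show them, and connection settings such as `es.nodes` are left out.

```sql
-- Hypothetical external table backed by Elasticsearch.
-- All table, column, and index names here are illustrative assumptions.
CREATE EXTERNAL TABLE es_events (
  event_id STRING,
  payload  STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource'   = 'events/event',  -- hypothetical index/type
  'es.mapping.id' = 'event_id'       -- use this column as the document _id
);

-- A retried task rewrites the same _id values, so the documents are
-- overwritten instead of being added a second time.
INSERT INTO TABLE es_events
SELECT event_id, payload
FROM hive_events;
```

This only helps if the chosen column is unique and stable across retries. If the source table has no such column, one option might be to derive an ID from the row contents, for example with Hive's `md5()` over the concatenated columns, assuming that combination is unique per row.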

Harbeer_Kadian · December 14, 2017, 8:39am

Actually, I am using AWS Elasticsearch, which throws a gateway timeout error (504) if an operation takes more than 60 seconds to complete. So I suspect that the bulk insert sometimes takes more than 60 seconds, and in that case not all records are inserted. Since the elasticsearch-hadoop plugin always retries all the records, it creates duplicates.
This issue is specific to AWS Elasticsearch, as you cannot increase the gateway timeout beyond 60 seconds.
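If the hypothesis above is right and large bulk requests are what exceed the 60-second limit, one thing that might be worth trying is a smaller bulk size per task. The sketch below uses the elasticsearch-hadoop settings `es.batch.size.entries` and `es.batch.size.bytes`; the table name, columns, index, and the specific values are illustrative assumptions, not tested recommendations, and connection settings are omitted.

```sql
-- Hypothetical table definition; names and values are illustrative only.
CREATE EXTERNAL TABLE es_events_small_batches (
  event_id STRING,
  payload  STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource'           = 'events/event',  -- hypothetical index/type
  'es.batch.size.entries' = '500',           -- fewer documents per bulk request
  'es.batch.size.bytes'   = '512kb'          -- smaller payload per bulk request
);
```

Smaller batches mean more requests, so this trades throughput for shorter individual bulk calls. It may reduce how often the timeout is hit, but it does not by itself prevent duplicates when a retry does happen.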

james.baiera · December 14, 2017, 7:37pm

Yeah, in this situation your best bet is to ensure that a unique ID accompanies each document you index, so that duplicate writes are collapsed on retry.

