My requirement is to insert records from Hive into Elasticsearch, which had been working fine. For the past few days we have observed that a few records get duplicated when inserted into Elasticsearch.
I read up on this problem and found that one possible cause is speculative execution in Hadoop, so I set the following flags to false to disable it.
Changes in mapred-site.xml:
Changes in hive-site.xml:
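The actual snippets did not come through in the post; the flags usually set for this (assuming Hadoop 2.x property names) would be something like:

```xml
<!-- mapred-site.xml: disable speculative launches of map and reduce tasks -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>false</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>false</value>
</property>

<!-- hive-site.xml: disable speculative execution for Hive-launched reducers -->
<property>
  <name>hive.mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```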
But even after doing this, I am still seeing the duplicate-document issue. Also, my Elasticsearch insertion query is very straightforward:
insert into select * from
As per my understanding, only mappers are involved in this query. In short, ES ends up with a higher document count than the record count in the Hive table.
I am using an AWS EMR cluster for Hive and the AWS Elasticsearch Service for Elasticsearch.
Do you see any failures on the executors for writing to Elasticsearch? In the event of an executor failing, all data from that task is retried, which can lead to duplicates. If the data you are ingesting is sensitive to duplication, you could specify an ID for each record to ensure that it does not duplicate data if tasks must be retried.
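A sketch of how an explicit document ID can be set with the elasticsearch-hadoop Hive integration — the table, column, index, and endpoint names below are placeholders, assuming the source data carries a unique `id` column:

```sql
-- External Hive table backed by Elasticsearch. 'es.mapping.id' tells
-- elasticsearch-hadoop to use the 'id' column as the document _id, so a
-- retried write overwrites the existing document instead of duplicating it.
CREATE EXTERNAL TABLE es_target (
  id     STRING,
  name   STRING,
  amount DOUBLE
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource'   = 'myindex/mytype',
  'es.nodes'      = 'my-es-endpoint',
  'es.mapping.id' = 'id'
);

INSERT INTO TABLE es_target SELECT id, name, amount FROM source_table;
```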
Actually, I am using AWS Elasticsearch, which throws a gateway timeout error (504) if an operation takes more than 60 seconds to complete. So I suspect this bulk insert sometimes takes more than 60 seconds, in which case not all records are acknowledged. Since the elasticsearch-hadoop plugin then retries all the records, it creates duplicates.
This issue is particular to the AWS Elasticsearch Service, as you cannot increase the gateway timeout beyond 60 seconds.
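Under that constraint, one workaround worth trying is shrinking the bulk requests so each one completes well within the 60-second gateway limit. These are standard elasticsearch-hadoop settings; the values and the `es_target` table name here are illustrative, assuming an ES-backed Hive table:

```sql
-- Illustrative bulk-size tuning on the ES-backed table's TBLPROPERTIES:
ALTER TABLE es_target SET TBLPROPERTIES (
  'es.batch.size.entries'      = '500',  -- docs per bulk request (default 1000)
  'es.batch.size.bytes'        = '1mb',  -- bytes per bulk request (default 1mb)
  'es.batch.write.retry.count' = '3'     -- retries for rejected bulk documents
);
```

Note that smaller batches only reduce the chance of hitting the timeout; combining this with `es.mapping.id` is what actually makes the retries safe.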
Yeah, in this situation your best bet is to ensure a unique ID is attached to each document you index, so that duplicate writes are collapsed on retry.