Data duplicated in Elasticsearch when added from Hive - RESOLVED

hsk04 · July 17, 2018, 9:47pm

Hi all,

I have a table in Hive with 1,412,444 records
and the following fields:
timestamp TIMESTAMP, clientid STRING, kitnumber STRING, sessionid STRING, sessionduration INT, countofsession INT, dayssincelastsession INT, totalengegedtime INT, pagevisible INT, pagehidden INT, eventvalue INT, totalevents INT, hits INT, eventcategory STRING, eventaction STRING, eventlabel STRING, vimeouploaddate STRING, usertype STRING, page STRING, pagedepth INT, pagetitle STRING, screenname STRING, landingpage STRING, landingscreenname STRING, secondpage STRING, exitpage STRING, regionisocode STRING, city STRING, latitude DOUBLE, longitude DOUBLE, serviceprovider STRING, devicecategory STRING, javaenabled STRING, operatingsystem STRING, operatingsystemversion STRING, screencolors STRING, screenresolution STRING, browser STRING, browsersize STRING, browserversion STRING, mobiledeviceinfo STRING, mobileinputselector STRING

When I load the same data into an external Hive table linked to Elasticsearch,
the record count becomes 2,290,171 and many duplicates are created.

Can anyone help me figure out why this is happening?
Thanks in advance,
Kishore.

james.baiera · July 25, 2018, 2:50pm

Are you specifying one of the fields as an ID field when you write? Have you had any failed tasks during the write operations? Rescheduled tasks or re-run jobs can add duplicate data to Elasticsearch: without an ID field, a new ID is generated for each record sent, so a record that is sent twice is indexed twice.
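
As a rough illustration of what specifying an ID field could look like with es-hadoop's Hive integration, here is a minimal sketch. The table name `elk_with_id`, the index/type `sessions/doc`, and the `doc_id` column are hypothetical, and only a few of the original columns are shown. The fields listed above have no obvious unique key, so this sketch assumes a hash of several columns is unique enough; rows that are identical in all hashed columns would collapse into a single document. It also assumes a Hive version that provides `md5()`.

```sql
-- Hypothetical table and index names; doc_id is an assumed, derived key.
CREATE EXTERNAL TABLE elk_with_id (
  doc_id      STRING,
  `timestamp` TIMESTAMP,
  clientid    STRING,
  sessionid   STRING,
  hits        INT
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource'   = 'sessions/doc',  -- hypothetical index/type
  'es.mapping.id' = 'doc_id'         -- use this column as the document _id
);

-- Derive a deterministic ID from the row contents, so a record that is
-- sent again (task retry, re-run job) overwrites the same document
-- instead of creating a new one.
INSERT OVERWRITE TABLE elk_with_id
SELECT
  md5(concat_ws('|',
      cast(`timestamp` AS STRING), clientid, sessionid,
      cast(hits AS STRING))) AS doc_id,
  `timestamp`, clientid, sessionid, hits
FROM act;
```

With an explicit ID, the write no longer depends on how many reducers run or whether a task is retried, since repeated writes of the same row target the same `_id`.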

hsk04 · July 26, 2018, 10:44pm

Hey James,
Thanks for the reply, and sorry for the late update here. The issue is resolved.
Yes, there was no ID field.
How I fixed it:
Use this command to load the data into Elasticsearch; the 'order by 1' avoids the duplicates:

insert overwrite table elk select * from act order by 1;

This forces Hive to use a single reducer, whereas previously 4 reducers were running and the data was getting duplicated.

Thanks again James,
Kishore.

system · August 23, 2018, 10:44pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
