Data duplicated in Elasticsearch when added from Hive - RESOLVED

Hi all,

I have a table in Hive with 1,412,444 records
and the following fields:
timestamp TIMESTAMP, clientid STRING, kitnumber STRING, sessionid STRING, sessionduration INT, countofsession INT, dayssincelastsession INT, totalengegedtime INT, pagevisible INT, pagehidden INT, eventvalue INT, totalevents INT, hits INT, eventcategory STRING, eventaction STRING, eventlabel STRING, vimeouploaddate STRING, usertype STRING, page STRING, pagedepth INT, pagetitle STRING, screenname STRING, landingpage STRING, landingscreenname STRING, secondpage STRING, exitpage STRING, regionisocode STRING, city STRING, latitude DOUBLE, longitude DOUBLE, serviceprovider STRING, devicecategory STRING, javaenabled STRING, operatingsystem STRING, operatingsystemversion STRING, screencolors STRING, screenresolution STRING, browser STRING, browsersize STRING, browserversion STRING, mobiledeviceinfo STRING, mobileinputselector STRING

When I load the same data into an external Hive table linked to Elasticsearch,
the record count becomes 2,290,171 and many duplicates are created.

Can anyone help me figure out why this is happening?
Thanks in advance,
Kishore.

Are you specifying one of the fields to be an ID field when you write? Have you had any failed tasks during the write operations? It's possible that rescheduled tasks or re-run jobs are adding duplicate data to Elasticsearch: without an ID field, a new ID is generated for each record sent, so a record that is sent twice is stored twice.
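
A minimal sketch of the ID-field approach, assuming the elasticsearch-hadoop Hive storage handler is in use. The table name `elk_with_id`, the source table `act_with_id`, the target `sessions/hit`, the node address, and the `docid` column are all hypothetical, and the column list is shortened for illustration. With `es.mapping.id` set, a record that is sent again should overwrite the document with the same ID rather than create a new one.

```sql
-- Hypothetical external table backed by Elasticsearch; only a few columns shown.
CREATE EXTERNAL TABLE elk_with_id (
  docid     STRING,  -- assumed to be unique per record (not in the original table)
  clientid  STRING,
  sessionid STRING,
  hits      INT
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource'   = 'sessions/hit',    -- hypothetical target
  'es.nodes'      = 'localhost:9200',  -- hypothetical node address
  'es.mapping.id' = 'docid'            -- use docid as the document _id
);

-- Load from a hypothetical source table that already carries a unique docid.
INSERT OVERWRITE TABLE elk_with_id
SELECT docid, clientid, sessionid, hits
FROM act_with_id;
```

This only helps if `docid` really is unique per record; an ID built from non-unique fields would silently merge distinct records.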

Hey James,
Thanks for the reply, and sorry for the late update. The issue is resolved.
Yes, there was no ID field.
How I fixed it:
*** Use this command to load data into Elasticsearch; 'order by 1' avoids the duplicates ***

$ insert overwrite table elk select * from act order by 1;

This forces Hive to use only 1 reducer, whereas previously 4 reducers were running and the data was getting duplicated.
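
As a related sketch, and only an assumption about the cause here: speculative or retried task attempts (the rescheduled tasks James mentions) can each send the same rows to Elasticsearch. The session settings below are standard Hadoop and Hive properties that turn off speculative execution; the thread does not confirm that they would have removed the duplicates in this case.

```sql
-- Turn off speculative execution for this Hive session, so that a second
-- attempt of the same task is not launched while the first is still writing.
SET mapreduce.map.speculative=false;
SET mapreduce.reduce.speculative=false;
SET hive.mapred.reduce.tasks.speculative.execution=false;

-- Same load as above, without relying on a single reducer.
INSERT OVERWRITE TABLE elk SELECT * FROM act;
```

Failed tasks that get retried can still resend rows, so this does not replace an explicit ID field.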

Thanks again James,
Kishore.
