I am trying to see if there is a way to set/override the `_id` attribute while writing a DataFrame into Elasticsearch. I'm not sure whether having a column named "_id" would help. I would like to set the `_id` field so that duplicate rows are not inserted into Elasticsearch.
@Muthu_Jayakumar That is indeed the correct setting for defining an ID on a document-by-document basis. Make sure each document has a unique ID, or, if the IDs are not unique, de-duplicate and collapse the data per ID before ingesting into Elasticsearch. Since Spark and Hadoop tasks do not run in a deterministic order (only the data within a single task is correctly ordered), you may see non-deterministic write results if those steps are not taken.
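For reference, a minimal sketch of how that looks with the DataFrame writer and the `es.mapping.id` option in ES-Hadoop 5.x (the index/type name, source file, and `eventId` column here are placeholders for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("es-mapping-id-example")
  .config("es.nodes", "localhost") // adjust to your cluster
  .config("es.port", "9200")
  .getOrCreate()

// Hypothetical source; substitute your own DataFrame
val df = spark.read.json("events.json")

// Collapse duplicates per ID before writing, as noted above
val deduped = df.dropDuplicates("eventId")

deduped.write
  .format("org.elasticsearch.spark.sql")
  .option("es.mapping.id", "eventId") // column to use as the document _id
  .mode("append")
  .save("events/event") // index/type for ES 5.x
```

With `es.mapping.id` set, writing a row that carries an existing ID overwrites that document rather than creating a duplicate, since the default write operation is `index`.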
I have taken the route of generating an `_id` attribute because the same Spark DataFrame, when viewed from Elasticsearch, shows duplicate rows. Yet whenever I inspect the DataFrame on the Spark side, it looks correct. This happens for the exact same source DataFrame roughly 1 time out of 10; the other times the data in Elasticsearch looks fine.
I am using Elasticsearch 5.2.1 and Apache Spark 2.1.0.
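To illustrate the route I mention, a sketch of deriving a deterministic `_id` by hashing the identity-defining columns (the `userId` and `timestamp` column names are placeholders):

```scala
import org.apache.spark.sql.functions.{col, concat_ws, sha2}

// Build a stable document ID from the identifying columns, so repeated
// writes of the same logical row land on the same _id in Elasticsearch
val withId = df.withColumn("docId",
  sha2(concat_ws("|", col("userId"), col("timestamp")), 256))

withId.write
  .format("org.elasticsearch.spark.sql")
  .option("es.mapping.id", "docId")
  .mode("append")
  .save("events/event")
```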