Spark DataFrame -- Elasticsearch write _id

Hello there,

I am trying to see if there is a way to set/override the _id attribute while writing a DataFrame into Elasticsearch. I'm not sure whether having a column named "_id" would help. I would like to set the "_id" field so that duplicate rows are not inserted into Elasticsearch.

Please advise,
Muthu

Would es.mapping.id (ref: https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html) do the trick?
But if I have two rows with the same id field, would one row overwrite the other?
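Something like this is what I had in mind (just a sketch -- "docId" is a made-up column name and "myindex/mytype" a placeholder resource, with df being the DataFrame I already write today):

```scala
import org.elasticsearch.spark.sql._   // elasticsearch-spark (elasticsearch-hadoop) connector

// es.mapping.id should tell the connector to use the value of the
// "docId" column as the Elasticsearch document _id for each row
df.saveToEs("myindex/mytype", Map("es.mapping.id" -> "docId"))
```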

@Muthu_Jayakumar That is indeed the correct setting for defining an ID on a document-by-document basis. Make sure each document has a unique ID, or, if the IDs are not unique, de-duplicate and collapse the data per ID before ingesting into Elasticsearch. Since Spark and Hadoop tasks do not run in a deterministic order (only the data within a task is ordered), you may see non-deterministic write results if those steps are not taken.
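For reference, a minimal sketch of that de-duplicate-then-write flow (assuming an "id" column, a placeholder "myindex/mytype" resource, and the elasticsearch-spark connector on the classpath; adapt the names to your own data):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("es-dedup-write").getOrCreate()
import spark.implicits._

// Toy data with a duplicate id to illustrate the collapse step
val df = Seq((1, "first"), (1, "first-dup"), (2, "second")).toDF("id", "value")

// Collapse to a single row per id so two rows never race for the same _id
val deduped = df.dropDuplicates("id")

// Write through the connector, mapping the "id" column onto the document _id
deduped.write
  .format("org.elasticsearch.spark.sql")
  .option("es.mapping.id", "id")
  .mode(SaveMode.Append)
  .save("myindex/mytype")
```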

Hello James Baiera,

I have taken the route of coming up with an _id attribute because the same Spark DataFrame, when viewed from Elasticsearch, has duplicate rows. Yet all this time, when I look at the DataFrame on the Spark side of things, it seems about right. Again, this issue happens for the exact same source DataFrame roughly 1 out of 10 times; the other times the data in Elasticsearch looks fine.
I am using Elasticsearch 5.2.1 and Apache Spark 2.1.0.

Please advise,
Muthu

Generally it is advised to set an ID field, since task failures can cause duplicate data to be inserted when the tasks are retried.
