I am trying to see if there is a way to set/override the _id attribute while writing a dataframe into Elasticsearch. I am not sure whether having a column named "_id" would help. I would like to set the "_id" field so that duplicate rows are not inserted into Elasticsearch.

Not sure if (ref: would do the trick?
But if I have 2 rows with the same id field, would one row overwrite the other?

@Muthu_Jayakumar That is indeed the correct setting for defining an ID on a document-by-document basis. Make sure each document has a unique ID or, if the IDs are not unique, de-duplicate and collapse the data per ID before ingesting into Elasticsearch. Spark and Hadoop tasks do not run in a deterministic order; only the data within a task is correctly ordered, so you may see non-deterministic write results if those steps are not taken.
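To make the above concrete, here is a minimal PySpark-style sketch. It assumes the setting under discussion is the connector's `es.mapping.id` option and that the ID lives in a column called `id`; the helper names `es_write_options` and `write_with_id`, the node address, and the `myindex/mytype` resource are hypothetical and only there for illustration. With the default `index` write operation, a second document with the same ID replaces the first, which is why collapsing per ID beforehand matters.

```python
def es_write_options(id_column, nodes="localhost:9200"):
    """Connector options that map one DataFrame column to the document _id.

    The node address is an assumption; adjust it for your cluster.
    """
    return {
        "es.nodes": nodes,
        "es.mapping.id": id_column,     # column whose value becomes _id
        "es.write.operation": "index",  # connector default: add or replace by _id
    }


def write_with_id(df, resource, id_column, nodes="localhost:9200"):
    """Collapse rows to one per ID, then write them to Elasticsearch."""
    # Keep a single row per ID so the result does not depend on task order.
    # dropDuplicates keeps an arbitrary row per key; use an explicit
    # aggregation instead if it matters which row survives.
    deduped = df.dropDuplicates([id_column])
    (deduped.write
        .format("org.elasticsearch.spark.sql")
        .options(**es_write_options(id_column, nodes))
        .mode("append")
        .save(resource))
```

It would be called as `write_with_id(df, "myindex/mytype", "id")` on an existing Spark DataFrame, with the elasticsearch-hadoop connector jar on the classpath.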

I went the route of coming up with an _id attribute because the same Spark dataframe, when viewed from Elasticsearch, has duplicate rows. All this time, when I look at the dataframe from the Spark side, it seems about right. The issue happens for the exact same datasource/dataframe about 1 time in 10; the other times the data in Elasticsearch looks fine.
I am using Elasticsearch 5.2.1 and Apache Spark 2.1.0.

Generally it is advised to set an id field since task failures can cause duplicate data to be inserted when they are retried.
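If the data has no natural key, one option is to derive the ID from the row content so that a retried task produces the same ID and overwrites instead of duplicating. The sketch below shows the idea in plain Python; the function name `stable_doc_id` and the choice of SHA-256 with a unit-separator delimiter are assumptions for illustration, not something the connector requires. In Spark, a comparable column could be built with `sha2(concat_ws(...), 256)` from `pyspark.sql.functions`.

```python
import hashlib


def stable_doc_id(row, key_columns, sep="\x1f"):
    """Return a deterministic ID computed from the chosen columns of a row.

    `row` is any mapping from column name to value. The same values always
    yield the same ID, so re-writing the row replaces the earlier document.
    """
    # Render missing values as empty strings and join with a separator that
    # is unlikely to occur in the data, so ("ab", "c") and ("a", "bc") differ.
    parts = ["" if row[c] is None else str(row[c]) for c in key_columns]
    return hashlib.sha256(sep.join(parts).encode("utf-8")).hexdigest()
```

Rows that should count as duplicates must agree on every column listed in `key_columns`; leave out volatile columns such as ingestion timestamps.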

