I am trying to see if there is a way to set/override the `_id` attribute while writing a DataFrame into Elasticsearch. I'm not sure whether having a column named "_id" would help. I would like to set the `_id` field so that duplicate rows are not inserted into Elasticsearch.
@Muthu_Jayakumar That is indeed the correct setting for defining an ID on a document-by-document basis. Make sure each document has a unique ID, or, if the IDs are not unique, de-duplicate and collapse the data per ID before ingesting into Elasticsearch. Since Spark and Hadoop tasks do not run in a deterministic order (only the data within a single task is correctly ordered), you may see non-deterministic write results if those steps are not taken.
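For reference, a minimal sketch of how that looks with the DataFrame writer and the `es.mapping.id` option in ES-Hadoop 5.x (the index/type name, source file, and `eventId` column here are placeholders for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("es-mapping-id-example")
  .config("es.nodes", "localhost") // adjust to your cluster
  .config("es.port", "9200")
  .getOrCreate()

// Hypothetical source; substitute your own DataFrame
val df = spark.read.json("events.json")

// Collapse duplicates per ID before writing, as noted above
val deduped = df.dropDuplicates("eventId")

deduped.write
  .format("org.elasticsearch.spark.sql")
  .option("es.mapping.id", "eventId") // column to use as the document _id
  .mode("append")
  .save("events/event") // index/type for ES 5.x
```

With `es.mapping.id` set, writing a row that carries an existing ID overwrites that document rather than creating a duplicate, since the default write operation is `index`.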
I have taken the route of generating an `_id` attribute because the same Spark DataFrame, when viewed from Elasticsearch, shows duplicate rows. Yet whenever I inspect the DataFrame on the Spark side, it looks correct. This happens for the exact same source DataFrame roughly 1 time out of 10; the other times the data in Elasticsearch looks fine.
I am using Elasticsearch 5.2.1 and Apache Spark 2.1.0.
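To illustrate the route I mention, a sketch of deriving a deterministic `_id` by hashing the identity-defining columns (the `userId` and `timestamp` column names are placeholders):

```scala
import org.apache.spark.sql.functions.{col, concat_ws, sha2}

// Build a stable document ID from the identifying columns, so repeated
// writes of the same logical row land on the same _id in Elasticsearch
val withId = df.withColumn("docId",
  sha2(concat_ws("|", col("userId"), col("timestamp")), 256))

withId.write
  .format("org.elasticsearch.spark.sql")
  .option("es.mapping.id", "docId")
  .mode("append")
  .save("events/event")
```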