Hello,
Our app uses a Spark job to read from Kinesis and write to Elasticsearch.
We ran into a situation where the job failed to write to Elasticsearch but kept reading from Kinesis, which caused us to lose data.
Trying to figure out a solution, we thought about using the Kinesis checkpoint (sequence number), which is unique, as the Elasticsearch document _id.
This would mean that on any failure we could simply roll back to a known checkpoint and restart the job, which would just overwrite existing documents (if any).
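To make the idea concrete, here is a minimal sketch (plain Python, no Spark or Elasticsearch dependencies; the sequence numbers and documents are made up) of why a deterministic _id makes replays idempotent. The "index" is just a dict keyed by the Kinesis sequence number, standing in for the document _id. With the elasticsearch-hadoop connector, the equivalent would presumably be setting `es.mapping.id` to the field holding the sequence number, so that a replayed record overwrites instead of duplicating.

```python
def write_batch(index, records):
    """Upsert records keyed by their sequence number (the would-be _id)."""
    for seq, doc in records:
        index[seq] = doc  # same key -> overwrite, never a duplicate

index = {}
batch = [("seq-001", {"value": 1}), ("seq-002", {"value": 2})]

write_batch(index, batch)
# Simulate rolling back to the checkpoint and replaying the same batch:
write_batch(index, batch)

assert len(index) == 2  # the replay overwrote; no duplicate documents
```

The trade-off is that letting Elasticsearch auto-generate ids is cheaper at index time (no lookup for an existing document), so using a custom _id typically costs some write throughput.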
What do you think?
Is using a custom _id a proper solution?
How would it affect Elasticsearch indexing performance?
Thanks,
Shushu