Exactly-once guarantee for Spark Structured Streaming

The blog post that @james.baiera wrote says:

Users of ES-Hadoop can specify one of their fields to be used as the document’s ID, and Elasticsearch manages ID based writes consistently. We can’t know ahead of time what your documents’ IDs are, so it’s up to each user to ensure their streaming data contains an ID of some sort.

Are there any best practices for generating IDs for time series data while optimizing for ingest? I'm starting with UUID v4, but I see that there are better implementations, such as one and two.

I already understand that using auto-generated IDs skips the duplicate check, saving the lookup cost. What I'm looking for here is an ID generation scheme that provides an exactly-once guarantee while keeping ingest performance high.
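For context, generating a UUID v4 per record is a one-liner; a minimal sketch (the record shape and field names here are hypothetical, just for illustration):

```python
import uuid

# Hypothetical time series event; field names are assumptions.
event = {"sensor": "s-42", "ts": "2021-03-01T12:00:00Z", "value": 7.5}

# UUID v4 is random, so it guarantees (with overwhelming probability)
# a unique document ID without coordination between writers.
event["doc_id"] = str(uuid.uuid4())
```

Note that because UUID v4 is random, the ID must be attached to the record once, upstream of any retryable stage; if it's regenerated on reprocessing, a retried batch would produce new IDs and hence duplicate documents.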

@danielyahn your linked options pretty much hit the nail on the head as far as I can see. UUID v4 is a great way to ensure uniqueness of the ID without too much thinking about the implementation. Since it's so easy, I would suggest giving it a shot and seeing the performance implications before spending too much time on other ID strategies.

I've seen a few ID creation techniques across multiple projects. Making a compound ID out of other field values on a document, in my experience, provides a decent mix of design simplicity and performance, though ensuring uniqueness can be an issue if you don't know your data inside and out.
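A compound ID along those lines can be sketched as follows. The field names (`sensor`, `ts`) are assumptions for illustration; the point is that the ID is derived deterministically from fields that together uniquely identify a record, so a reprocessed batch overwrites the same documents instead of creating duplicates:

```python
import hashlib

def compound_id(record):
    """Build a deterministic document ID from existing field values.

    'sensor' and 'ts' are hypothetical field names; choose fields that
    together uniquely identify a record in your own data.
    """
    key = f"{record['sensor']}|{record['ts']}"
    # Hashing keeps the ID at a fixed length regardless of field sizes;
    # the same record always hashes to the same ID.
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

event = {"sensor": "s-42", "ts": "2021-03-01T12:00:00Z", "value": 7.5}
doc_id = compound_id(event)
```

Whether to hash or simply concatenate the raw values is a trade-off: concatenation keeps the ID human-readable, while hashing bounds its length.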

Thanks @james.baiera for the quick reply.

I'm seeing a 20% drop in performance with UUID v4, which seems to be in line with the benchmark shown here (25%).

Could you elaborate on ID creation techniques you've seen? I'm interested in learning what was done to improve the performance.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.