Exactly-once guarantee for Spark Structured Streaming

The blog post that @james.baiera wrote says:

Users of ES-Hadoop can specify one of their fields to be used as the document’s ID, and Elasticsearch manages ID based writes consistently. We can’t know ahead of time what your documents’ IDs are, so it’s up to each user to ensure their streaming data contains an ID of some sort.

Are there any best practices for generating IDs for time series data while optimizing for ingest? I'm starting with UUID v4, but I see that there are better implementations, such as one and two.

I already understand that using auto-generated IDs skips the duplicate check, saving the lookup cost. What I'm looking for here is an ID generation scheme that provides an exactly-once guarantee while keeping ingest performance high.
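For context, generating a UUID v4 per record is a one-liner; a minimal sketch (the record shape and field names here are hypothetical, just for illustration):

```python
import uuid

# Hypothetical time series event; field names are assumptions.
event = {"sensor": "s-42", "ts": "2021-03-01T12:00:00Z", "value": 7.5}

# UUID v4 is random, so it guarantees (with overwhelming probability)
# a unique document ID without coordination between writers.
event["doc_id"] = str(uuid.uuid4())
```

Note that because UUID v4 is random, the ID must be attached to the record once, upstream of any retryable stage; if it's regenerated on reprocessing, a retried batch would produce new IDs and hence duplicate documents.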

@danielyahn your linked options pretty much hit the nail on the head as far as I can see. UUID v4 is a great way to ensure uniqueness of the ID without too much thinking about the implementation. Since it's so easy, I would suggest giving it a shot and seeing the performance implications before spending too much time on other ID strategies.

I've seen a few ID creation techniques across multiple projects. Making a compound ID out of other field values on a document, in my experience, provides a decent mix of design simplicity and performance, though ensuring uniqueness can be an issue if you don't know your data inside and out.
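A compound ID along those lines can be sketched as follows. The field names (`sensor`, `ts`) are assumptions for illustration; the point is that the ID is derived deterministically from fields that together uniquely identify a record, so a reprocessed batch overwrites the same documents instead of creating duplicates:

```python
import hashlib

def compound_id(record):
    """Build a deterministic document ID from existing field values.

    'sensor' and 'ts' are hypothetical field names; choose fields that
    together uniquely identify a record in your own data.
    """
    key = f"{record['sensor']}|{record['ts']}"
    # Hashing keeps the ID at a fixed length regardless of field sizes;
    # the same record always hashes to the same ID.
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

event = {"sensor": "s-42", "ts": "2021-03-01T12:00:00Z", "value": 7.5}
doc_id = compound_id(event)
```

Whether to hash or simply concatenate the raw values is a trade-off: concatenation keeps the ID human-readable, while hashing bounds its length.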

Thanks @james.baiera for the quick reply.

I'm seeing a 20% drop in performance with UUID v4, which seems to be in line with the benchmark shown here (25%).

Could you elaborate on ID creation techniques you've seen? I'm interested in learning what was done to improve the performance.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.