Users of ES-Hadoop can specify one of their fields to be used as the document’s ID, and Elasticsearch handles ID-based writes consistently. We can’t know ahead of time what your documents’ IDs are, so it’s up to each user to ensure their streaming data contains an ID of some sort.
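As a rough illustration of the above, here is a minimal sketch of connector options that point ES-Hadoop at an ID field. `es.mapping.id` and `es.resource` are ES-Hadoop configuration settings; the index name `events` and the field name `event_id` are hypothetical placeholders, not something from this thread.

```python
# Sketch only: options one might hand to the ES-Hadoop connector.
# "events" and "event_id" are made-up names for illustration.
es_write_options = {
    "es.resource": "events",      # target index (hypothetical name)
    "es.mapping.id": "event_id",  # document field whose value becomes the _id
}

for key, value in es_write_options.items():
    print(f"{key}={value}")
```

With a setup like this, every record in the stream is expected to carry a populated `event_id` field.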
Are there any best practices for generating IDs for time series data while optimizing for ingest? I'm starting with UUID v4, but I see that there are better implementations, such as one and two.
I already understand that using auto-generated IDs skips the duplicate check, which saves the lookup cost. Here I'm looking for an ID generation scheme that provides both an exactly-once guarantee and good ingest performance.
@danielyahn your linked options pretty much hit the nail on the head as far as I can see. UUID v4 is a great way to ensure uniqueness of the ID without too much thinking about the implementation. Since it's so easy, I would suggest giving it a shot and seeing the performance implications before spending too much time on other ID strategies.
I've seen a few ID creation techniques across multiple projects. In my experience, building a compound ID out of other field values on a document provides a decent mix of ease of design and performance, though ensuring uniqueness can be an issue if you don't know your data inside and out.
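To make the compound ID idea concrete, here is a minimal sketch, assuming each document is a plain dict and that the chosen fields together identify an event uniquely. The helper `compound_id` and the field names (`host`, `metric`, `timestamp`) are hypothetical. Because the ID is derived from the data, a replayed record produces the same ID, so a retried write would target the same document instead of creating a duplicate.

```python
import hashlib


def compound_id(doc, fields):
    """Build a deterministic ID from the given fields of a document.

    Hypothetical helper: assumes the listed fields are present and that
    together they are unique per event.
    """
    parts = []
    for name in fields:
        text = str(doc[name])  # raises KeyError if a field is missing
        # Length-prefix each value so ("ab", "c") and ("a", "bc") differ.
        parts.append(f"{len(text)}:{text}")
    raw = "|".join(parts)
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


# Hypothetical time series record.
event = {"host": "web-01", "metric": "cpu", "timestamp": "2020-01-01T00:00:00Z", "value": 0.42}
key_fields = ["host", "metric", "timestamp"]

doc_id = compound_id(event, key_fields)
print(doc_id)
```

Hashing keeps the ID at a fixed length regardless of the field values. If two distinct events can share the same values for all chosen fields, they will collide, which is the uniqueness risk mentioned above.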