I was wondering if there are any performance concerns with using client-side-generated UUIDv4 IDs. I've got an application that sends documents with a randomly generated ID, and I've noticed the ID is generated using Java's UUID.randomUUID(). I remember there were some performance concerns with using this type of ID with Lucene before, and I would like to understand whether those concerns still exist.
The identifier you choose can have a significant impact on performance, especially as shards get larger. I assume you have read this blog post about it?
What type of data are you indexing? Is each event associated with a timestamp, and are these ingested in near real time?
It's an event log, so each event is associated with a timestamp. Yes, it is ingested in near real time from devices. My application is indexing-heavy and search medium/low, with roughly 80-90% filter/retrieval queries and 10-20% aggregations.
I did some experiments a long time ago and noticed that I got much better and more consistent indexing throughput if I prefixed the document IDs with epoch timestamps, so the keys were largely generated in order. The indexing throughput did not drop off nearly as quickly as shards grew in size, and second precision was sufficient for this purpose. I used a hash of the event, but adding a prefix to a UUID should work as well. I even created an experimental Logstash plugin that combined the timestamp with a hash value of the event to avoid duplicates.
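The timestamp-prefix idea can be sketched roughly like this (a minimal illustration, not the plugin itself; the 11-digit zero padding is an arbitrary choice to keep lexicographic and numeric order aligned):

```java
import java.time.Instant;
import java.util.UUID;

public class TimePrefixedId {
    // Prefix a random UUID with the current epoch second so that IDs
    // generated later sort after IDs generated earlier. Second precision
    // was sufficient in my experiments.
    public static String generate() {
        long epochSecond = Instant.now().getEpochSecond();
        // Zero-pad the timestamp to a fixed width so string order
        // matches numeric order.
        return String.format("%011d-%s", epochSecond, UUID.randomUUID());
    }

    public static void main(String[] args) {
        System.out.println(generate());
    }
}
```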
As you are specifying an ID, each indexing operation is treated as a potential update, because Elasticsearch must check whether the document already exists. It will therefore always be slower than letting Elasticsearch assign the ID automatically.
I was thinking of using a well-known hash function such as Murmur on the raw event to generate an ID. It is used in many distributed applications because it distributes IDs well. However, as you mention, introducing an ID at the client side reduces indexing throughput no matter how good the ID generation is, and UUIDv4 is probably one of the worst options for that.
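For reference, a content-derived ID might look like the sketch below. I've used SHA-256 from the JDK so the snippet is self-contained; Murmur3 is not in the standard library, but a library implementation (e.g. Guava's Hashing.murmur3_128) would be a faster, non-cryptographic substitute with equally good distribution:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ContentHashId {
    // Derive a deterministic ID from the raw event, so re-sending the
    // same event produces the same ID (useful for de-duplication).
    public static String idFor(String rawEvent) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(rawEvent.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            // Truncate to 128 bits; that already makes collisions negligible.
            return sb.substring(0, 32);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }
}
```

Note that a pure content hash is effectively random from Lucene's point of view, so it has the same ordering problem as UUIDv4 unless you add a time prefix.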
If you need to avoid duplicates, assigning an ID when the event is captured is a good option. This can be a hash or something like a UUID. You can also reduce the performance impact by making the IDs gradually increase over time, as I outlined earlier. I recall the reason being that Elasticsearch knows the minimum and maximum key held in each segment, so if new data sorts after older data, many segments do not need to be checked at all, which speeds up the existence lookup.
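Putting both points together, a rough sketch of the approach my experimental Logstash plugin took (using JDK SHA-256 here purely so the example runs without dependencies; any well-distributed hash would do):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class OrderedDedupId {
    // Combine the event's own timestamp (epoch seconds) with a hash of
    // its content: IDs rise roughly in time order, which keeps segment
    // min/max pruning effective, yet a duplicate event (same timestamp,
    // same payload) always maps to the same ID.
    public static String idFor(long epochSecond, String rawEvent) {
        return String.format("%011d-%s", epochSecond, hash(rawEvent));
    }

    private static String hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 16; i++) { // keep 128 bits of the digest
                sb.append(String.format("%02x", d[i]));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```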