Performance concerns on using UUIDv4 generated ID

Ali_Nazemian · July 12, 2018, 2:01pm

I was wondering whether there are any performance concerns with using UUIDv4 IDs generated on the client side. I have an application that sends documents with a randomly generated ID, and I've noticed the ID is generated with Java's UUID.randomUUID(). I remember there being performance concerns with this type of ID in Lucene in the past. I would like to understand whether those concerns still exist.

Christian_Dahlqvist · July 12, 2018, 2:07pm

The identifier you choose can have a significant impact on performance, especially as shards get larger. I assume you have read this blog post about it?

What type of data are you indexing? Is each event associated with a timestamp, and are these ingested in near real time?

Ali_Nazemian · July 17, 2018, 12:19am

It's event log data, so each event has a timestamp. Yes, it is ingested in near real time from devices. My application is indexing-heavy with a medium-to-low search load: roughly 80-90% filter/retrieval queries and 10-20% aggregations.

Christian_Dahlqvist · July 17, 2018, 5:50am

I did some experiments a long time ago and noticed that I got much better and more consistent indexing throughput when I prefixed the document IDs with epoch timestamps, so that the keys were largely generated in order. The indexing throughput did not drop off nearly as quickly as shards grew in size. Second precision was sufficient for this purpose. I used a hash of the event, but adding a prefix to a UUID should work as well. I even created an experimental Logstash plugin that combined the timestamp with a hash value of the event to avoid duplicates.
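To illustrate the timestamp-prefix idea, here is a minimal Java sketch. The class and method names (TimePrefixedIds, withEpochPrefix, randomId) and the zero-padded "seconds-suffix" layout are assumptions made for this example, not what the Logstash plugin does. It also assumes event times at or after the Unix epoch.

```java
import java.time.Instant;
import java.util.Locale;
import java.util.UUID;

// Hypothetical helper: prefixes an ID with the event's epoch seconds so that
// IDs created later sort after IDs created earlier.
final class TimePrefixedIds {
    private TimePrefixedIds() {}

    // Builds "<epoch seconds, zero-padded to 10 digits>-<suffix>".
    // Zero padding keeps string order consistent with numeric order.
    static String withEpochPrefix(Instant eventTime, String suffix) {
        long seconds = eventTime.getEpochSecond();
        return String.format(Locale.ROOT, "%010d-%s", seconds, suffix);
    }

    // Keeps the random UUIDv4 for uniqueness and adds the time prefix for ordering.
    static String randomId(Instant eventTime) {
        return withEpochPrefix(eventTime, UUID.randomUUID().toString());
    }
}
```

Calling TimePrefixedIds.randomId(eventTimestamp) on the client would then replace the bare UUID.randomUUID().toString() as the document ID. IDs within the same second are still in random order, which matches the observation that second precision was sufficient.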

As you are specifying an ID, each indexing operation is treated as a potential update, because Elasticsearch must first determine whether a document with that ID already exists. It will therefore always be slower than letting Elasticsearch assign the ID automatically.

Ali_Nazemian · July 17, 2018, 6:49am

I was thinking of using a well-known hash function such as Murmur on the raw event to generate an ID. It's used in lots of distributed applications to get a good distribution of IDs. However, as you mention, supplying an ID from the client side reduces indexing throughput, no matter how good the ID generation scheme is. Using UUIDv4 is probably one of the worst options for that.
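As a rough sketch of the hash-of-the-raw-event approach: Murmur is not part of the JDK, so this example swaps in SHA-256 from java.security.MessageDigest purely to stay self-contained. The class and method names (EventHashIds, sha256Hex) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: derives a deterministic ID from the raw event text,
// so re-sending the same event produces the same ID.
final class EventHashIds {
    private EventHashIds() {}

    // Returns the SHA-256 digest of the UTF-8 encoded event as 64 lowercase hex characters.
    // SHA-256 is a stand-in here; the thread discusses Murmur.
    static String sha256Hex(String rawEvent) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(rawEvent.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

Because the ID depends only on the event content, indexing the same event twice overwrites the existing document instead of creating a duplicate. Like a UUIDv4, though, the hash output is effectively random with respect to time, so on its own it does nothing for ordering.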

Christian_Dahlqvist · July 17, 2018, 6:56am

If you need to avoid duplicates, assigning an ID when the event is captured is a good option. This can be a hash or something like a UUID. You will also reduce the performance impact by making the IDs gradually increase over time as I outlined earlier. I recall the reason for this being that Elasticsearch knows the min and max key that is held in each segment, so if new data is sorted after older data, a lot of segments do not need to be searched, which speeds up the process.
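The min/max reasoning can be pictured with a toy model. This is only a sketch of the idea as recalled above, not how Lucene actually stores or searches segments; the Segment class and segmentsToCheck method are invented for the example.

```java
import java.util.List;

// Toy model: each "segment" remembers only the smallest and largest ID it holds.
final class SegmentRangeSketch {
    private SegmentRangeSketch() {}

    static final class Segment {
        final String minId;
        final String maxId;

        Segment(String minId, String maxId) {
            this.minId = minId;
            this.maxId = maxId;
        }

        // An ID outside [minId, maxId] cannot be in this segment, so it can be skipped.
        boolean mightContain(String id) {
            return minId.compareTo(id) <= 0 && id.compareTo(maxId) <= 0;
        }
    }

    // Counts the segments that would still have to be searched for the given ID.
    static int segmentsToCheck(List<Segment> segments, String id) {
        int count = 0;
        for (Segment segment : segments) {
            if (segment.mightContain(id)) {
                count++;
            }
        }
        return count;
    }
}
```

With time-prefixed IDs, older segments cover ranges such as "1531800000-a" to "1531803599-z", and a newly generated ID like "1531810800-q" sorts after all of them, so none needs to be checked. With random IDs, every segment tends to span nearly the whole key space (say "0a1f" to "f93b"), so a lookup for "7c9e" has to consider all of them.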

