Lucene hates UUID v4. A real issue or a myth?

Background: we've been working on a new ES install, happily bulk indexing data from a DB, and then from an upstream JSON API. The documents all have integer IDs, which makes POST/GET updates simple.

Last week, the upstream JSON API people decided to change the PK/ID for their document resources to UUID v4, in text form rather than binary. This type of UUID is "truly" random. Our Elasticsearch system downstream can adapt, but there are open questions.

According to what I heard at Elastic Ops training, this type of ID potentially has negative performance implications for the underlying Lucene indexes that make up the "shards" in an Elastic index.

Googling around brings up one post, where the explanation has to do with how Lucene can exploit the high bits of integer IDs to optimize disk seeks, which it can't do when the ID is a textual UUID.

The disk seek part we don't really care about, because we're definitely running our Elasticsearch on SSDs with fairly high IOPS.

The open questions are then:

  • does the use of random IDs like UUID v4 slow down Lucene when it's merging segments?
  • do they have a general negative impact on performance during search processing, either as single-doc GETs or in more complex queries?

What's the real story? Many people want to know!

Thanks in advance!

In practice, there is a lot more heavy work in Elasticsearch (indexing, search) than merely looking up doc IDs in Lucene segments, so you should not worry about this too much. The effect is too small to observe.

If you really care about Lucene and looking up IDs in segments, then yes, UUID v4 IDs are more likely to be read from disk than from RAM, because at 128 bits they are long and inflate the terms dictionary. Read more at Changing Bits: Choosing a fast unique identifier (UUID) for Lucene
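
To make the size point concrete, here is a tiny sketch. It is illustrative only: Lucene's actual terms dictionary uses prefix compression, so these are not on-disk numbers, just the relative footprint of the usual _id representations.

```java
import java.util.UUID;

// Sketch: compare the raw footprint of common _id representations.
// Bigger terms mean a bigger terms dictionary, which is less likely
// to stay resident in RAM.
public class IdSizes {
    public static void main(String[] args) {
        UUID id = UUID.randomUUID();   // version 4, fully random
        String text = id.toString();   // 36-char hex-and-dashes form

        System.out.println("64-bit long : " + Long.BYTES + " bytes");        // 8
        System.out.println("binary UUID : " + 2 * Long.BYTES + " bytes");    // 16
        System.out.println("textual UUID: " + text.getBytes().length + " bytes"); // 36
    }
}
```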

And one last point: you should still care about disk seeks and disk reads even with SSDs. An SSD is still on the order of 1000x slower than RAM. RAM is what counts.

My answers to your questions:

  • no, random IDs do not slow down Lucene per se. It's the bit length that matters for fast prefix matching in block trees, not the randomness. UUIDs are 128 bits long, and that is worse than a 64-bit long.

  • no, there is no general negative impact of random IDs. You will observe in your tests that the tiny step of looking up IDs in Lucene is not the determining factor in overall search/query/get performance, especially once you transport many docs over the network to a client.

Thank you Jörg for the quick reply.
I am familiar with that McCandless post; it's one of the few on the web that discusses the topic at length. What I gather from it is that it's not so much the seek per se, but the predictability of the ID that allows skipping segments altogether.
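
To illustrate what I mean by skipping, here is a toy model (not Lucene's actual code): if IDs are monotonic, each segment covers a narrow ID range and a point lookup can reject most segments with a cheap min/max comparison, while a random UUID falls inside every segment's range.

```java
import java.util.List;

// Toy model of segment skipping. With monotonically increasing IDs, each
// segment holds a narrow, mostly disjoint ID range, so a point lookup can
// skip segments on a cheap min/max check. Random UUIDs overlap every
// segment's range, so no segment can ever be skipped this way.
public class SegmentSkipping {
    record Segment(String minId, String maxId) {
        boolean mightContain(String id) {
            return minId.compareTo(id) <= 0 && id.compareTo(maxId) <= 0;
        }
    }

    public static void main(String[] args) {
        // Sequential IDs: older segments hold older ranges.
        List<Segment> segments = List.of(
            new Segment("00000001", "00010000"),
            new Segment("00010001", "00020000"),
            new Segment("00020001", "00030000"));

        String lookup = "00025000";
        for (Segment s : segments) {
            System.out.println(s + " -> " + (s.mightContain(lookup) ? "search" : "skip"));
        }
    }
}
```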

Also found this discussion on the ES GitHub repo:

Two things follow from that discussion. First, regular secure-random UUIDs do have an impact on performance, according to the simple benchmark published in that thread: approximately a factor of 10x slower.

Second, a patch was later merged that switched the auto-generated IDs that ES creates when you POST new docs to a time-based UUID, which provides a more predictable prefix.
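
As I understand it, the idea is roughly a Flake-style ID: a timestamp in the leading bytes, so IDs minted close together share a predictable prefix. This sketch is just that idea, not ES's actual generator.

```java
import java.security.SecureRandom;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a Flake-style time-based ID: a millisecond timestamp in the
// leading hex chars gives IDs generated close together a shared prefix,
// while a per-process random node part plus a counter keeps them unique.
// Note the SecureRandom cost is paid once at startup, not per ID.
public class TimeBasedId {
    private static final long NODE = new SecureRandom().nextLong();
    private static final AtomicInteger COUNTER = new AtomicInteger();

    public static String next() {
        long ts = System.currentTimeMillis();
        int seq = COUNTER.incrementAndGet() & 0xFFFFFF;
        return String.format("%012x-%016x-%06x", ts, NODE, seq);
    }

    public static void main(String[] args) {
        System.out.println(next()); // same 12-hex-char timestamp prefix...
        System.out.println(next()); // ...for IDs minted in the same millisecond
    }
}
```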

For us, the solution might be a hybrid scenario: let ES generate its internal time-based UUIDs for documents, which ultimately end up in Lucene's segments; store the upstream API's UUID in a separate, non-ID field for lookups that need it; and use the internal ID whenever possible.
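
Something like this is what I have in mind. The index and field names (docs, upstream_id) are placeholders, upstream_id would be mapped as a non-analyzed keyword so term queries match exactly, and the _doc endpoint here just stands in for whatever type name we end up using.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of the hybrid approach: POST without an _id so ES assigns its own
// time-based internal ID, and keep the upstream UUID as an ordinary field
// that can still be found with a term query.
public class HybridIndexing {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String uuid = "3f2b4c9e-8d1a-4f6b-9c2d-7e5a1b0c4d8e"; // example upstream ID

        // Let ES auto-generate _id; store the upstream UUID as a normal field.
        HttpRequest index = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/docs/_doc"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(
                "{\"upstream_id\": \"" + uuid + "\", \"title\": \"example\"}"))
            .build();
        System.out.println(client.send(index, HttpResponse.BodyHandlers.ofString()).body());

        // Look the document up by the upstream UUID via a term query.
        HttpRequest search = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/docs/_search"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(
                "{\"query\": {\"term\": {\"upstream_id\": \"" + uuid + "\"}}}"))
            .build();
        System.out.println(client.send(search, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```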

Note, the issue on ES was caused by the runtime cost of java.util.SecureRandom, which is a different issue from using UUIDs as such.

Your scenario is that UUIDs are already given in your documents. Therefore, there is nothing to worry about regarding the creation of (auto-generated) IDs.

The predictability of IDs is given when they share a prefix. This is the case for sequences, or for short IDs. Random UUIDs do not share many prefixes, you are right about that.
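
You can see this with a tiny experiment (an illustration, not Lucene's block-tree code): compare how many leading characters adjacent sequential IDs share versus two random UUIDs.

```java
import java.util.UUID;

// Sketch: measure the shared prefix between adjacent IDs. Sequential IDs
// share long prefixes, which block trees compress well; random UUIDs share
// almost nothing.
public class PrefixSharing {
    static int sharedPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length()), i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    public static void main(String[] args) {
        // Zero-padded sequential IDs, as a database sequence would produce.
        String seqA = String.format("%020d", 1_000_001L);
        String seqB = String.format("%020d", 1_000_002L);
        System.out.println("sequential: " + sharedPrefix(seqA, seqB) + " chars shared"); // 19

        String u1 = UUID.randomUUID().toString();
        String u2 = UUID.randomUUID().toString();
        System.out.println("uuid v4   : " + sharedPrefix(u1, u2) + " chars shared");     // usually 0
    }
}
```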

But again, this is not time-consuming relative to the whole workload. There is no factor of 10x slowdown from the ES API perspective, because that was just a microbenchmark isolating ID generation(!) from the rest of the work that ES has to do.

I suggest using the given UUID in the _id field and letting ES do the work for you. Do not over-engineer at the wrong end just by looking at the _id field. Look at the total node performance at indexing time. Starting from ES 2.x, the segment-merging process has many automatic adjustments which will influence the overall performance significantly.

Jörg, again many thanks and appreciation to you for shedding light on this.

Yes, the UUID is generated upstream, by a Python API, so I wasn't concerned with the cost of generating them. Presumably that cost is amortized by the hardware scaling factor on their side. And you're also correct, of course, that there isn't a discernible or predictable prefix.

Fortunately, we include extensive metrics in everything we do (using StatsD, Graphite, Grafana), and have been capturing good baseline numbers on all the main aspects: indexing, searching, etc. If this move to UUIDs has a material impact, we should be able to quantify it.

What you're saying makes sense: as a fraction of the work the whole ES stack has to do, ID generation and management is a small percentage. We'll measure it and report back here. Our benchmarks don't isolate ID handling the way that GitHub post does, so they should be more realistic.

I was thinking that a (truly) random UUID in a compact binary or higher-base encoding, rather than the 36-character text form, would work well if the impact were worth it.

If we're able to adopt ES 2.x within our timeline, it sounds like it might be a better idea.