Background: we've been working on a new ES install, happily bulk indexing data from a DB, and then from an upstream JSON API. The documents all have integer IDs, which makes POST/GET updates simple.
Last week, the upstream JSON API people decided to change the PK/ID for their document resources to UUID v4, in text form rather than binary. This type of UUID is "truly" random. Our downstream Elastic system can adapt, but there are open questions.
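To be clear about what "text form" means here, this is a small Python sketch (not our actual pipeline code; the variable names are just for illustration) showing the same UUID v4 as the 36-character string we now receive and as the 16-byte binary value it encodes:

```python
import uuid

# Generate a random (version 4) UUID, the kind the upstream API now uses.
doc_id = uuid.uuid4()

# Text form: 32 hex digits plus 4 hyphens, i.e. 36 characters.
text_form = str(doc_id)

# Binary form: the same value packed into 16 raw bytes.
binary_form = doc_id.bytes

print(text_form, len(text_form))
print(len(binary_form))
print(doc_id.version)
```

Apart from a few fixed version and variant bits, the value is random, so consecutive documents get IDs with no relationship to each other.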
According to what I heard at Elastic Ops training, this type of ID potentially has negative performance implications for the underlying Lucene indexes that make up the "shards" in an Elastic index.
Googling around brings up one post, where the explanation has to do with how Lucene can exploit the high bits of integers to optimize disk seeks, which it can't do when the ID type is a UUID text.
The disk seek part we don't really care about, because we're definitely running our Elastic on SSDs with fairly high IOPS.
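My rough reading of the "high bits" argument is that sequential IDs sort next to each other and share long leading prefixes, while random UUIDs share almost nothing. The sketch below only measures that property on plain strings; it is an illustration under my own assumptions (zero-padded integer IDs, 10,000 documents, made-up helper names) and says nothing definitive about what Lucene does internally:

```python
import os
import uuid


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that two IDs have in common."""
    return len(os.path.commonprefix([a, b]))


def mean_adjacent_prefix(ids):
    """Sort the IDs and average the shared prefix length of each neighbouring pair."""
    ordered = sorted(ids)
    lengths = [shared_prefix_len(a, b) for a, b in zip(ordered, ordered[1:])]
    return sum(lengths) / len(lengths)


N = 10_000  # assumed sample size, purely illustrative

# Old scheme: integer IDs, zero-padded here so they sort in numeric order as text.
sequential_ids = [f"{i:010d}" for i in range(N)]

# New scheme: random UUID v4 values in text form.
random_ids = [str(uuid.uuid4()) for _ in range(N)]

seq_mean = mean_adjacent_prefix(sequential_ids)
uuid_mean = mean_adjacent_prefix(random_ids)

print(f"sequential IDs: mean shared prefix {seq_mean:.2f} of 10 chars")
print(f"UUID v4 IDs:    mean shared prefix {uuid_mean:.2f} of 36 chars")
```

On any run I would expect neighbouring sequential IDs to share most of their characters and neighbouring UUIDs to share only a few. Whether that difference actually matters on SSDs is exactly what I'm asking.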
The open questions are then:
- does the use of random IDs like UUID v4 slow down Lucene when it's consolidating segments?
- do they have a general negative impact on performance during search processing, either as single-doc GETs or in more complex queries?
What's the real story? Many people want to know!
Thanks in advance!