Greetings! I'd like to verify my understanding of what happens when indexing a document with a custom _id. I know that ES performs an existence check in that case, but I'm curious about the details. As I understand it, ES does the following:
determine the shard (by hashing the id or the provided routing key) where the document may be stored
perform a lookup within that shard
So, roughly, my question is: "there is no full index scan in such a case, right?"
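To make the two steps above concrete, here is a minimal sketch in Python. It is a toy model, not Elasticsearch code: `pick_shard` and `index_doc` are hypothetical names, each shard is modelled as a plain dict, and `zlib.crc32` is only a stand-in for the hash function Elasticsearch actually uses. The point it illustrates is that the existence check touches only the one shard selected by the hash.

```python
import zlib


def pick_shard(doc_id, num_primary_shards, routing=None):
    """Pick the owning shard from the routing key (defaults to the id).

    Assumption: crc32 is a stand-in hash chosen for illustration only.
    """
    key = routing if routing is not None else doc_id
    return zlib.crc32(key.encode("utf-8")) % num_primary_shards


def index_doc(shards, doc_id, source, routing=None):
    """Index a document with a caller-supplied id into a toy cluster.

    `shards` is a list of dicts, one dict per primary shard (hypothetical model).
    Only the owning shard is consulted for the existence check.
    """
    shard_no = pick_shard(doc_id, len(shards), routing)
    shard = shards[shard_no]
    existed = doc_id in shard  # lookup within the owning shard only
    shard[doc_id] = source
    return shard_no, ("updated" if existed else "created")


shards = [dict() for _ in range(3)]
first = index_doc(shards, "order-42", {"total": 10})
second = index_doc(shards, "order-42", {"total": 12})
```

Indexing the same id twice lands on the same shard both times, and the second call reports an update rather than a create.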
There is not even anything you could call a "full index scan" within the single shard that owns the document ID. It's just checking the terms dictionary -- think something like a B-tree -- so it's only logarithmic effort.
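As a rough illustration of why this is logarithmic rather than a scan, here is a small Python sketch. It assumes a simplified model: each segment of a shard keeps its IDs as a sorted list, and a binary search stands in for the terms dictionary lookup. The names `segment_has_id` and `shard_has_id` are hypothetical, and the real Lucene data structure is different and more compact; only the cost behaviour is the same in spirit.

```python
from bisect import bisect_left


def segment_has_id(sorted_ids, doc_id):
    """Binary search in one segment's sorted id list: O(log n) comparisons."""
    i = bisect_left(sorted_ids, doc_id)
    return i < len(sorted_ids) and sorted_ids[i] == doc_id


def shard_has_id(segments, doc_id):
    """A shard is modelled as a list of segments; check each one in turn.

    Assumption: this toy model ignores deletes and any per-segment shortcuts.
    """
    return any(segment_has_id(seg, doc_id) for seg in segments)


segments = [sorted(["a1", "b7", "c3"]), sorted(["d9", "e2"])]
found = shard_has_id(segments, "e2")
missing = shard_has_id(segments, "zz")
```

In this model the cost grows with the number of segments and with the logarithm of the IDs per segment, which is consistent with larger shards making the check somewhat more expensive without ever turning it into a full scan.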
Note: In high-throughput ingest use cases, self-generated IDs combined with larger shard sizes might negatively affect performance. While using your own _id provides flexibility, it could impact ingest efficiency. So it is "not free", but it is as efficient as possible.
The "unusual" part of this is that you might get an "uneven" ingest pattern: when the shards are brand new there is little to no impact; as large shards approach their limit the impact can become larger; then rollover happens and the pattern resets.
Yes, I understand that auto-generated ids are more efficient. In my case, I have a system that already generates ids on the application side, and I wanted to verify my understanding before taking some optimization steps (we cannot switch to auto-generation quickly).