I have read that indexing is faster when no _id is specified (Elasticsearch does not have to check for duplicates).
However, is it worthwhile to index with a chosen _id if the document has to be retrieved multiple times in the future? Is it faster to get a document by _id than to search on an "id" field in the _source section?
Thanks,
Edit: I cannot add a tag like "performance", so I put it in the title.
It's faster to GET a document directly by ID than to search for it, mainly because a get-by-ID can go straight to the shard that holds the document and look it up there, whereas a search has to query all the shards in parallel and look up the term on each one to find the document. The search probably won't be exceptionally slow, but the get-by-ID should always be faster.
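To make the routing difference concrete, here is a toy model in Python. It is only a sketch and not Elasticsearch code: `ToyIndex`, `shard_for`, and the shard count of 5 are made up for illustration, and `zlib.crc32` stands in for the real routing hash (Elasticsearch uses a Murmur3-based hash of the routing value, which defaults to the _id). It just counts how many shards each kind of lookup has to involve.

```python
import zlib

NUM_SHARDS = 5  # hypothetical number of primary shards


def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    # Stand-in for the routing hash; the real one is Murmur3-based.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


class ToyIndex:
    """Toy model: each shard is a dict mapping _id to _source."""

    def __init__(self, num_shards: int = NUM_SHARDS):
        self.shards = [{} for _ in range(num_shards)]

    def index(self, doc_id: str, source: dict) -> None:
        # The _id decides which shard stores the document.
        self.shards[shard_for(doc_id, len(self.shards))][doc_id] = source

    def get(self, doc_id: str):
        # Get-by-ID: compute the shard from the _id and ask only that shard.
        shard = self.shards[shard_for(doc_id, len(self.shards))]
        return shard.get(doc_id), 1

    def search_by_field(self, field: str, value):
        # Search on a field in _source: nothing says which shard holds the
        # match, so every shard is asked. (A real shard would do a term
        # lookup in its inverted index; this toy simply scans.)
        hits = []
        for shard in self.shards:
            hits.extend(s for s in shard.values() if s.get(field) == value)
        return hits, len(self.shards)


idx = ToyIndex()
for i in range(100):
    idx.index(f"user-{i}", {"id": f"user-{i}", "n": i})

doc, shards_touched = idx.get("user-42")
hits, shards_searched = idx.search_by_field("id", "user-42")
print(shards_touched, shards_searched)
```

Both calls find the same document, but the get involves one shard while the search fans out to all of them. With a single shard the gap mostly disappears, which fits the point that the search won't be exceptionally slow.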
I wouldn't worry too much about the performance of an autogenerated ID vs. a user-defined ID. There's a bit of a difference, but it isn't immense. I tell people that if you have a natural ID for a document, use it, because it's likely you'll want to get-by-ID at some point. But if there's no natural ID, then go ahead and use the autogenerated one.