We are implementing search over candidate data of about 10 million records, and we plan to use the candidate's email address as the primary key for each record in Elasticsearch.
Can we use emailAddress as the _id in Elasticsearch, or should we only use Elasticsearch's auto-generated random ids?
Is there any performance impact if we use a custom id?
Does it affect indexing or searching performance?
Are we overloading Elasticsearch by using a custom id?
I think you should use the candidate email as an id.
Auto-generated ids are faster to index, since Elasticsearch can safely assume that a generated id does not exist in the index yet and can skip the lookup for an existing document. However, this usually makes sense only for append-only data such as log lines. In your case I suspect you will occasionally need to replace or update some documents, and predictable ids will make that easier.
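To illustrate why predictable ids help with replaces and updates, here is a minimal sketch that builds a bulk request body in which each action line carries the email address as `_id`. The index name `candidates`, the `name` field and the helper `build_bulk_body` are assumptions made for the example; only `emailAddress` comes from the question. Sending the body to the `_bulk` endpoint is left out.

```python
import json


def build_bulk_body(index_name, candidates):
    """Build an NDJSON bulk body that uses each candidate's email as _id.

    index_name and the document fields other than emailAddress are
    hypothetical; adapt them to your own mapping.
    """
    lines = []
    for doc in candidates:
        # Action line: an "index" action with an explicit _id replaces any
        # existing document that has the same id.
        action = {"index": {"_index": index_name, "_id": doc["emailAddress"]}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk format requires a newline after every line, including the last.
    return "\n".join(lines) + "\n"


body = build_bulk_body(
    "candidates",
    [{"emailAddress": "alice@example.com", "name": "Alice"}],
)
```

Re-sending the same candidate with changed fields overwrites the earlier document instead of creating a duplicate, which is the behaviour you would otherwise have to build yourself with auto-generated ids.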
The choice between auto-generated and custom ids has no impact on search speed.
Ids are prefix-compressed, so if you want to optimize for disk space, you might want to put the domain name before the user name in the id, since I would expect many email addresses to share the same domain name.
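A small sketch of that reordering, assuming a hypothetical helper `email_to_doc_id` and an id layout of `domain@user`. Lowercasing only the domain is also an assumption of this example: domain names are case-insensitive, while the part before the `@` is left untouched.

```python
def email_to_doc_id(email):
    """Return an id with the domain first so that ids share long prefixes.

    The "domain@user" layout is an illustrative choice, not an
    Elasticsearch convention.
    """
    # Split on the last "@" so the domain is always the final part.
    local, sep, domain = email.strip().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {email!r}")
    return f"{domain.lower()}@{local}"


ids = [email_to_doc_id(e) for e in ("alice@Example.com", "bob@example.com")]
```

Both ids in the example now start with `example.com@`, so they share a prefix that the original addresses did not. The original address can still be stored in a regular field such as emailAddress for display and search.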