I am planning to store events in Elasticsearch. The index can hold around 100 million events at any point in time. To de-dupe events, I am planning to create an _id of about 100 chars by concatenating the fields below:
entity_id - UUID (37 chars) +
event_creation_time (30 chars) +
event_type (30 chars)
This store will see normal reads and writes along with aggregation queries (no updates or deletes).
Could you please let me know whether there would be any performance impact or other side effects of using such long string _id values instead of the default IDs?
A very long uid field is not great, especially if the uids can share a large common prefix: it will slow down the uid lookup that ES must do for every document you index or delete.
One alternative that might be worth exploring/benchmarking is to use a hash of entity_id+event_creation_time+event_type. Or, if keeping them sequential is helpful, you could do event_creation_time+hash(entity_id+event_type). Also, I'd imagine those 30 chars for the timestamp could be expressed more compactly.
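A minimal sketch of both ID schemes in Python, for benchmarking purposes only. The function names, the choice of SHA-1, the URL-safe base64 encoding, and the hex-millisecond timestamp prefix are all assumptions made for illustration, not something Elasticsearch requires. It also assumes event_creation_time is a timezone-aware datetime.

```python
import base64
import hashlib
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
SEP = "\x1f"  # unit separator keeps field boundaries unambiguous


def _digest(*fields: str) -> str:
    """SHA-1 of the joined fields as 27 URL-safe base64 chars (padding stripped)."""
    raw = SEP.join(fields).encode("utf-8")
    digest = hashlib.sha1(raw).digest()  # 20 bytes
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def hashed_id(entity_id: str, event_creation_time: datetime, event_type: str) -> str:
    """Scheme 1: hash of all three fields. Deterministic, so duplicates collide."""
    return _digest(entity_id, event_creation_time.isoformat(), event_type)


def time_prefixed_id(entity_id: str, event_creation_time: datetime, event_type: str) -> str:
    """Scheme 2: compact timestamp prefix + hash of the remaining fields.

    The prefix is epoch milliseconds as 11 zero-padded hex chars, so IDs
    sort lexicographically in time order.
    """
    millis = (event_creation_time - EPOCH) // timedelta(milliseconds=1)
    return f"{millis:011x}" + _digest(entity_id, event_type)
```

Either value could then be passed as the document _id at index time. Both are far shorter than the roughly 100-char concatenation (27 and 38 chars here), and the second keeps IDs roughly sequential. A hash does carry a small collision risk, so whether SHA-1 or a longer digest is appropriate depends on how costly a false duplicate would be.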