Performance impact of using a 100-character string as the _id in Elasticsearch


(Harish Kommaraju) #1

Hi,

I am planning to store events in Elasticsearch. The index can hold around 100 million events at any point in time. To de-dupe events, I am planning to build an _id of about 100 characters by concatenating the fields below:
entity_id - UUID (37 chars) +
event_creation_time (30 chars) +
event_type (30 chars)

This store will serve normal reads and writes along with aggregate queries (no updates or deletes).
Could you please let me know whether using such long string _ids instead of the default IDs would have any performance impact or other side effects?

Thanks,
Harish


(Mark Walkom) #2

This will help http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html


(Michael McCandless) #3

A very long uid field is not great, especially if the uids can share a large common prefix: it will slow down the uid lookup that ES must do for every document you index or delete.


(Harish Kommaraju) #4

Thanks for your replies. In that case, is there any way to handle de-duplication across multiple columns during inserts?


(Loren Siebert) #5

One alternative that might be worth exploring/benchmarking is to use a hash of entity_id+event_creation_time+event_type. Or, if keeping them sequential is helpful, you could do event_creation_time+hash(entity_id+event_type). Also, I'd imagine those 30 chars for the timestamp could be expressed more compactly.
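
To make the two variants concrete, here is a minimal Python sketch. The function names, the choice of SHA-1 with URL-safe base64, the millisecond resolution, and the unit-separator delimiter are all assumptions for illustration, not anything Elasticsearch requires; the resulting string would simply be supplied as the document's _id when indexing. Whether either form is actually faster than the 100-character concatenation would still need benchmarking.

```python
import base64
import hashlib
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_millis(ts: datetime) -> int:
    """Whole milliseconds since the Unix epoch (ts must be timezone-aware)."""
    return (ts - EPOCH) // timedelta(milliseconds=1)


def short_hash(*parts: str) -> str:
    """27-character URL-safe base64 of the SHA-1 digest of the joined parts."""
    # Join with a separator that should not occur in the fields, so that
    # ("ab", "c") and ("a", "bc") do not hash to the same value.
    raw = "\x1f".join(parts).encode("utf-8")
    digest = hashlib.sha1(raw).digest()  # 20 bytes
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def hashed_id(entity_id: str, created: datetime, event_type: str) -> str:
    """Variant 1: hash of entity_id + event_creation_time + event_type."""
    return short_hash(entity_id, str(epoch_millis(created)), event_type)


def time_prefixed_id(entity_id: str, created: datetime, event_type: str) -> str:
    """Variant 2: compact timestamp followed by hash(entity_id + event_type)."""
    # 12 zero-padded hex digits keep ids sortable by creation time.
    return f"{epoch_millis(created):012x}" + short_hash(entity_id, event_type)


if __name__ == "__main__":
    created = datetime(2015, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
    entity = "123e4567-e89b-12d3-a456-426614174000"  # made-up example UUID
    print(hashed_id(entity, created, "login"))
    print(time_prefixed_id(entity, created, "login"))
```

Because both functions are deterministic, re-sending the same event produces the same _id, which is what gives the de-duplication. The 12 hex digits in the second variant also show how the 30-character timestamp can shrink.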

