Terms aggregation by long faster than by string?


(Dor Rotman) #1

Hello,
I have an ID field with very high cardinality, currently implemented as a string, containing content similar to a GUID.
I want to run terms aggregations on a large data set and would like to optimize this.

I read this article that discusses ordinals and was wondering:
If I change the field implementation to a long, would that help in terms of query speed / memory usage / anything?

Thanks.


(Boaz Leskes) #2

Internally, strings and numbers are both treated as bytes. What matters is how the bytes are distributed. Numbers also have the downside of being chopped into multiple terms to speed up range searches (see https://www.elastic.co/guide/en/elasticsearch/reference/2.1/precision-step.html ). In general GUIDs are fine, but check this blog for advice on how to optimize them: http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html
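
To make the "chopped into multiple terms" point concrete, here is a rough Python sketch of the idea behind precision_step. It is only an illustration: the function name prefix_terms is made up, the default step of 16 for longs is assumed from the linked docs, and the real Lucene encoding (sign handling, byte layout) is not reproduced.

```python
def prefix_terms(value, precision_step=16, bits=64):
    """Return (shift, prefix) pairs, one per term indexed for a numeric value.

    Each extra term drops `shift` low-order bits, so a range query can match
    a few coarse prefixes instead of enumerating every exact value.
    """
    value &= (1 << bits) - 1  # treat the value as an unsigned 64-bit integer
    return [(shift, value >> shift) for shift in range(0, bits, precision_step)]


# Hypothetical ID value, chosen only to make the prefixes easy to read.
terms = prefix_terms(0x0123456789ABCDEF)
for shift, prefix in terms:
    print(f"shift={shift:2d} prefix={prefix:#x}")
```

With a step of 16 a single long ends up as four terms, while a step at least as large as the bit width leaves just the full-precision value.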


(Dor Rotman) #3

Hi Boaz, thanks for the info.
I will look into the formatting of whatever type I choose. I see precision_step is only for Elasticsearch 2.0+. Are there any recommendations for v1.7?

Also, I'm still wondering about this (from the link I posted):

Can switching to a numeric type help the performance of my query as well?

P.S. It's important to note I'm doing terms aggregation on a contextual ID field that is shared between multiple records (i.e. "session_id"), not on the unique document ID itself, if that matters.
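
For reference, the kind of request I mean looks roughly like the sketch below. It only builds the JSON body; the helper name terms_agg_body, the aggregation name and the bucket count are placeholders, not my real setup.

```python
import json


def terms_agg_body(field, size=10):
    """Build a search body that runs a terms aggregation on `field`."""
    return {
        "size": 0,  # skip the hits, only the aggregation buckets are needed
        "aggs": {
            "by_" + field: {"terms": {"field": field, "size": size}}
        },
    }


# "session_id" is the shared, high-cardinality field described above.
body = terms_agg_body("session_id", size=100)
print(json.dumps(body, indent=2))
```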
Thanks.

