How are the keys sorted in ES?

mp2893 · March 27, 2012, 5:24pm

Hi,
I am rather new to ES, so this might be a very simple question.

I played with HBase for quite a while, and I know that HBase stores
keys lexicographically.
So generating keys as a time series is not recommended: when a
query is based on a time range, this kind of key leads to users
accessing only a couple of nodes, since the keys are not distributed
evenly across all the nodes.

I am worried that the same situation might happen with ES. I am
currently storing documents that have time-based IDs (such as
20120327_23_20471029471097097).
Should I change the document IDs so that documents will be stored
evenly across all the nodes?

Thanks,
Ed

Igor_Motov · March 27, 2012, 6:15pm

By default, Elasticsearch routes each document using a hash of its document ID,
so you should be fine. You can control this process by specifying the
"routing" parameter with your request. See the Routing section of this
page for more details:
http://www.elasticsearch.org/guide/reference/api/index_.html
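
To make the point above concrete, here is a minimal sketch of why time-based IDs that sort next to each other can still spread evenly once they are hashed, and how a shared routing value changes that. It is only an illustration: `stand_in_hash`, `pick_shard`, and the shard count of 5 are hypothetical names and values, and the SHA-256 based hash is a stand-in, not the hash function Elasticsearch actually uses.

```python
import hashlib
from collections import Counter
from typing import Optional

NUM_SHARDS = 5  # hypothetical shard count, chosen only for this illustration


def stand_in_hash(key: str) -> int:
    """Stand-in hash (first 4 bytes of SHA-256); NOT Elasticsearch's real hash."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:4], "big")


def pick_shard(doc_id: str, routing: Optional[str] = None,
               num_shards: int = NUM_SHARDS) -> int:
    """Hash the routing value if one is given, otherwise the document id."""
    key = routing if routing is not None else doc_id
    return stand_in_hash(key) % num_shards


# Time-based ids that all share the same date/hour prefix, as in the question.
ids = [f"20120327_23_{n:017d}" for n in range(10_000)]

# Default behaviour: ids that sort next to each other still spread over all shards.
by_id = Counter(pick_shard(doc_id) for doc_id in ids)

# Custom routing: every document sent with the same routing value shares one shard.
by_routing = Counter(pick_shard(doc_id, routing="20120327") for doc_id in ids)

print("hashed by id:     ", dict(sorted(by_id.items())))
print("hashed by routing:", dict(sorted(by_routing.items())))
```

With the default behaviour the per-shard counts should come out roughly equal, while a single routing value puts every document on one shard. That is the trade-off to keep in mind if you decide to route by date.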

mp2893 · March 28, 2012, 1:50pm

Thanks, Igor, for the quick reply.
The Routing section was just what I needed to read.

Regards,
Ed

mp2893 · March 28, 2012, 1:54pm

One more question,

Since ES is hashing the document id, is there a limit to the number of
documents I can store?
Or is ES using some kind of infinitely expanding hashing algorithm?

Igor_Motov · March 28, 2012, 2:22pm

The hash of the ID is only used to determine which shard the document is
assigned to, so hashing itself places no limit on the number of documents
you can store. During indexing, Elasticsearch assigns each document to the
shard numbered Math.abs(hash(id) % numberOfShards). You can specify the
number of shards when you create the index, but it cannot be changed
afterwards.
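
As a rough illustration of the formula above, the sketch below computes a shard number the same way and then counts how many IDs would land on a different shard if the shard count went from 5 to 6. This is an assumption-laden example: `signed_hash32` and `shard_number` are hypothetical helpers, the SHA-256 based hash is a stand-in for whatever hash Elasticsearch really uses, and the shard counts are arbitrary.

```python
import hashlib


def signed_hash32(doc_id: str) -> int:
    """Stand-in signed 32-bit hash (from SHA-256); not Elasticsearch's own hash."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big", signed=True)


def shard_number(doc_id: str, number_of_shards: int) -> int:
    """Python equivalent of Java's Math.abs(hash(id) % numberOfShards).

    Java's % keeps the sign of the dividend, so abs() is applied before the
    modulo here to get the same non-negative result.
    """
    return abs(signed_hash32(doc_id)) % number_of_shards


ids = [f"20120327_23_{n:017d}" for n in range(10_000)]

# Shard each id would get in a 5-shard index versus a 6-shard index.
moved = sum(1 for doc_id in ids if shard_number(doc_id, 5) != shard_number(doc_id, 6))
print(f"{moved} of {len(ids)} ids would map to a different shard")
```

Most IDs map to a different shard once the modulus changes, which is one way to see why the number of shards is fixed at index creation: existing documents would no longer be found where the formula says they should be.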

Shay talks about this in some detail in this video, starting at 25:24.

mp2893 · March 29, 2012, 11:52pm

Thanks again for the info.
I will definitely check out the video!!

Regards,
Ed
