How are the keys sorted in ES?


(mp2893) #1

Hi,
I am rather new to ES, so this might be a very simple question.

I played with HBase for quite a while, and I know that HBase stores
keys lexicographically.
So generating keys in time-series order is not recommended, because when a
query is based on a time range, this kind of key usage leads to users
accessing only a couple of nodes, since the keys are not distributed
evenly across all the nodes.

I am worried that the same situation might happen with ES. I am
currently storing documents that have time-based IDs (such as
20120327_23_20471029471097097).
Should I change the document IDs so that documents will be stored
evenly across all the nodes?

Thanks,
Ed


(Igor Motov) #2

By default, elasticsearch uses a hash of the document id, so you should
be fine. You can also control this process by specifying the
"routing" parameter with your request. See the Routing section of this
page http://www.elasticsearch.org/guide/reference/api/index_.html for more
details.
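
As a rough illustration of the "routing" parameter mentioned above, here is a minimal Python sketch that only builds the URL of an index request and does not send anything. The helper name index_url, the host, and the index and type names ("logs", "event") are hypothetical, and the /{index}/{type}/{id} path shape is an assumption based on the index API page linked above; check the documentation for the version you run.

```python
from urllib.parse import quote, urlencode


def index_url(base, index, doc_type, doc_id, routing=None):
    """Build the URL for an index request, optionally with a routing value.

    Hypothetical helper for illustration only; it does not contact a cluster.
    """
    # Percent-encode each path segment so unusual ids stay in one segment.
    path = "/".join(quote(part, safe="") for part in (index, doc_type, doc_id))
    url = f"{base.rstrip('/')}/{path}"
    if routing is not None:
        # "routing" is passed as a query-string parameter on the request.
        url += "?" + urlencode({"routing": routing})
    return url


doc_id = "20120327_23_20471029471097097"

# Default behaviour: no routing value, so the document id is what gets hashed.
print(index_url("http://localhost:9200", "logs", "event", doc_id))

# Explicit routing: here the day is used as a (hypothetical) routing value.
print(index_url("http://localhost:9200", "logs", "event", doc_id, routing="20120327"))
```

Documents indexed with the same routing value end up on the same shard, so a custom routing value should only be used when that grouping is actually wanted.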


(mp2893) #3

Thanks, Igor, for the quick reply.
The Routing section was just what I needed to read.

Regards,
Ed


(mp2893) #4

One more question,

Since ES is hashing the document id, is there a limit to the number of
documents I can store?
Or is ES using some kind of infinitely expanding hashing algorithm?


(Igor Motov) #5

The hash of the id is only used to determine which shard the document is
assigned to. During indexing, elasticsearch assigns each document to the shard
with the number Math.abs(hash(id) % numberOfShards). You can specify the
number of shards when you create the index, but it cannot be changed
afterwards.
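
The formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: zlib.crc32 is a stand-in hash and not the function elasticsearch actually uses, the name shard_for is hypothetical, and the ids are made up to share the time-based prefix from the original question. The sketch counts how many ids land on each shard and then shows why the shard count is fixed at index creation: with a different numberOfShards, the same id usually maps to a different shard.

```python
import zlib
from collections import Counter


def shard_for(doc_id: str, number_of_shards: int) -> int:
    """Return the shard number for doc_id, following abs(hash(id) % numberOfShards)."""
    # Stand-in hash for illustration; NOT the hash function elasticsearch uses.
    h = zlib.crc32(doc_id.encode("utf-8"))
    # In Python, % with a positive modulus is never negative; abs() is kept only
    # to mirror the Math.abs(...) in the formula, which matters in Java.
    return abs(h % number_of_shards)


# Hypothetical time-based ids that all share the same date/hour prefix.
ids = [f"20120327_23_{n:017d}" for n in range(10_000)]

# How many of these ids land on each of 5 shards.
counts = Counter(shard_for(doc_id, 5) for doc_id in ids)
for shard in sorted(counts):
    print(f"shard {shard}: {counts[shard]} documents")

# The same ids recomputed for 6 shards: many would have to live elsewhere,
# so lookups by id would miss documents indexed under the old shard count.
moved = sum(shard_for(doc_id, 5) != shard_for(doc_id, 6) for doc_id in ids)
print(f"{moved} of {len(ids)} ids map to a different shard with 6 shards")
```

Because only the shard number depends on the hash, the hash does not by itself cap how many documents can be indexed; it only decides which shard each one goes to.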

Shay talks about it in some detail in this video
http://www.elasticsearch.org/videos/2011/08/09/road-to-a-distributed-searchengine-berlinbuzzwords.html
starting at 25:24.


(mp2893) #6

Thanks again for the info.
I will definitely check out the video!!

Regards,
Ed

