On Tuesday, March 27, 2012 1:24:39 PM UTC-4, mp2893 wrote:
Hi,
I am rather new to ES, so this might be a very simple question.
I played with HBase for quite a while, and I know that HBase stores
keys lexicographically.
So generating keys in time-series order is not recommended: when a
query is based on a time range, this kind of key leads to users
accessing only a couple of nodes, since the keys are not distributed
evenly across all the nodes.
I am worried that the same situation might happen with ES. I am
currently storing documents that have time-based IDs (such as
20120327_23_20471029471097097).
Should I change the document IDs so that documents will be stored
evenly across all the nodes?
By default, Elasticsearch uses a hash of the document ID to decide where
a document is stored, so you should be fine. You can control this process
by specifying the "routing" parameter with your request. See the Routing
section of this page for more details:
http://www.elasticsearch.org/guide/reference/api/index_.html
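As a rough illustration of passing the "routing" parameter, here is a minimal Python sketch that only builds the URL for an index request. The host, the index name "logs", the type name "event" and the routing value "user42" are all made up for the example, and the /index/type/id layout is assumed from the Elasticsearch version of this thread's era, so check it against the docs for your version.

```python
from urllib.parse import quote, urlencode


def index_url(base, index, doc_type, doc_id, routing=None):
    """Build the URL for indexing one document (illustrative helper, not an ES API)."""
    url = f"{base}/{quote(index)}/{quote(doc_type)}/{quote(doc_id)}"
    if routing is not None:
        # With a routing value, the shard is chosen from this value
        # instead of from the document ID.
        url += "?" + urlencode({"routing": routing})
    return url


# Hypothetical host, index, type and routing value.
print(index_url("http://localhost:9200", "logs", "event",
                "20120327_23_20471029471097097", routing="user42"))
```

Documents sent with the same routing value end up on the same shard, so a routing value that changes slowly (such as the date) would bring back the hotspot problem the default hashing avoids.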
On Wednesday, March 28, 2012 9:54:51 AM UTC-4, mp2893 wrote:
One more question,
Since ES is hashing the document ID, is there a limit to the number of
documents I can store?
Or is ES using some kind of infinitely expanding hashing algorithm?
The hash of the ID is only used to determine which shard the document is
assigned to, so the hashing itself places no limit on the number of
documents. During indexing, Elasticsearch assigns each document to the
shard with the number Math.abs(hash(id) % numberOfShards). You can specify
the number of shards during index creation, but it cannot be changed
afterwards.
Shay talks about this in some detail in this video, starting at 25:24.
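To see why time-based IDs should not pile up on one shard under this scheme, here is a small Python sketch of the hash-modulo formula. CRC32 is used purely as a stand-in hash (an assumption; the hash function Elasticsearch actually uses is different), and the generated IDs and the shard count of 5 are made up for the example.

```python
import zlib
from collections import Counter


def shard_for(doc_id, number_of_shards):
    """Pick a shard as hash(id) % numberOfShards, with CRC32 as a stand-in hash."""
    # zlib.crc32 returns an unsigned value in Python 3, so no abs() is needed.
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


def distribution(doc_ids, number_of_shards):
    """Count how many of the given IDs land on each shard."""
    return Counter(shard_for(d, number_of_shards) for d in doc_ids)


# Hypothetical sequential IDs in the style of 20120327_23_20471029471097097.
ids = [f"20120327_23_{n:017d}" for n in range(10_000)]
counts = distribution(ids, 5)
for shard in sorted(counts):
    print(shard, counts[shard])
```

Even though the IDs are nearly identical and sorted, you would expect the per-shard counts to come out roughly similar, because the shard depends on the hash and not on the lexicographic order of the ID. The sketch also shows why the shard count is fixed: changing number_of_shards would change the result for existing IDs.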