On Wednesday, March 26, 2014 9:30:36 PM UTC+11, Han JU wrote:
Hi,
We've indexed 25M documents into a single index of 128 shards with 1
replica.
The routing parameter is set to a path in the document, which is an int
value:
    _routing: {
        path: "some_id",
        required: true
    }
In our 25M documents, there are 167 distinct values of this "some_id", and
we expected Elasticsearch to route these documents evenly across all
shards.
But we've found that, out of 128 shards, 53 are empty (0 documents
inside), so about 40% of the shards are not used at all.
My questions:
- Is this normal? Did we miss something in configuring routing?
- Does this imbalanced shard utilization affect indexing speed?
We can confirm that all documents are correctly indexed and that routing
works (when searching with routing, only 1 shard responds, with the correct
answer).
The Elasticsearch version is v1.0.1.
On Wednesday, March 26, 2014 11:39:15 AM UTC+1, Kevin Wang wrote:
ES computes the shard id as hash(routing) % number of shards. In your case
there are only 167 distinct routing values for 128 shards, so I think it's
very likely that they map to fewer than 128 distinct shard ids. Some of the
shards will therefore not receive any data.
Kevin
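As a rough cross-check of Kevin's point, here is a minimal sketch (not from the thread) that assumes an idealized hash which places each distinct routing value uniformly at random on one of the shards. Even under that assumption, 167 values over 128 shards are expected to leave roughly 35 shards empty, so some empty shards are normal. The 53 observed is noticeably more, which hints that the real hash is less uniform for these ids. The class and method names are made up for illustration.

```java
// Back-of-the-envelope check: how many shards stay empty if k distinct
// routing values are spread uniformly at random over n shards?
// ASSUMPTION: an ideal uniform hash, which the real one may not be.
class ExpectedEmptyShards {

    // One given shard receives none of the k values with probability
    // (1 - 1/n)^k, so the expected number of empty shards is n * (1 - 1/n)^k.
    static double expectedEmptyShards(int shards, int distinctRoutingValues) {
        return shards * Math.pow(1.0 - 1.0 / shards, distinctRoutingValues);
    }

    public static void main(String[] args) {
        double empty = expectedEmptyShards(128, 167);
        System.out.printf("expected empty shards: %.1f of 128%n", empty);
    }
}
```

The formula only depends on the number of distinct routing values, not on the 25M documents, because all documents sharing a routing value land on the same shard.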
On Wednesday, March 26, 2014 9:49:10 PM UTC+11, Han JU wrote:
Thanks for your reply.
As far as I know, in Java the basic hash value of a positive int/long is
just the value itself (our ids are small values like 1125, 345, etc.).
So I calculated some_id % 128 and got 116 distinct values. But in
reality far fewer shards are in use.
Does Elasticsearch use some special hash function?
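The check described above can be sketched as follows. This is only an illustration of the reasoning, under the assumption that the hash of an int id is the id itself. The third id (1253) is a made-up value added to show a collision, and the class and method names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

// Counts which shards would be used if the routing hash were simply the
// int value itself (Integer.hashCode returns the value unchanged).
// ASSUMPTION: ids are non-negative, so the plain % operator is enough.
class IdentityModCheck {

    static Set<Integer> shardsUsed(int[] ids, int shards) {
        Set<Integer> used = new TreeSet<>();
        for (int id : ids) {
            used.add(Integer.hashCode(id) % shards);
        }
        return used;
    }

    public static void main(String[] args) {
        // 1125 and 345 are the sample ids from the post; 1253 is hypothetical.
        int[] ids = {1125, 345, 1253};
        System.out.println("shards used: " + shardsUsed(ids, 128));
    }
}
```

Running the same count over the real 167 ids is what gives the 116 distinct values mentioned in the post.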
On Wednesday, March 26, 2014 11:58:36 AM UTC+1, Kevin Wang wrote:
There are two hash function implementations,
org.elasticsearch.cluster.routing.operation.hash.djb.DjbHashFunction
and org.elasticsearch.cluster.routing.operation.hash.simple.SimpleHashFunction;
the default is DjbHashFunction. You can try computing the hash yourself
using DjbHashFunction.DJB_HASH(your id).
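To experiment without an Elasticsearch jar on the classpath, the following is a minimal stand-alone sketch of the classic djb2 string hash combined with the hash % shards step described earlier in the thread. It is a re-implementation for illustration only: that it matches DjbHashFunction.DJB_HASH in 1.0.1 bit for bit, that the routing value is hashed in its string form, and that the result is passed through Math.abs are all assumptions. The class and method names are hypothetical.

```java
// Stand-alone djb2 sketch for experimenting with routing values.
// ASSUMPTION: this mirrors the default hash; verify against the real
// DjbHashFunction.DJB_HASH before relying on the numbers.
class DjbRoutingSketch {

    // Classic djb2: start at 5381, then hash = hash * 33 + character.
    static int djb2(String value) {
        long hash = 5381;
        for (int i = 0; i < value.length(); i++) {
            hash = ((hash << 5) + hash) + value.charAt(i);
        }
        return (int) hash; // keep the low 32 bits
    }

    // Shard id as hash % number of shards, made non-negative.
    static int shardFor(String routing, int shards) {
        return Math.abs(djb2(routing) % shards);
    }

    public static void main(String[] args) {
        // Sample ids from the thread, hashed as strings.
        for (String routing : new String[] {"1125", "345"}) {
            System.out.println(routing + " -> shard " + shardFor(routing, 128));
        }
    }
}
```

Feeding all distinct ids through shardFor and counting the distinct results shows how many shards this hash can reach for a given id set.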
On Wednesday, March 26, 2014 12:51:24 PM UTC+1, Han JU wrote:
Thanks a lot Kevin.
That DJB_HASH result makes it clear for us. I think we'll just use the id
value as the hash.
Do you guys know how to plug in a custom hash function?
On Thursday, March 27, 2014 9:11:39 PM UTC+11, Han JU wrote:
Do you guys know how to plug in a custom hash function for the routing
parameter?
On Thursday, March 27, 2014 11:27:24 AM UTC+1, Kevin Wang wrote:
You can add a class that implements HashFunction and set the setting
"cluster.routing.operation.hash.type" to that class.
Regards,
Kevin
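A minimal sketch of what such a class might look like, using the "id value as hash" idea from earlier in the thread. To keep it self-contained, it declares a local stand-in for the HashFunction interface. The two method signatures are an assumption about the 1.0.x interface, and NumericIdHashFunction is a hypothetical name. Check the real interface in the Elasticsearch source before adapting this.

```java
// Local stand-in so the sketch compiles on its own.
// ASSUMPTION: the real HashFunction interface has these two methods.
interface HashFunction {
    int hash(String routing);

    int hash(String type, String id);
}

// Uses a numeric routing value directly as its hash.
class NumericIdHashFunction implements HashFunction {

    @Override
    public int hash(String routing) {
        try {
            return Integer.parseInt(routing);
        } catch (NumberFormatException e) {
            // Not an int: fall back to the ordinary String hash.
            return routing.hashCode();
        }
    }

    @Override
    public int hash(String type, String id) {
        // Combine the type with the id hash.
        return 31 * type.hashCode() + hash(id);
    }

    public static void main(String[] args) {
        HashFunction f = new NumericIdHashFunction();
        System.out.println("hash(\"1125\") = " + f.hash("1125"));
    }
}
```

With this sketch the routing value "1125" hashes to 1125, so the shard would be 1125 % 128 = 101 if the shard id is still computed as hash % number of shards. The setting would then presumably take the fully qualified class name. How the nodes load the class is not covered in the thread. Note also that documents indexed under one hash function would not be found by routed requests under another, so switching on an existing index would likely require reindexing.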
On Thursday, March 27, 2014 9:32:28 PM UTC+11, Han JU wrote:
Thanks, but can you explain in some more detail?
Say I have the class in MyHashFunction.java, how can I put it into ES? Do I
need to modify the ES source code, or is there another way?