Shards/routing documents imbalance problem

Han_JU · March 26, 2014, 10:30am

Hi,

We've indexed 25M documents into a single index of 128 shards with 1
replica.
The routing parameter is set to a path in the document, which is an int
value:

_routing: {
path: "some_id"
required: true
}

In out 25M documents, there's 167 distinct values of this "some_id" and in
our expectation, ElasticSearch will route these documents evenly across all
shards.
But we've found out that, out of 128 shards, there are 53 empty shards
(with 0 document inside), or, 40% of the shards are not used at all.

My question:

is this normal? Do we miss something in configuring routing?
does this imbalanced shard utilization affect indexing speed?

We can confirm that all documents are correctly indexed and routing works
(when searching with routing only 1 shard responds with the correct answer).
ElasticSearch version is v1.0.1.

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f7e86ae2-14a8-4381-842d-53adf59ec43d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Kevin_Wang · March 26, 2014, 10:39am

ES will get the shard id by hash(routing)%num of shards, in your case,
there are only 167 distinct values but have 128 shards, I think it's highly
possible there is less than 128 distinct hash values. So some of the shard
will not have any data.

Kevin

On Wednesday, March 26, 2014 9:30:36 PM UTC+11, Han JU wrote:

Hi,

We've indexed 25M documents into a single index of 128 shards with 1
replica.
The routing parameter is set to a path in the document, which is an int
value:

_routing: {
path: "some_id"
required: true
}

In out 25M documents, there's 167 distinct values of this "some_id" and in
our expectation, Elasticsearch will route these documents evenly across all
shards.
But we've found out that, out of 128 shards, there are 53 empty shards
(with 0 document inside), or, 40% of the shards are not used at all.

My question:

is this normal? Do we miss something in configuring routing?

does this imbalanced shard utilization affect indexing speed?

We can confirm that all documents are correctly indexed and routing works
(when searching with routing only 1 shard responds with the correct answer).
Elasticsearch version is v1.0.1.

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/d8961b19-e024-4a04-83fa-48f4cd44b7c4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Han_JU · March 26, 2014, 10:49am

Thanks for your reply.

As far as I know, in Java, basic hash value of positive int/long value is
just themselves (our ids are small values like 1125, 345 etc).
So I calculated some_id % 128, and I got 116 distinct values. But in
reality there's a lot less shards in use.

Does Elasticsearch use some special hash function?

在 2014年3月26日星期三UTC+1上午11时39分15秒，Kevin Wang写道：

ES will get the shard id by hash(routing)%num of shards, in your case,
there are only 167 distinct values but have 128 shards, I think it's highly
possible there is less than 128 distinct hash values. So some of the shard
will not have any data.

Kevin

On Wednesday, March 26, 2014 9:30:36 PM UTC+11, Han JU wrote:

Hi,

We've indexed 25M documents into a single index of 128 shards with 1
replica.
The routing parameter is set to a path in the document, which is an int
value:

_routing: {
path: "some_id"
required: true
}

In out 25M documents, there's 167 distinct values of this "some_id" and
in our expectation, Elasticsearch will route these documents evenly across
all shards.
But we've found out that, out of 128 shards, there are 53 empty shards
(with 0 document inside), or, 40% of the shards are not used at all.

My question:

is this normal? Do we miss something in configuring routing?

does this imbalanced shard utilization affect indexing speed?

We can confirm that all documents are correctly indexed and routing works
(when searching with routing only 1 shard responds with the correct answer).
Elasticsearch version is v1.0.1.

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f54da2a0-0b7a-49fb-b852-b2200c862b4d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Kevin_Wang · March 26, 2014, 10:58am

There are two hash functions
implementation org.elasticsearch.cluster.routing.operation.hash.djb.DjbHashFunction
and org.elasticsearch.cluster.routing.operation.hash.simple.SimpleHashFunction,
default is DjbHashFunction. You can try get the hash by
using DjbHashFunction.DJB_HASH(you id)

On Wednesday, March 26, 2014 9:49:10 PM UTC+11, Han JU wrote:

Thanks for your reply.

As far as I know, in Java, basic hash value of positive int/long value is
just themselves (our ids are small values like 1125, 345 etc).
So I calculated some_id % 128, and I got 116 distinct values. But in
reality there's a lot less shards in use.

Does Elasticsearch use some special hash function?

在 2014年3月26日星期三UTC+1上午11时39分15秒，Kevin Wang写道：

ES will get the shard id by hash(routing)%num of shards, in your case,
there are only 167 distinct values but have 128 shards, I think it's highly
possible there is less than 128 distinct hash values. So some of the shard
will not have any data.

Kevin

On Wednesday, March 26, 2014 9:30:36 PM UTC+11, Han JU wrote:

Hi,

We've indexed 25M documents into a single index of 128 shards with 1
replica.
The routing parameter is set to a path in the document, which is an
int value:

_routing: {
path: "some_id"
required: true
}

In out 25M documents, there's 167 distinct values of this "some_id" and
in our expectation, Elasticsearch will route these documents evenly across
all shards.
But we've found out that, out of 128 shards, there are 53 empty shards
(with 0 document inside), or, 40% of the shards are not used at all.

My question:

is this normal? Do we miss something in configuring routing?

does this imbalanced shard utilization affect indexing speed?

We can confirm that all documents are correctly indexed and routing
works (when searching with routing only 1 shard responds with the correct
answer).
Elasticsearch version is v1.0.1.

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/9c8a9eba-2f0f-452f-98ac-34463da7f496%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Han_JU · March 26, 2014, 11:51am

Thanks a lot Kevin.

That DJB_HASH result makes it clear for us. I think we'll just use the id
value as hash.
Do you guys know how to plugin a custom hash function?

在 2014年3月26日星期三UTC+1上午11时58分36秒，Kevin Wang写道：

There are two hash functions
implementation org.elasticsearch.cluster.routing.operation.hash.djb.DjbHashFunction
and org.elasticsearch.cluster.routing.operation.hash.simple.SimpleHashFunction,
default is DjbHashFunction. You can try get the hash by
using DjbHashFunction.DJB_HASH(you id)

On Wednesday, March 26, 2014 9:49:10 PM UTC+11, Han JU wrote:

Thanks for your reply.

As far as I know, in Java, basic hash value of positive int/long value is
just themselves (our ids are small values like 1125, 345 etc).
So I calculated some_id % 128, and I got 116 distinct values. But in
reality there's a lot less shards in use.

Does Elasticsearch use some special hash function?

在 2014年3月26日星期三UTC+1上午11时39分15秒，Kevin Wang写道：

ES will get the shard id by hash(routing)%num of shards, in your case,
there are only 167 distinct values but have 128 shards, I think it's highly
possible there is less than 128 distinct hash values. So some of the shard
will not have any data.

Kevin

On Wednesday, March 26, 2014 9:30:36 PM UTC+11, Han JU wrote:

Hi,

We've indexed 25M documents into a single index of 128 shards with 1
replica.
The routing parameter is set to a path in the document, which is an
int value:

_routing: {
path: "some_id"
required: true
}

In out 25M documents, there's 167 distinct values of this "some_id" and
in our expectation, Elasticsearch will route these documents evenly across
all shards.
But we've found out that, out of 128 shards, there are 53 empty shards
(with 0 document inside), or, 40% of the shards are not used at all.

My question:

is this normal? Do we miss something in configuring routing?

does this imbalanced shard utilization affect indexing speed?

We can confirm that all documents are correctly indexed and routing
works (when searching with routing only 1 shard responds with the correct
answer).
Elasticsearch version is v1.0.1.

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/29da8ba5-ccab-40b0-9174-b6522408dd51%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Han_JU · March 27, 2014, 10:11am

Do you guys know how to plug in a custom hash function for routing
parameter?

在 2014年3月26日星期三UTC+1下午12时51分24秒，Han JU写道：

Thanks a lot Kevin.

That DJB_HASH result makes it clear for us. I think we'll just use the id
value as hash.
Do you guys know how to plugin a custom hash function?

在 2014年3月26日星期三UTC+1上午11时58分36秒，Kevin Wang写道：

There are two hash functions
implementation org.elasticsearch.cluster.routing.operation.hash.djb.DjbHashFunction
and org.elasticsearch.cluster.routing.operation.hash.simple.SimpleHashFunction,
default is DjbHashFunction. You can try get the hash by
using DjbHashFunction.DJB_HASH(you id)

On Wednesday, March 26, 2014 9:49:10 PM UTC+11, Han JU wrote:

Thanks for your reply.

As far as I know, in Java, basic hash value of positive int/long value
is just themselves (our ids are small values like 1125, 345 etc).
So I calculated some_id % 128, and I got 116 distinct values. But in
reality there's a lot less shards in use.

Does Elasticsearch use some special hash function?

在 2014年3月26日星期三UTC+1上午11时39分15秒，Kevin Wang写道：

ES will get the shard id by hash(routing)%num of shards, in your case,
there are only 167 distinct values but have 128 shards, I think it's highly
possible there is less than 128 distinct hash values. So some of the shard
will not have any data.

Kevin

On Wednesday, March 26, 2014 9:30:36 PM UTC+11, Han JU wrote:

Hi,

We've indexed 25M documents into a single index of 128 shards with 1
replica.
The routing parameter is set to a path in the document, which is an
int value:

_routing: {
path: "some_id"
required: true
}

In out 25M documents, there's 167 distinct values of this "some_id"
and in our expectation, Elasticsearch will route these documents evenly
across all shards.
But we've found out that, out of 128 shards, there are 53 empty shards
(with 0 document inside), or, 40% of the shards are not used at all.

My question:

is this normal? Do we miss something in configuring routing?

does this imbalanced shard utilization affect indexing speed?

We can confirm that all documents are correctly indexed and routing
works (when searching with routing only 1 shard responds with the correct
answer).
Elasticsearch version is v1.0.1.

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/24ea97ab-3795-4def-b284-33742e30a908%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Kevin_Wang · March 27, 2014, 10:27am

You can add a class that implements HashFunction and set the setting
"cluster.routing.operation.hash.type“ to that class.

Regards,
Kevin

On Thursday, March 27, 2014 9:11:39 PM UTC+11, Han JU wrote:

Do you guys know how to plug in a custom hash function for routing
parameter?

在 2014年3月26日星期三UTC+1下午12时51分24秒，Han JU写道：

Thanks a lot Kevin.

That DJB_HASH result makes it clear for us. I think we'll just use the id
value as hash.
Do you guys know how to plugin a custom hash function?

在 2014年3月26日星期三UTC+1上午11时58分36秒，Kevin Wang写道：

There are two hash functions
implementation org.elasticsearch.cluster.routing.operation.hash.djb.DjbHashFunction
and org.elasticsearch.cluster.routing.operation.hash.simple.SimpleHashFunction,
default is DjbHashFunction. You can try get the hash by
using DjbHashFunction.DJB_HASH(you id)

On Wednesday, March 26, 2014 9:49:10 PM UTC+11, Han JU wrote:

Thanks for your reply.

As far as I know, in Java, basic hash value of positive int/long value
is just themselves (our ids are small values like 1125, 345 etc).
So I calculated some_id % 128, and I got 116 distinct values. But in
reality there's a lot less shards in use.

Does Elasticsearch use some special hash function?

在 2014年3月26日星期三UTC+1上午11时39分15秒，Kevin Wang写道：

ES will get the shard id by hash(routing)%num of shards, in your case,
there are only 167 distinct values but have 128 shards, I think it's highly
possible there is less than 128 distinct hash values. So some of the shard
will not have any data.

Kevin

On Wednesday, March 26, 2014 9:30:36 PM UTC+11, Han JU wrote:

Hi,

We've indexed 25M documents into a single index of 128 shards with 1
replica.
The routing parameter is set to a path in the document, which is an
int value:

_routing: {
path: "some_id"
required: true
}

In out 25M documents, there's 167 distinct values of this "some_id"
and in our expectation, Elasticsearch will route these documents evenly
across all shards.
But we've found out that, out of 128 shards, there are 53 empty
shards (with 0 document inside), or, 40% of the shards are not used at all.

My question:

is this normal? Do we miss something in configuring routing?

does this imbalanced shard utilization affect indexing speed?

We can confirm that all documents are correctly indexed and routing
works (when searching with routing only 1 shard responds with the correct
answer).
Elasticsearch version is v1.0.1.

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/9817fbd7-5e75-4557-807f-276df5b3120d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Han_JU · March 27, 2014, 10:32am

Thanks but can you explain some detail?
Say I have the class in MyHashFunction.java, how could I put it in ES? I
need to modify the code of ES or ?

在 2014年3月27日星期四UTC+1上午11时27分24秒，Kevin Wang写道：

You can add a class that implements HashFunction and set the setting
"cluster.routing.operation.hash.type“ to that class.

Regards,
Kevin

On Thursday, March 27, 2014 9:11:39 PM UTC+11, Han JU wrote:

Do you guys know how to plug in a custom hash function for routing
parameter?

在 2014年3月26日星期三UTC+1下午12时51分24秒，Han JU写道：

Thanks a lot Kevin.

That DJB_HASH result makes it clear for us. I think we'll just use the
id value as hash.
Do you guys know how to plugin a custom hash function?

在 2014年3月26日星期三UTC+1上午11时58分36秒，Kevin Wang写道：

There are two hash functions
implementation org.elasticsearch.cluster.routing.operation.hash.djb.DjbHashFunction
and org.elasticsearch.cluster.routing.operation.hash.simple.SimpleHashFunction,
default is DjbHashFunction. You can try get the hash by
using DjbHashFunction.DJB_HASH(you id)

On Wednesday, March 26, 2014 9:49:10 PM UTC+11, Han JU wrote:

Thanks for your reply.

As far as I know, in Java, basic hash value of positive int/long value
is just themselves (our ids are small values like 1125, 345 etc).
So I calculated some_id % 128, and I got 116 distinct values. But in
reality there's a lot less shards in use.

Does Elasticsearch use some special hash function?

在 2014年3月26日星期三UTC+1上午11时39分15秒，Kevin Wang写道：

ES will get the shard id by hash(routing)%num of shards, in your
case, there are only 167 distinct values but have 128 shards, I think it's
highly possible there is less than 128 distinct hash values. So some of the
shard will not have any data.

Kevin

On Wednesday, March 26, 2014 9:30:36 PM UTC+11, Han JU wrote:

Hi,

We've indexed 25M documents into a single index of 128 shards with 1
replica.
The routing parameter is set to a path in the document, which is
an int value:

_routing: {
path: "some_id"
required: true
}

In out 25M documents, there's 167 distinct values of this "some_id"
and in our expectation, Elasticsearch will route these documents evenly
across all shards.
But we've found out that, out of 128 shards, there are 53 empty
shards (with 0 document inside), or, 40% of the shards are not used at all.

My question:

is this normal? Do we miss something in configuring routing?

does this imbalanced shard utilization affect indexing speed?

We can confirm that all documents are correctly indexed and routing
works (when searching with routing only 1 shard responds with the correct
answer).
Elasticsearch version is v1.0.1.

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/c8f96c79-76c1-42eb-8380-ecc044b44a7f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Kevin_Wang · March 27, 2014, 10:37am

You can add that as a plugin,
see Elasticsearch Platform — Find real-time answers at scale | Elastic

On Thursday, March 27, 2014 9:32:28 PM UTC+11, Han JU wrote:

Thanks but can you explain some detail?
Say I have the class in MyHashFunction.java, how could I put it in ES? I
need to modify the code of ES or ?

在 2014年3月27日星期四UTC+1上午11时27分24秒，Kevin Wang写道：

You can add a class that implements HashFunction and set the setting
"cluster.routing.operation.hash.type“ to that class.

Regards,
Kevin

On Thursday, March 27, 2014 9:11:39 PM UTC+11, Han JU wrote:

Do you guys know how to plug in a custom hash function for routing
parameter?

在 2014年3月26日星期三UTC+1下午12时51分24秒，Han JU写道：

Thanks a lot Kevin.

That DJB_HASH result makes it clear for us. I think we'll just use the
id value as hash.
Do you guys know how to plugin a custom hash function?

在 2014年3月26日星期三UTC+1上午11时58分36秒，Kevin Wang写道：

There are two hash functions
implementation org.elasticsearch.cluster.routing.operation.hash.djb.DjbHashFunction
and org.elasticsearch.cluster.routing.operation.hash.simple.SimpleHashFunction,
default is DjbHashFunction. You can try get the hash by
using DjbHashFunction.DJB_HASH(you id)

On Wednesday, March 26, 2014 9:49:10 PM UTC+11, Han JU wrote:

Thanks for your reply.

As far as I know, in Java, basic hash value of positive int/long
value is just themselves (our ids are small values like 1125, 345 etc).
So I calculated some_id % 128, and I got 116 distinct values. But in
reality there's a lot less shards in use.

Does Elasticsearch use some special hash function?

在 2014年3月26日星期三UTC+1上午11时39分15秒，Kevin Wang写道：

ES will get the shard id by hash(routing)%num of shards, in your
case, there are only 167 distinct values but have 128 shards, I think it's
highly possible there is less than 128 distinct hash values. So some of the
shard will not have any data.

Kevin

On Wednesday, March 26, 2014 9:30:36 PM UTC+11, Han JU wrote:

Hi,

We've indexed 25M documents into a single index of 128 shards with
1 replica.
The routing parameter is set to a path in the document, which is
an int value:

_routing: {
path: "some_id"
required: true
}

In out 25M documents, there's 167 distinct values of this "some_id"
and in our expectation, Elasticsearch will route these documents evenly
across all shards.
But we've found out that, out of 128 shards, there are 53 empty
shards (with 0 document inside), or, 40% of the shards are not used at all.

My question:

is this normal? Do we miss something in configuring routing?

does this imbalanced shard utilization affect indexing speed?

We can confirm that all documents are correctly indexed and routing
works (when searching with routing only 1 shard responds with the correct
answer).
Elasticsearch version is v1.0.1.

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/1ade0d24-ee99-4a00-8bf0-7a749271b58d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Topic		Replies	Views
How to distribute documents across shards equally, using _routing in Elasticsearch Elasticsearch	10	630	April 25, 2022
Documents not getting sharded evenly Elasticsearch	14	1617	July 5, 2017
[SOLVED] Customing document routing Elasticsearch	7	795	July 5, 2017
Problem with routing Elasticsearch has shard which has 0 document Elasticsearch	7	374	July 6, 2017
Just Pushed: API: Allow to control document shard routing, and search shard routing Elasticsearch	2	328	July 6, 2017

Shards/routing documents imbalance problem

Related topics