Shards/routing documents imbalance problem


(Han JU) #1

Hi,

We've indexed 25M documents into a single index of 128 shards with 1
replica.
The routing parameter is set to a path in the document, which is an int
value:

"_routing": {
  "path": "some_id",
  "required": true
}

In our 25M documents there are 167 distinct values of this "some_id", and we
expected ElasticSearch to route these documents evenly across all shards.
But we've found that, out of 128 shards, 53 are empty (0 documents inside),
so about 40% of the shards are not used at all.

My questions:

  • Is this normal? Are we missing something in the routing configuration?
  • Does this imbalanced shard utilization affect indexing speed?

We can confirm that all documents are correctly indexed and routing works
(when searching with routing only 1 shard responds with the correct answer).
ElasticSearch version is v1.0.1.

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f7e86ae2-14a8-4381-842d-53adf59ec43d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Kevin Wang) #2

ES computes the shard id as hash(routing) % number of shards. In your case
there are only 167 distinct routing values for 128 shards, so I think it's
highly likely that they map to fewer than 128 distinct shard ids. That means
some of the shards will not have any data.

Kevin
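
As a rough sanity check on the point above: even an ideal, perfectly uniform hash would leave a good number of shards empty when only 167 distinct routing values are spread over 128 shards. The sketch below is a back-of-the-envelope estimate, not anything taken from Elasticsearch itself. The class and method names are made up for illustration, and the uniform-hash model is an assumption.

```java
final class EmptyShardEstimate {

    /**
     * Expected number of shards that receive no routing value at all,
     * assuming every distinct key lands on a uniformly random shard.
     */
    static double expectedEmpty(int shards, int distinctKeys) {
        // A single key misses a given shard with probability (1 - 1/shards);
        // all keys miss it with that probability raised to distinctKeys.
        return shards * Math.pow(1.0 - 1.0 / shards, distinctKeys);
    }

    public static void main(String[] args) {
        // Numbers from the thread: 128 shards, 167 distinct routing values.
        double empty = expectedEmpty(128, 167);
        System.out.println("expected empty shards: " + empty);
    }
}
```

Under these assumptions the estimate comes out at roughly 35 empty shards. The 53 empty shards reported in the thread are noticeably more than that, which suggests the hash is spreading these particular values less evenly than a uniform one would.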


(Han JU) #3

Thanks for your reply.

As far as I know, in Java the basic hash value of a positive int/long is
just the value itself (our ids are small values like 1125, 345, etc.).
So I calculated some_id % 128 and got 116 distinct values. But in
reality far fewer shards are in use.

Does ElasticSearch use some special hash function?


(Kevin Wang) #4

There are two hash function implementations,
org.elasticsearch.cluster.routing.operation.hash.djb.DjbHashFunction
and org.elasticsearch.cluster.routing.operation.hash.simple.SimpleHashFunction;
the default is DjbHashFunction. You can try computing the hash yourself
with DjbHashFunction.DJB_HASH(your id).
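
To see why some_id % 128 does not predict the shard, here is a minimal standalone sketch of a DJB2-style hash applied to the routing value as a string. It does not call Elasticsearch. The hash body is the classic "hash * 33 + char" form starting from 5381, and the shard formula abs(hash % shards) is how I understand the 1.x behaviour. Both are assumptions to verify against the 1.0.1 source. The names djbHash and shardFor are hypothetical.

```java
final class DjbRoutingSketch {

    /** DJB2-style string hash: hash = hash * 33 + char, truncated to 32 bits. */
    static int djbHash(String value) {
        long hash = 5381;
        for (int i = 0; i < value.length(); i++) {
            // (hash << 5) + hash is hash * 33
            hash = ((hash << 5) + hash) + value.charAt(i);
        }
        return (int) hash;
    }

    /** Assumed shard selection: absolute value of the hash modulo the shard count. */
    static int shardFor(String routing, int numShards) {
        return Math.abs(djbHash(routing) % numShards);
    }

    public static void main(String[] args) {
        int numShards = 128;
        // The two ids mentioned in the thread; the routing value is hashed as text.
        for (String id : new String[] {"1125", "345"}) {
            int naive = Integer.parseInt(id) % numShards;
            System.out.println("id " + id
                    + ": id % 128 = " + naive
                    + ", djb shard = " + shardFor(id, numShards));
        }
    }
}
```

With these assumptions, "1125" lands on shard 46 and "345" on shard 65, while id % 128 would give 101 and 89. The hash is computed over the characters of the id, not its numeric value, so counting distinct values of some_id % 128 says little about how many shards actually get used.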


(Han JU) #5

Thanks a lot, Kevin.

That DJB_HASH result makes it clear to us. I think we'll just use the id
value as the hash.
Do you guys know how to plug in a custom hash function?


(Han JU) #6

Do you guys know how to plug in a custom hash function for the routing
parameter?


(Kevin Wang) #7

You can add a class that implements HashFunction and set the setting
"cluster.routing.operation.hash.type" to that class.

Regards,
Kevin
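
A minimal sketch of what such a class might look like, for the "use the id value as hash" idea from earlier in the thread. The HashFunction interface declared here is a local stand-in so the example compiles on its own. Its two-method shape is an assumption about the 1.x interface and should be checked against the Elasticsearch source. The class name NumericIdHashFunction, the fallback to String.hashCode() and the way type and id are combined are all illustrative choices, not anything Elasticsearch prescribes.

```java
/** Local stand-in for the Elasticsearch interface; the real one may differ. */
interface HashFunction {
    int hash(String routing);

    int hash(String type, String id);
}

/** Hypothetical hash that uses a numeric routing value directly as the hash. */
final class NumericIdHashFunction implements HashFunction {

    @Override
    public int hash(String routing) {
        try {
            // Routing values such as "1125" hash to 1125.
            return Integer.parseInt(routing);
        } catch (NumberFormatException e) {
            // Non-numeric routing values fall back to the plain Java string hash.
            return routing.hashCode();
        }
    }

    @Override
    public int hash(String type, String id) {
        // Arbitrary combination for the no-routing case; an assumption, not ES behaviour.
        return 31 * type.hashCode() + hash(id);
    }

    public static void main(String[] args) {
        HashFunction f = new NumericIdHashFunction();
        System.out.println(f.hash("1125") % 128);
    }
}
```

In a real deployment the class would implement the actual Elasticsearch interface and would have to be on the classpath of every node, since all nodes must agree on the shard for a given routing value. Changing the hash function on an existing index would also send lookups to different shards than the ones the documents were indexed into, so this only makes sense before indexing or together with a full reindex.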


(Han JU) #8

Thanks, but can you explain in a bit more detail?
Say I have the class in MyHashFunction.java. How do I get it into ES? Do I
need to modify the ES code, or is there another way?


(Kevin Wang) #9

You can add it as a plugin; see
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html

