Not recovering all shards, why?

Hi,

I am seeing something strange with the latest master. (The log output below can also be found here: https://gist.github.com/930879)

I create a new index (let's call it "dummyIndex") with 3 shards and 1 replica. Because the cluster has only one node, the status is always yellow, which is fine.
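
For context, the index is not created explicitly; the log below shows cause [auto(index api)], i.e. it is auto-created by the first indexing request and picks up the configured defaults (3 shards / 1 replica here), with the "mail" mapping derived dynamically. Roughly like this minimal sketch, where the host/port and the document fields are just placeholders, not my real client code:

import json
import urllib.request

BASE = "http://localhost:9200"  # assumed node address, adjust as needed

def index_doc(index, doc_type, doc_id, doc):
    # PUT a document; if the index does not exist yet it is auto-created
    # with the node defaults and the type mapping is derived dynamically.
    req = urllib.request.Request(
        "%s/%s/%s/%s" % (BASE, index, doc_type, doc_id),
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# placeholder document; the real documents follow the dynamic "mail" mapping
print(index_doc("dummyIndex", "mail", "1", {"subject": "hello", "body": "test"}))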

[2011-04-20 05:28:41,658][INFO ][cluster.metadata ] [Hellstrom, Patsy] [dummyIndex] creating index, cause [auto(index api)], shards [3]/[1], mappings [mail]
[2011-04-20 05:28:41,688][DEBUG][index.gateway ] [Hellstrom, Patsy] [dummyIndex][0] starting recovery from local ...
[2011-04-20 05:28:41,693][DEBUG][index.gateway ] [Hellstrom, Patsy] [dummyIndex][0] recovery completed from local, took [6ms]
    index    : files [0] with total_size [0b], took[0s]
             : recovered_files [0] with total_size [0b]
             : reusing_files [0] with total_size [0b]
    translog : number_of_operations [0], took [5ms]
[2011-04-20 05:28:41,713][DEBUG][index.gateway ] [Hellstrom, Patsy] [dummyIndex][1] starting recovery from local ...
[2011-04-20 05:28:41,719][DEBUG][index.gateway ] [Hellstrom, Patsy] [dummyIndex][1] recovery completed from local, took [6ms]
    index    : files [0] with total_size [0b], took[0s]
             : recovered_files [0] with total_size [0b]
             : reusing_files [0] with total_size [0b]
    translog : number_of_operations [0], took [5ms]
[2011-04-20 05:28:41,739][DEBUG][index.gateway ] [Hellstrom, Patsy] [dummyIndex][2] starting recovery from local ...
[2011-04-20 05:28:41,744][DEBUG][index.gateway ] [Hellstrom, Patsy] [dummyIndex][2] recovery completed from local, took [5ms]
    index    : files [0] with total_size [0b], took[0s]
             : recovered_files [0] with total_size [0b]
             : reusing_files [0] with total_size [0b]
    translog : number_of_operations [0], took [4ms]
[2011-04-20 05:28:41,821][INFO ][cluster.metadata ] [Hellstrom, Patsy] [dummyIndex] update_mapping [mail] (dynamic)

Then I index some data into it and shut down the cluster (the node is using the default local gateway).
Later I start it again, but one shard is not recovered. Why?

[2011-04-20 05:37:27,050][DEBUG][gateway.local ] [Vashti] [dummyIndex][0]: throttling allocation [[dummyIndex][0], node[null], [P], s[UNASSIGNED]] to [[Vashti][X0Hk2SAVTfOBjVUcMAPOCQ][inet[/10.16.88.223:9300]]] on primary allocation
[2011-04-20 05:37:27,051][DEBUG][gateway.local ] [Vashti] [dummyIndex][1]: throttling allocation [[dummyIndex][1], node[null], [P], s[UNASSIGNED]] to [[Vashti][X0Hk2SAVTfOBjVUcMAPOCQ][inet[/10.16.88.223:9300]]] on primary allocation
[2011-04-20 05:37:27,051][DEBUG][gateway.local ] [Vashti] [dummyIndex][2]: not allocating, number_of_allocated_shards_found [0], required_number [1]
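
To see exactly which shard stays unassigned after the restart, a shard-level cluster health call is handy. A minimal sketch, again assuming a node at localhost:9200 (not part of my actual setup):

import json
import urllib.request

BASE = "http://localhost:9200"  # assumed node address

# level=shards breaks the health report down per shard, so the missing
# primary (here [dummyIndex][2]) shows up as unassigned.
url = BASE + "/_cluster/health/dummyIndex?level=shards"
with urllib.request.urlopen(url) as resp:
    health = json.loads(resp.read())

index_health = health["indices"]["dummyIndex"]
print("index status:", index_health["status"])
for shard_id in sorted(index_health["shards"], key=int):
    shard = index_health["shards"][shard_id]
    print("shard %s: status=%s primary_active=%s"
          % (shard_id, shard["status"], shard["primary_active"]))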

As a side note, there are more indices created the same way on this node (about 8 others) and they are always recovered correctly; only this specific index is not. It may have something to do with the fact that when I am indexing data into the cluster, some threads on my client are hanging (I am not sure whether these threads are indexing into dummyIndex, I will try to find out). However, even with the hanging threads, I see that the status of all indices is yellow, not red, before the shutdown.
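
Since the node uses the default local gateway, it is probably also worth checking whether shard [2] ever got its data written to disk before the restart. A rough sketch; the directory layout below (path.data/<cluster_name>/nodes/0/indices/<index>/<shard>) is only my assumption about the default local gateway layout, so adjust it to the real data directory:

import os

DATA_ROOT = "data"               # placeholder for path.data
CLUSTER_NAME = "elasticsearch"   # placeholder for cluster.name
INDEX = "dummyIndex"

index_dir = os.path.join(DATA_ROOT, CLUSTER_NAME, "nodes", "0", "indices", INDEX)

# List each shard directory and count the Lucene files in its index folder;
# a primary that never made it to disk would show up empty or missing here.
for shard in sorted(d for d in os.listdir(index_dir) if d.isdigit()):
    index_files_dir = os.path.join(index_dir, shard, "index")
    files = os.listdir(index_files_dir) if os.path.isdir(index_files_dir) else []
    print("shard %s: %d files in %s" % (shard, len(files), index_files_dir))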

Regards,
Lukas

If the status is yellow, then those indices should be recovered. What we can see is that there is a missing shard, so it's not being recovered. Are you sure that it was yellow before you shut it down?

Absolutely, it was yellow for all shards. Strange... I can reproduce this on my local machine, but it would probably be very hard to provide a recreation script :(

On Wed, Apr 20, 2011 at 12:11 PM, Shay Banon shay.banon@elasticsearch.com wrote:

If the status is yellow, then those indices should be recovered. What we can see is that there is a missing shard, so it's not being recovered. Are you sure that it was yellow before you shut it down?
