Shards remain "unassigned " after server restart


(Eran Kutner-2) #1

Hi,
I've restarted a server in a cluster. After that the sards that used to be
on that server remain unassigned.
I tried to shut it down completly so the master will relocate those shards
but it doesn't. The master logs says something like the following for each
unassigned shard:
[2012-07-15 06:32:48,021][DEBUG][cluster.routing.allocation] [es1-02]
[6090122_m][1] allocated on
[[es1-03][a6JkCp4jRJ2ccRl9wxNRvw][inet[/10.1.101.153:9300]]{data_center=ny4}],
but can no longer be allocated on it, moving...
[2012-07-15 06:32:48,021][DEBUG][cluster.routing.allocation] [es1-02]
[6090122_m][1] can't move
[2012-07-15 06:32:48,022][DEBUG][cluster.routing.allocation] [es1-02]
[6129251_s][1] allocated on
[[es1-02][-EURPWy_RqicD_9koodbVA][inet[/10.1.101.152:9300]]{data_center=ny4}],
but can no longer be allocated on it, moving...
[2012-07-15 06:32:48,022][DEBUG][cluster.routing.allocation] [es1-02]
[6129251_s][1] can't move

Until eventually it say:
[2012-07-15 06:45:48,972][DEBUG][cluster.service ] [es1-02]
processing [routing-table-updater]: no change in cluster_state

Note that the server I shutdown is es1-01 not the ones the log is taking
about.

-eran


(Shay Banon) #2

Can you share your special configuration in terms of allocation? And
possibly the full log?

On Sun, Jul 15, 2012 at 12:48 PM, Eran eran@gigya-inc.com wrote:

Hi,
I've restarted a server in a cluster. After that the sards that used to be
on that server remain unassigned.
I tried to shut it down completly so the master will relocate those shards
but it doesn't. The master logs says something like the following for each
unassigned shard:
[2012-07-15 06:32:48,021][DEBUG][cluster.routing.allocation] [es1-02]
[6090122_m][1] allocated on
[[es1-03][a6JkCp4jRJ2ccRl9wxNRvw][inet[/10.1.101.153:9300]]{data_center=ny4}],
but can no longer be allocated on it, moving...
[2012-07-15 06:32:48,021][DEBUG][cluster.routing.allocation] [es1-02]
[6090122_m][1] can't move
[2012-07-15 06:32:48,022][DEBUG][cluster.routing.allocation] [es1-02]
[6129251_s][1] allocated on
[[es1-02][-EURPWy_RqicD_9koodbVA][inet[/10.1.101.152:9300]]{data_center=ny4}],
but can no longer be allocated on it, moving...
[2012-07-15 06:32:48,022][DEBUG][cluster.routing.allocation] [es1-02]
[6129251_s][1] can't move

Until eventually it say:
[2012-07-15 06:45:48,972][DEBUG][cluster.service ] [es1-02]
processing [routing-table-updater]: no change in cluster_state

Note that the server I shutdown is es1-01 not the ones the log is taking
about.

-eran


(Eran Kutner-2) #3

Here is my config: https://gist.github.com/3121344
All the servers in the cluster are now local, and connected to the same
high speed network segment (no more remote servers in a secondary data
center).
The logs are pretty large so i'm not sure uploading all of them will be
helpful. Is there anything in particular I should be looking for?

To give more context on the shard allocation/initialization problem, it
seems to be random, every time I restart a server (or the entire cluster) a
different set of shards will be loaded and a few will remain unallocated or
stuck in "initializing" state. Once in a few tries it actually manages to
load all of them. The version of this cluster is 0.19.7 but I've had the
same problem with a 0.19.8 cluster. In the other cluster case I didn't even
restart it, just started with a clean cluster and pushed data (also
creating a lot of indexes in the process), after a while it turned
red/yellow and remained stuck on yellow, until I restarted the cluster a
few times and then it managed to load the all the shards. Usually once all
the shards have been loaded they remain stable.

By the way, when I came in this morning one of the nodes decided it
couldn't discover the master. I'm certain that if I restart that server it
will discover the master just fine.

-eran

On Monday, July 16, 2012 1:57:39 AM UTC+3, kimchy wrote:

Can you share your special configuration in terms of allocation? And
possibly the full log?

On Sun, Jul 15, 2012 at 12:48 PM, Eran wrote:

Hi,
I've restarted a server in a cluster. After that the sards that used to
be on that server remain unassigned.
I tried to shut it down completly so the master will relocate those
shards but it doesn't. The master logs says something like the following
for each unassigned shard:
[2012-07-15 06:32:48,021][DEBUG][cluster.routing.allocation] [es1-02]
[6090122_m][1] allocated on
[[es1-03][a6JkCp4jRJ2ccRl9wxNRvw][inet[/10.1.101.153:9300]]{data_center=ny4}],
but can no longer be allocated on it, moving...
[2012-07-15 06:32:48,021][DEBUG][cluster.routing.allocation] [es1-02]
[6090122_m][1] can't move
[2012-07-15 06:32:48,022][DEBUG][cluster.routing.allocation] [es1-02]
[6129251_s][1] allocated on
[[es1-02][-EURPWy_RqicD_9koodbVA][inet[/10.1.101.152:9300]]{data_center=ny4}],
but can no longer be allocated on it, moving...
[2012-07-15 06:32:48,022][DEBUG][cluster.routing.allocation] [es1-02]
[6129251_s][1] can't move

Until eventually it say:
[2012-07-15 06:45:48,972][DEBUG][cluster.service ] [es1-02]
processing [routing-table-updater]: no change in cluster_state

Note that the server I shutdown is es1-01 not the ones the log is taking
about.

-eran


(Eran Kutner-2) #4

This is still happening. No matter what I try (and I've tried quite a few
things), I can't get the cluster to load all the shards. The best I could
get to was 4 shards that we're left unassigned.

The logs show a lot of things, don't know which ones are relevant so I'll
just post a sample of the messages I see, but keep in mind that there are
many more similar messages in the log:
[2012-07-19 16:21:59,867][DEBUG][gateway.local ] [es1-02]
[6144971_xs][0]: throttling allocation [[6144971_xs][0], node[null], [R],
s[UNASSIGNED]] to
[[es1-04][opCgkyHhRmaX47hMysvw9Q][inet[/10.1.101.154:9300]]{data_center=ny4,
max_local_storage_nodes=1, master=true}] in order to reuse its unallocated
persistent store with total_size [615.2kb]
[2012-07-19 16:21:59,867][TRACE][gateway.local ] [es1-02]
[6087422_s][1], node[null], [R], s[UNASSIGNED]: checking node
[[es1-03][D7zFpmccSbmObxfuYT0W7A][inet[/10.1.101.153:9300]]{data_center=ny4,
max_local_storage_nodes=1, master=true}]
[2012-07-19 16:21:59,868][TRACE][gateway.local ] [es1-02]
[6087422_s][1], node[null], [R], s[UNASSIGNED]: checking node
[[es1-02][LGPVC4ZDSz6ZP9M5GVTY5Q][inet[/10.1.101.152:9300]]{data_center=ny4,
max_local_storage_nodes=1, master=true}]
[2012-07-19 16:21:59,868][TRACE][gateway.local ] [es1-02]
[6087422_s][1], node[null], [R], s[UNASSIGNED]: checking node
[[es1-01][e_SR_BfIQnKAljzB7DjgLA][inet[/10.1.101.151:9300]]{data_center=ny4,
max_local_storage_nodes=1, master=true}]
[2012-07-19 16:21:59,868][TRACE][gateway.local ] [es1-02]
[6087422_s][1], node[null], [R], s[UNASSIGNED]: checking node
[[es1-04][opCgkyHhRmaX47hMysvw9Q][inet[/10.1.101.154:9300]]{data_center=ny4,
max_local_storage_nodes=1, master=true}]
[2012-07-19 16:21:59,869][DEBUG][gateway.local ] [es1-02]
[6087422_s][1]: throttling allocation [[6087422_s][1], node[null], [R],
s[UNASSIGNED]] to
[[es1-04][opCgkyHhRmaX47hMysvw9Q][inet[/10.1.101.154:9300]]{data_center=ny4,
max_local_storage_nodes=1, master=true}] in order to reuse its unallocated
persistent store with total_size [177.2kb]

[2012-07-19 16:30:57,328][DEBUG][indices.memory ] [es1-02]
marking shard [6090102_m][2] as inactive (inactive_time[30m]) indexing
wise, setting size to [500kb]
[2012-07-19 16:30:57,330][DEBUG][indices.memory ] [es1-02]
marking shard [6090172_m][1] as inactive (inactive_time[30m]) indexing
wise, setting size to [500kb]
[2012-07-19 16:30:57,330][DEBUG][indices.memory ] [es1-02]
marking shard [6173411_s][0] as inactive (inactive_time[30m]) indexing
wise, setting size to [500kb]
[2012-07-19 16:30:57,330][DEBUG][indices.memory ] [es1-02]
marking shard [6090032_m][1] as inactive (inactive_time[30m]) indexing
wise, setting size to [500kb]
[2012-07-19 16:30:57,331][DEBUG][indices.memory ] [es1-02]
marking shard [6190951_l][4] as inactive (inactive_time[30m]) indexing
wise, setting size to [500kb]
[2012-07-19 16:30:57,331][DEBUG][indices.memory ] [es1-02]
marking shard [6186901_m][2] as inactive (inactive_time[30m]) indexing
wise, setting size to [500kb]
[2012-07-19 16:30:57,331][DEBUG][indices.memory ] [es1-02]
marking shard [6181771_m][4] as inactive (inactive_time[30m]) indexing
wise, setting size to [500kb]
[2012-07-19 16:30:57,333][DEBUG][index.engine.robin ] [es1-02]
[6090102_m][2] using bloom filter enhanced delete handling
[2012-07-19 16:30:57,336][DEBUG][index.engine.robin ] [es1-02]
[6090172_m][1] using bloom filter enhanced delete handling
[2012-07-19 16:30:57,341][DEBUG][index.engine.robin ] [es1-02]
[6173411_s][0] using bloom filter enhanced delete handling
[2012-07-19 16:30:57,354][DEBUG][index.engine.robin ] [es1-02]
[6090032_m][1] using bloom filter enhanced delete handling
[2012-07-19 16:30:57,360][DEBUG][index.engine.robin ] [es1-02]
[6190951_l][4] using bloom filter enhanced delete handling
[2012-07-19 16:30:57,362][DEBUG][index.engine.robin ] [es1-02]
[6186901_m][2] using bloom filter enhanced delete handling
[2012-07-19 16:30:57,364][DEBUG][index.engine.robin ] [es1-02]
[6181771_m][4] using bloom filter enhanced delete handling
[2012-07-19 16:30:57,365][DEBUG][indices.memory ] [es1-02]
recalculating shard indexing buffer (reason=shards became active/inactive
(indexing wise)), total is [5.9gb] with [61] active shards, each shard set
to [100.6mb]
[2012-07-19 16:31:27,366][DEBUG][indices.memory ] [es1-02]
marking shard [6090112_m][3] as active indexing wise
[2012-07-19 16:31:27,367][DEBUG][indices.memory ] [es1-02]
recalculating shard indexing buffer (reason=shards became active/inactive
(indexing wise)), total is [5.9gb] with [62] active shards, each shard set
to [99mb]
[2012-07-19 16:31:57,368][DEBUG][indices.memory ] [es1-02]
marking shard [6190951_l][1] as inactive (inactive_time[30m]) indexing
wise, setting size to [500kb]
[2012-07-19 16:31:57,371][DEBUG][index.engine.robin ] [es1-02]
[6190951_l][1] using bloom filter enhanced delete handling
[2012-07-19 16:31:57,372][DEBUG][indices.memory ] [es1-02]
recalculating shard indexing buffer (reason=shards became active/inactive
(indexing wise)), total is [5.9gb] with [61] active shards, each shard set
to [100.6mb]

-eran

On Monday, July 16, 2012 10:38:15 AM UTC+3, Eran wrote:

Here is my config: https://gist.github.com/3121344
All the servers in the cluster are now local, and connected to the same
high speed network segment (no more remote servers in a secondary data
center).
The logs are pretty large so i'm not sure uploading all of them will be
helpful. Is there anything in particular I should be looking for?

To give more context on the shard allocation/initialization problem, it
seems to be random, every time I restart a server (or the entire cluster) a
different set of shards will be loaded and a few will remain unallocated or
stuck in "initializing" state. Once in a few tries it actually manages to
load all of them. The version of this cluster is 0.19.7 but I've had the
same problem with a 0.19.8 cluster. In the other cluster case I didn't even
restart it, just started with a clean cluster and pushed data (also
creating a lot of indexes in the process), after a while it turned
red/yellow and remained stuck on yellow, until I restarted the cluster a
few times and then it managed to load the all the shards. Usually once all
the shards have been loaded they remain stable.

By the way, when I came in this morning one of the nodes decided it
couldn't discover the master. I'm certain that if I restart that server it
will discover the master just fine.

-eran

On Monday, July 16, 2012 1:57:39 AM UTC+3, kimchy wrote:

Can you share your special configuration in terms of allocation? And
possibly the full log?

On Sun, Jul 15, 2012 at 12:48 PM, Eran wrote:

Hi,
I've restarted a server in a cluster. After that the sards that used to
be on that server remain unassigned.
I tried to shut it down completly so the master will relocate those
shards but it doesn't. The master logs says something like the following
for each unassigned shard:
[2012-07-15 06:32:48,021][DEBUG][cluster.routing.allocation] [es1-02]
[6090122_m][1] allocated on
[[es1-03][a6JkCp4jRJ2ccRl9wxNRvw][inet[/10.1.101.153:9300]]{data_center=ny4}],
but can no longer be allocated on it, moving...
[2012-07-15 06:32:48,021][DEBUG][cluster.routing.allocation] [es1-02]
[6090122_m][1] can't move
[2012-07-15 06:32:48,022][DEBUG][cluster.routing.allocation] [es1-02]
[6129251_s][1] allocated on
[[es1-02][-EURPWy_RqicD_9koodbVA][inet[/10.1.101.152:9300]]{data_center=ny4}],
but can no longer be allocated on it, moving...
[2012-07-15 06:32:48,022][DEBUG][cluster.routing.allocation] [es1-02]
[6129251_s][1] can't move

Until eventually it say:
[2012-07-15 06:45:48,972][DEBUG][cluster.service ] [es1-02]
processing [routing-table-updater]: no change in cluster_state

Note that the server I shutdown is es1-01 not the ones the log is taking
about.

-eran


(Eran Kutner-2) #5

After enough restarts I was able to get the cluster to load all shards and
go to "green". But something is definitely messed up with how ES is loading
shards after a restart.
I've uploaded a gist of all the server logs for one shard that refused to
load for a few attempts. These logs represent multiple server restarts but
they do show some bizarre errors.

On Thursday, July 19, 2012 11:32:24 PM UTC+3, Eran wrote:

This is still happening. No matter what I try (and I've tried quite a few
things), I can't get the cluster to load all the shards. The best I could
get to was 4 shards that we're left unassigned.

The logs show a lot of things, don't know which ones are relevant so I'll
just post a sample of the messages I see, but keep in mind that there are
many more similar messages in the log:
[2012-07-19 16:21:59,867][DEBUG][gateway.local ] [es1-02]
[6144971_xs][0]: throttling allocation [[6144971_xs][0], node[null], [R],
s[UNASSIGNED]] to
[[es1-04][opCgkyHhRmaX47hMysvw9Q][inet[/10.1.101.154:9300]]{data_center=ny4,
max_local_storage_nodes=1, master=true}] in order to reuse its unallocated
persistent store with total_size [615.2kb]
[2012-07-19 16:21:59,867][TRACE][gateway.local ] [es1-02]
[6087422_s][1], node[null], [R], s[UNASSIGNED]: checking node
[[es1-03][D7zFpmccSbmObxfuYT0W7A][inet[/10.1.101.153:9300]]{data_center=ny4,
max_local_storage_nodes=1, master=true}]
[2012-07-19 16:21:59,868][TRACE][gateway.local ] [es1-02]
[6087422_s][1], node[null], [R], s[UNASSIGNED]: checking node
[[es1-02][LGPVC4ZDSz6ZP9M5GVTY5Q][inet[/10.1.101.152:9300]]{data_center=ny4,
max_local_storage_nodes=1, master=true}]
[2012-07-19 16:21:59,868][TRACE][gateway.local ] [es1-02]
[6087422_s][1], node[null], [R], s[UNASSIGNED]: checking node
[[es1-01][e_SR_BfIQnKAljzB7DjgLA][inet[/10.1.101.151:9300]]{data_center=ny4,
max_local_storage_nodes=1, master=true}]
[2012-07-19 16:21:59,868][TRACE][gateway.local ] [es1-02]
[6087422_s][1], node[null], [R], s[UNASSIGNED]: checking node
[[es1-04][opCgkyHhRmaX47hMysvw9Q][inet[/10.1.101.154:9300]]{data_center=ny4,
max_local_storage_nodes=1, master=true}]
[2012-07-19 16:21:59,869][DEBUG][gateway.local ] [es1-02]
[6087422_s][1]: throttling allocation [[6087422_s][1], node[null], [R],
s[UNASSIGNED]] to
[[es1-04][opCgkyHhRmaX47hMysvw9Q][inet[/10.1.101.154:9300]]{data_center=ny4,
max_local_storage_nodes=1, master=true}] in order to reuse its unallocated
persistent store with total_size [177.2kb]

[2012-07-19 16:30:57,328][DEBUG][indices.memory ] [es1-02]
marking shard [6090102_m][2] as inactive (inactive_time[30m]) indexing
wise, setting size to [500kb]
[2012-07-19 16:30:57,330][DEBUG][indices.memory ] [es1-02]
marking shard [6090172_m][1] as inactive (inactive_time[30m]) indexing
wise, setting size to [500kb]
[2012-07-19 16:30:57,330][DEBUG][indices.memory ] [es1-02]
marking shard [6173411_s][0] as inactive (inactive_time[30m]) indexing
wise, setting size to [500kb]
[2012-07-19 16:30:57,330][DEBUG][indices.memory ] [es1-02]
marking shard [6090032_m][1] as inactive (inactive_time[30m]) indexing
wise, setting size to [500kb]
[2012-07-19 16:30:57,331][DEBUG][indices.memory ] [es1-02]
marking shard [6190951_l][4] as inactive (inactive_time[30m]) indexing
wise, setting size to [500kb]
[2012-07-19 16:30:57,331][DEBUG][indices.memory ] [es1-02]
marking shard [6186901_m][2] as inactive (inactive_time[30m]) indexing
wise, setting size to [500kb]
[2012-07-19 16:30:57,331][DEBUG][indices.memory ] [es1-02]
marking shard [6181771_m][4] as inactive (inactive_time[30m]) indexing
wise, setting size to [500kb]
[2012-07-19 16:30:57,333][DEBUG][index.engine.robin ] [es1-02]
[6090102_m][2] using bloom filter enhanced delete handling
[2012-07-19 16:30:57,336][DEBUG][index.engine.robin ] [es1-02]
[6090172_m][1] using bloom filter enhanced delete handling
[2012-07-19 16:30:57,341][DEBUG][index.engine.robin ] [es1-02]
[6173411_s][0] using bloom filter enhanced delete handling
[2012-07-19 16:30:57,354][DEBUG][index.engine.robin ] [es1-02]
[6090032_m][1] using bloom filter enhanced delete handling
[2012-07-19 16:30:57,360][DEBUG][index.engine.robin ] [es1-02]
[6190951_l][4] using bloom filter enhanced delete handling
[2012-07-19 16:30:57,362][DEBUG][index.engine.robin ] [es1-02]
[6186901_m][2] using bloom filter enhanced delete handling
[2012-07-19 16:30:57,364][DEBUG][index.engine.robin ] [es1-02]
[6181771_m][4] using bloom filter enhanced delete handling
[2012-07-19 16:30:57,365][DEBUG][indices.memory ] [es1-02]
recalculating shard indexing buffer (reason=shards became active/inactive
(indexing wise)), total is [5.9gb] with [61] active shards, each shard set
to [100.6mb]
[2012-07-19 16:31:27,366][DEBUG][indices.memory ] [es1-02]
marking shard [6090112_m][3] as active indexing wise
[2012-07-19 16:31:27,367][DEBUG][indices.memory ] [es1-02]
recalculating shard indexing buffer (reason=shards became active/inactive
(indexing wise)), total is [5.9gb] with [62] active shards, each shard set
to [99mb]
[2012-07-19 16:31:57,368][DEBUG][indices.memory ] [es1-02]
marking shard [6190951_l][1] as inactive (inactive_time[30m]) indexing
wise, setting size to [500kb]
[2012-07-19 16:31:57,371][DEBUG][index.engine.robin ] [es1-02]
[6190951_l][1] using bloom filter enhanced delete handling
[2012-07-19 16:31:57,372][DEBUG][indices.memory ] [es1-02]
recalculating shard indexing buffer (reason=shards became active/inactive
(indexing wise)), total is [5.9gb] with [61] active shards, each shard set
to [100.6mb]

-eran

On Monday, July 16, 2012 10:38:15 AM UTC+3, Eran wrote:

Here is my config: https://gist.github.com/3121344
All the servers in the cluster are now local, and connected to the same
high speed network segment (no more remote servers in a secondary data
center).
The logs are pretty large so i'm not sure uploading all of them will be
helpful. Is there anything in particular I should be looking for?

To give more context on the shard allocation/initialization problem, it
seems to be random, every time I restart a server (or the entire cluster) a
different set of shards will be loaded and a few will remain unallocated or
stuck in "initializing" state. Once in a few tries it actually manages to
load all of them. The version of this cluster is 0.19.7 but I've had the
same problem with a 0.19.8 cluster. In the other cluster case I didn't even
restart it, just started with a clean cluster and pushed data (also
creating a lot of indexes in the process), after a while it turned
red/yellow and remained stuck on yellow, until I restarted the cluster a
few times and then it managed to load the all the shards. Usually once all
the shards have been loaded they remain stable.

By the way, when I came in this morning one of the nodes decided it
couldn't discover the master. I'm certain that if I restart that server it
will discover the master just fine.

-eran

On Monday, July 16, 2012 1:57:39 AM UTC+3, kimchy wrote:

Can you share your special configuration in terms of allocation? And
possibly the full log?

On Sun, Jul 15, 2012 at 12:48 PM, Eran wrote:

Hi,
I've restarted a server in a cluster. After that the sards that used to
be on that server remain unassigned.
I tried to shut it down completly so the master will relocate those
shards but it doesn't. The master logs says something like the following
for each unassigned shard:
[2012-07-15 06:32:48,021][DEBUG][cluster.routing.allocation] [es1-02]
[6090122_m][1] allocated on
[[es1-03][a6JkCp4jRJ2ccRl9wxNRvw][inet[/10.1.101.153:9300]]{data_center=ny4}],
but can no longer be allocated on it, moving...
[2012-07-15 06:32:48,021][DEBUG][cluster.routing.allocation] [es1-02]
[6090122_m][1] can't move
[2012-07-15 06:32:48,022][DEBUG][cluster.routing.allocation] [es1-02]
[6129251_s][1] allocated on
[[es1-02][-EURPWy_RqicD_9koodbVA][inet[/10.1.101.152:9300]]{data_center=ny4}],
but can no longer be allocated on it, moving...
[2012-07-15 06:32:48,022][DEBUG][cluster.routing.allocation] [es1-02]
[6129251_s][1] can't move

Until eventually it say:
[2012-07-15 06:45:48,972][DEBUG][cluster.service ] [es1-02]
processing [routing-table-updater]: no change in cluster_state

Note that the server I shutdown is es1-01 not the ones the log is
taking about.

-eran


(system) #6