Unassigned Shards

Hi,

After restarting a cluster, we see a very slow recovery of the shards.

Cluster health:
{
"active_primary_shards": 173,
"active_shards": 173,
"cluster_name": "engagor",
"initializing_shards": 12,
"number_of_data_nodes": 3,
"number_of_nodes": 3,
"relocating_shards": 0,
"status": "red",
"timed_out": false,
"unassigned_shards": 13285
}
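This output comes from the cluster health API, e.g.:

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'

(host and port being whatever the node listens on).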

The unassigned shards get assigned quite slowly. See the following gist
with our config: https://gist.github.com/1007741

Is this normal? No data is actually corrupted. Is it possible to bring
up a cluster without the need to do recovery?

Thanks
Folke

I really need some help on this one.

Since the initial post, only an additional 600 shards have been
assigned (over the last hour).

Does this mean it would take a total of about 23 hours to initialize the
whole cluster? (That is basically slower than reindexing all our data.)

I must be doing something wrong here...

Would using a shared fs gateway solve this?

Thanks in advance,
Folke

Which version are you using?


Hi Shay,

I'm using Elasticsearch Version: 0.16.1 (2011-05-12T17:19:28), JVM:
20.0-b11

After enabling DEBUG logging on one node and restarting that node, I can
see the following in the logs on the master:
--------[shard7480-2011-1][1], node[null], [R], s[UNASSIGNED]
--------[shard7480-2011-1][1], node[null], [P], s[UNASSIGNED]
--------[shard7480-2011-1][2], node[null], [R], s[UNASSIGNED]
--------[shard7480-2011-1][2], node[null], [P], s[UNASSIGNED]
--------[shard8028-2011-5][0], node[null], [R], s[UNASSIGNED]
--------[shard8028-2011-5][0], node[null], [P], s[UNASSIGNED]
--------[shard8028-2011-5][1], node[null], [R], s[UNASSIGNED]
--------[shard8028-2011-5][1], node[null], [P], s[UNASSIGNED]
--------[shard8028-2011-5][2], node[null], [R], s[UNASSIGNED]

This is printed in the logs right after the node has joined the
master. After this, all the shards on the node are added very slowly.
Strangely enough, recovering a shard from local only takes roughly 50ms;
it's the time in between that takes long. It seems there is a delay of
exactly 7 seconds between shard recoveries?

Here are some entries from the DEBUG logging on the node (not the
master node):
[2011-06-04 14:30:14,883][DEBUG][index.gateway ] [Killer Shrike] [shard294-2011-4][2] starting recovery from local ...
[2011-06-04 14:30:14,886][DEBUG][index.gateway ] [Killer Shrike] [shard294-2011-4][2] recovery completed from local, took [2ms]
    index    : files [11] with total_size [425.3kb], took [1ms]
             : recovered_files [0] with total_size [0b]
             : reusing_files [11] with total_size [425.3kb]
    translog : number_of_operations [0], took [2ms]
[2011-06-04 14:30:21,806][DEBUG][index.gateway ] [Killer Shrike] [shard294-2011-3][0] starting recovery from local ...
[2011-06-04 14:30:21,809][DEBUG][index.gateway ] [Killer Shrike] [shard294-2011-3][0] recovery completed from local, took [3ms]
    index    : files [20] with total_size [475.5kb], took [0s]
             : recovered_files [0] with total_size [0b]
             : reusing_files [20] with total_size [475.5kb]
    translog : number_of_operations [0], took [2ms]
[2011-06-04 14:30:28,249][DEBUG][index.gateway ] [Killer Shrike] [shard294-2011-3][1] starting recovery from local ...
[2011-06-04 14:30:28,251][DEBUG][index.gateway ] [Killer Shrike] [shard294-2011-3][1] recovery completed from local, took [2ms]
    index    : files [12] with total_size [505.6kb], took [0s]
             : recovered_files [0] with total_size [0b]
             : reusing_files [12] with total_size [505.6kb]
    translog : number_of_operations [0], took [2ms]

Thanks
Folke


Throttling seems to be the issue I'm having. See the following gist
for debug logs from the master: https://gist.github.com/1007894

The logs get spammed very fast with these kinds of entries.

Is there a setting I should change here?

Thanks in advance
Folke

By default, it will throttle to 4 concurrent primary allocations per node (the primaries being the important ones you want to get allocated as fast as possible). You can set cluster.routing.allocation.node_initial_primaries_recoveries to a higher value and it will cause more shards to be allocated concurrently.

This throttling is done so a machine will not be overloaded; it might make sense in your case to use a higher value.
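For example, in elasticsearch.yml on each node (the value here is just an example, tune it to what your machines can handle):

cluster.routing.allocation.node_initial_primaries_recoveries: 20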


Also, how many indices do you have? For such small indices, make sure to just allocate one shard per index (you have a lot of shards).
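For example, when creating an index you can set this explicitly (the index name and replica count here are just placeholders):

curl -XPUT 'http://localhost:9200/my_index/' -d '{
  "settings" : {
    "index.number_of_shards" : 1,
    "index.number_of_replicas" : 1
  }
}'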


We currently have roughly 4000 indices with 3 shards each. Each index
will contain approximately 100k docs.

I have been testing with the node_initial_primaries_recoveries setting.
Checking the health afterwards, it only makes a difference in the number
of initializing shards:
{
"active_primary_shards": 404,
"active_shards": 404,
"cluster_name": "engagor",
"initializing_shards": 300,
"number_of_data_nodes": 3,
"number_of_nodes": 3,
"relocating_shards": 0,
"status": "red",
"timed_out": false,
"unassigned_shards": 12766
}

Shards still become active very slowly (roughly one every 5 to 7
seconds across the three nodes).

In the logs there seems to be an issue with starting shards (bouncing
back and forth between master and node).

See master logs for 1 specific index/shard:
[2011-06-04 16:42:20,289][DEBUG][cluster.action.shard ] [Washout] received shard started for [shard3309-2011-3][0], node[xIB_TnLmRX27a2m7U4n9kA], [P], s[INITIALIZING], reason [master [Washout][xR0LOX4eS2WUnsoUFmfssg][inet[/10.10.10.2:9300]] marked shard as initializing, but shard already started, mark shard as started]
[2011-06-04 16:42:25,044][DEBUG][cluster.action.shard ] [Washout] received shard started for [shard3309-2011-3][0], node[xIB_TnLmRX27a2m7U4n9kA], [P], s[INITIALIZING], reason [master [Washout][xR0LOX4eS2WUnsoUFmfssg][inet[/10.10.10.2:9300]] marked shard as initializing, but shard already started, mark shard as started]
[2011-06-04 16:42:29,966][DEBUG][cluster.action.shard ] [Washout] received shard started for [shard3309-2011-3][0], node[xIB_TnLmRX27a2m7U4n9kA], [P], s[INITIALIZING], reason [master [Washout][xR0LOX4eS2WUnsoUFmfssg][inet[/10.10.10.2:9300]] marked shard as initializing, but shard already started, mark shard as started]
[2011-06-04 16:42:34,770][DEBUG][cluster.action.shard ] [Washout] received shard started for [shard3309-2011-3][0], node[xIB_TnLmRX27a2m7U4n9kA], [P], s[INITIALIZING], reason [master [Washout][xR0LOX4eS2WUnsoUFmfssg][inet[/10.10.10.2:9300]] marked shard as initializing, but shard already started, mark shard as started]

For the same index/shard on one of the nodes:
[2011-06-04 16:42:13,064][DEBUG][cluster.action.shard ] [Vishanti] sending shard started for [shard3309-2011-3][0], node[xIB_TnLmRX27a2m7U4n9kA], [P], s[INITIALIZING], reason [master [Washout][xR0LOX4eS2WUnsoUFmfssg][inet[/10.10.10.2:9300]] marked shard as initializing, but shard already started, mark shard as started]
[2011-06-04 16:42:17,897][DEBUG][cluster.action.shard ] [Vishanti] sending shard started for [shard3309-2011-3][0], node[xIB_TnLmRX27a2m7U4n9kA], [P], s[INITIALIZING], reason [master [Washout][xR0LOX4eS2WUnsoUFmfssg][inet[/10.10.10.2:9300]] marked shard as initializing, but shard already started, mark shard as started]
[2011-06-04 16:42:22,637][DEBUG][cluster.action.shard ] [Vishanti] sending shard started for [shard3309-2011-3][0], node[xIB_TnLmRX27a2m7U4n9kA], [P], s[INITIALIZING], reason [master [Washout][xR0LOX4eS2WUnsoUFmfssg][inet[/10.10.10.2:9300]] marked shard as initializing, but shard already started, mark shard as started]
[2011-06-04 16:42:27,574][DEBUG][cluster.action.shard ] [Vishanti] sending shard started for [shard3309-2011-3][0], node[xIB_TnLmRX27a2m7U4n9kA], [P], s[INITIALIZING], reason [master [Washout][xR0LOX4eS2WUnsoUFmfssg][inet[/10.10.10.2:9300]] marked shard as initializing, but shard already started, mark shard as started]
[2011-06-04 16:42:32,378][DEBUG][cluster.action.shard ] [Vishanti] sending shard started for [shard3309-2011-3][0], node[xIB_TnLmRX27a2m7U4n9kA], [P], s[INITIALIZING], reason [master [Washout][xR0LOX4eS2WUnsoUFmfssg][inet[/10.10.10.2:9300]] marked shard as initializing, but shard already started, mark shard as started]


Think I finally figured out the issue.

The very slow recovery and the bouncing back and forth of the
"started" messages are caused by
InternalClusterService.updateTasksExecutor.

The executor is declared as:
newSingleThreadExecutor(daemonThreadFactory(settings, "clusterService#updateTask"));

Since this is single threaded on the master, it is simply not fast
enough to cope with incoming changes.

Events processed here that take multiple seconds in my case are:

  • processing [routing-table-updater]
  • shard started (after recovery from gateway)

Is there any way to work around this? Could I easily change the code
and drop in a multi-threaded executor instead?
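(What I have in mind is roughly swapping it for something like

newFixedThreadPool(4, daemonThreadFactory(settings, "clusterService#updateTask"));

where 4 is an arbitrary number; I have no idea yet whether the cluster state updates are actually safe to run on more than one thread.)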

Thanks
Folke

I've made some adjustments to the ES code and this is indeed the issue.
After adding some logging of the updateTasksQueue size, this is the
result (after running for a few minutes):
[2011-06-04 18:24:20,140][DEBUG][cluster.service ] [Korath the Pursuer] updateTasksQueue: 73689
[2011-06-04 18:24:20,140][DEBUG][cluster.service ] [Korath the Pursuer] updateTasksQueue: 73690
[2011-06-04 18:24:20,140][DEBUG][cluster.service ] [Korath the Pursuer] updateTasksQueue: 73690
[2011-06-04 18:24:20,140][DEBUG][cluster.service ] [Korath the Pursuer] updateTasksQueue: 73692
[2011-06-04 18:24:20,140][DEBUG][cluster.service ] [Korath the Pursuer] updateTasksQueue: 73692
[2011-06-04 18:24:20,140][DEBUG][cluster.service ] [Korath the Pursuer] updateTasksQueue: 73694
[2011-06-04 18:24:20,140][DEBUG][cluster.service ] [Korath the Pursuer] updateTasksQueue: 73694
[2011-06-04 18:24:20,140][DEBUG][cluster.service ] [Korath the Pursuer] updateTasksQueue: 73696
[2011-06-04 18:24:20,140][DEBUG][cluster.service ] [Korath the Pursuer] updateTasksQueue: 73697
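(The logging itself is just a one-liner I added before each update task is executed, roughly:

logger.debug("updateTasksQueue: " + updateTasksQueue.size());

where updateTasksQueue is the pending-task queue I wired in for this test; the field name is mine, not something that exists in the stock code.)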


Hey,

This cluster state update handler is single threaded by design. The idea is that there is then no need to worry about concurrent modifications while changes happen in the cluster (shards starting, nodes coming and going). Having to cope with multi-threaded changes to the same cluster state while maintaining its immutable design is really not possible (it was like that in the initial design of elasticsearch, before 0.4 was released, and proved very problematic).

But the idea is that those changes all happen in memory, with easily cacheable state from other nodes if needed. I suspect what's happening now is that there is a slew of changes because, today, once a shard has started (as reported from another node), it asks for a recalculation of the cluster state. If we batch those requests, we can considerably reduce the number of changes required.
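Conceptually, the batching would be something along these lines (just a sketch of the idea in plain Java, not actual elasticsearch code):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BatchedShardStarted {

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the "shard started" notifications arriving from the nodes.
        BlockingQueue<String> startedShards = new LinkedBlockingQueue<String>();
        for (int i = 0; i < 1000; i++) {
            startedShards.put("shard" + i + "[0]");
        }

        // Single consumer thread (like the cluster update task thread), but instead
        // of one cluster state recalculation per event, drain everything that is
        // already queued and apply it in a single pass.
        List<String> batch = new ArrayList<String>();
        while (!startedShards.isEmpty()) {
            batch.clear();
            batch.add(startedShards.take());   // wait for at least one event
            startedShards.drainTo(batch);      // grab whatever else is pending
            System.out.println("recalculating cluster state once for " + batch.size() + " started shards");
        }
    }
}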

I will try and first simulate what you are having, and see where things can be improved. Can you open an issue for this?

-shay.banon
