Interesting: the flapping node had equilibrated with two empty shards
and two full ones:
hoolock-data_esnode-9 /mnt$ du -mc /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/*/index | sort -t/ -k9,10
4120  total
2068  /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/0/index
2053  /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/1/index
1     /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/14/index
1     /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/2/index
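A quick way to pick the empty copies out of a listing like that one is to filter the `du -m` output through awk and flag any shard index directory under some size threshold. The function name and the 10 MB cutoff below are my own choices for illustration, not anything elasticsearch defines:

```shell
# Flag shard index directories that are suspiciously small in `du -m` output.
# Usage: du -m /mnt/elasticsearch/data/<cluster>/nodes/0/indices/*/*/index | flag_small_shards
flag_small_shards() {
  awk -v limit=10 '$2 ~ /\/index$/ && $1 < limit {
    path = $2
    sub(/.*\/indices\//, "", path)   # keep just <index-name>/<shard>/index
    print path " (" $1 " MB)"
  }'
}
```

The `total` line from `du -c` has no trailing `/index`, so it falls through the filter untouched.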
The cluster showed two recovering shards.
I shut that node down and moved its data directory completely away.
When it started up, it recreated the directory structure and adopted
exactly the same set of shards, getting complete copies of 0 and 1 and
incomplete copies of 14 and 2, same as before.
Then I restarted the process in place -- so that the new process came
up before the rest of the cluster started missing the esnode. It
started by recovering shard 2:
hoolock-data_esnode-9 /mnt$ du -mc /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/*/index | sort -t/ -k9,10
5125  total
2068  /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/0/index
2053  /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/1/index
1     /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/14/index
1005  /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/2/index
While still in the process of doing that, it discarded shards 0 and 14
as redundant:
hoolock-data_esnode-9 /mnt$ du -mc /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/*/index | sort -t/ -k9,10
3638  total
2053  /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/1/index
1586  /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/2/index   ### <-- still in recovery
By the time it had finished recovering shard 2, it had also discarded
shard 1:
hoolock-data_esnode-9 /mnt$ du -mc /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/*/index | sort -t/ -k9,10
2100  total
2100  /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/2/index
It then started recovering shard 0 (!):
hoolock-data_esnode-9 /mnt$ du -mc /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/*/index | sort -t/ -k9,10
2372  total
273   /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/0/index
2100  /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/2/index
and then recovered shard 14 too:
hoolock-data_esnode-9 /mnt$ du -mc /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/*/index | sort -t/ -k9,10
6205  total
2068  /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/0/index
2037  /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/14/index
2100  /mnt/elasticsearch/data/hoolock/nodes/0/indices/tweet-201010/2/index
It's now happy and healthy, with no relocating shards and no zero-length shards.
However, I think there's something causing elasticsearch to be too
aggressive about initiating recovery, and/or something causing cluster
balancing to "resonate".
flip
On Dec 22, 7:30 pm, mrflip mrf...@gmail.com wrote:
We're seeing a node in the ring flapping in recovery:
"shard recovery is flapping" (GitHub gist)
As you can see, it's caught in a loop where it starts recovering shard
[tweet-201010][3]:
[01:03:32,962][DEBUG][indices.memory          ] [Centurion] recalculating shard indexing buffer (reason=created_shard[tweet-201010][3]), total is [512mb] with [3] shards, each shard set to [170.6mb]
[01:03:32,962][DEBUG][index.shard.service     ] [Centurion] [tweet-201010][3] state: [CREATED]->[RECOVERING]
and then very shortly discards it:
[01:03:32,968][DEBUG][index.shard.service     ] [Centurion] [tweet-201010][3] state: [RECOVERING]->[CLOSED]
[01:03:32,971][DEBUG][indices.memory          ] [Centurion] recalculating shard indexing buffer (reason=removed_shard[tweet-201010][3]), total is [512mb] with [3] shards, each shard set to [170.6mb]
This cycles as often as one or two times a minute.
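To put a number on how often it cycles, the RECOVERING->CLOSED transitions in the DEBUG log can be counted per shard with plain grep. The log file path is whatever your installation uses; only the pattern is taken from the lines above:

```shell
# Count RECOVERING->CLOSED flaps per shard in an elasticsearch log file.
# Usage: count_flaps /var/log/elasticsearch/elasticsearch.log  (path is an assumption)
count_flaps() {
  grep -o '\[[^]]*\]\[[0-9]*\] state: \[RECOVERING\]->\[CLOSED\]' "$1" |
    sed 's/ state.*//' | sort | uniq -c
}
```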
The cluster shows as green, with one relocating shard:
{
"active_primary_shards": 32,
"number_of_nodes": 16,
"number_of_data_nodes": 16,
"unassigned_shards": 0,
"timed_out": false,
"active_shards": 64,
"initializing_shards": 0,
"relocating_shards": 1,
"status": "green",
"cluster_name": "hoolock"
}
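For watching whether that relocation ever settles, a crude grep over the cluster health response is enough; no JSON parser needed. The helper name is mine, and the `localhost:9200` endpoint in the usage line is the assumed default:

```shell
# Pull one numeric field out of a cluster-health JSON response.
# Usage: curl -s localhost:9200/_cluster/health | health_field relocating_shards
health_field() {
  grep -o "\"$1\"[: ]*[0-9]*" | grep -o '[0-9]*$'
}
```

A watch loop is then just `while [ "$(curl -s localhost:9200/_cluster/health | health_field relocating_shards)" != "0" ]; do sleep 5; done` -- in our case it never reaches 0.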
flip