Restarting one of the nodes resulted in unassigned shards

Hi all,

I have a 2-node ES cluster with this configuration: http://sprunge.us/gOIa

If I restart one of the nodes, I get a lot of uninitialized shards, and
it takes around 1 hour for them to recover (right now I have 4 indices,
60-80GB in size). Is that expected? What are the possible reasons this
might be happening?

Before the crash, I saw that out of the 4 primary shards, 2 were on the
first node and 2 on the second. After the long initialization that
followed the restart, I can see (via es-head) that all the primary
shards for the index are on the same node. And now restarting doesn't
cause any issues (no uninitialized shards). Can anyone explain how
primary shards are allocated among the nodes, and what would have caused
the uninitialized shards to be created?

--
Cheers,
Abhijeet R



When you take down one of the nodes, ES will promote the shards on
the remaining node to primaries. When you bring back up the second
node, ES will begin syncing the primaries to replicas on the new
node. This phase occurs basically as fast as your network connection
between the nodes allows. ES is rarely a bottleneck here.
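
A rough way to watch that phase from the outside is the cluster health
API; this is just a sketch, adjust host/port for your setup:

curl -s 'localhost:9200/_cluster/health?pretty'
# While one node is down the cluster typically sits at yellow (replicas
# unassigned). Once the node rejoins, watch initializing_shards drain to
# 0 and status go back to green as the replicas resync.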

-Drew


Hi Drew,

Thanks for your reply. Is there a way to avoid this? Also, I noticed
that once all 4 primary shards for an index were on one node, the
restart didn't mess anything up. If that's the case, then why aren't
primary shards being distributed again?

Also, is there a way to disable all this so that I can survive restarts
without this reshuffling?


--
Regards,
Abhijeet Rastogi (shadyabhi)


Abhijeet Rastogi wrote:

Thanks for your reply. Is there a way to avoid this?

It depends on how the shards' data have changed. Currently ES looks
for Lucene segment divergence, which for a large index could mean
that it doesn't have to resync the bigger segments. It's very likely
that it will have to resync most of the shard though, especially
after a big merge.

In your case, since your node crashed, any index that was being written
to might have been corrupted anyway and needed to be resynced.
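
If your version has the indices segments API, you can get a rough sense
of how the segment files line up across the two copies of a shard with
something like the following ("your_index" is just a placeholder):

curl -s 'localhost:9200/your_index/_segments?pretty'
# Lists the Lucene segments per shard copy (names, doc counts, sizes).
# Segments that are identical on both copies are the ones recovery can
# reuse; anything that has diverged gets copied over again.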

Also, I noticed that once all 4 primary shards for an index were on one
node, the restart didn't mess anything up. If that's the case, then why
aren't primary shards being distributed again?

ES only strives to distribute unique copies of the data, not to
distribute primaries. Using the cluster reroute API[1], however, you can
cancel the allocation of a primary, which will effectively swap it with
its replica.

Here's a quick example. I like to use es2unix[2] for visualizing shard
distribution; the JSON API makes it hard to see what's going on at a
glance. You could also use a browser tool, but that's hard to paste into
email.

I have an index "wiki" with five shards and one replica:

% es shards wiki
wiki 0 r STARTED 20019 218.9mb 229571296 127.0.0.1 Scanner    
wiki 0 p STARTED 20019 208.4mb 218617666 127.0.0.1 Lucas Brand
wiki 1 p STARTED 19898 210.3mb 220577145 127.0.0.1 Raza       
wiki 1 r STARTED 19898 208.4mb 218612909 127.0.0.1 Lucas Brand
wiki 2 r STARTED 19985 215.5mb 226006668 127.0.0.1 Scanner    
wiki 2 p STARTED 19985   221mb 231736530 127.0.0.1 Lucas Brand
wiki 3 p STARTED 20034 222.9mb 233803424 127.0.0.1 Scanner    
wiki 3 r STARTED 20034 220.5mb 231221871 127.0.0.1 Raza       
wiki 4 p STARTED 20064 222.7mb 233578869 127.0.0.1 Raza       
wiki 4 r STARTED 20064 214.3mb 224810852 127.0.0.1 Lucas Brand

Narrowing down to the 0-th shard:

% es shards | grep ^wiki\ 0
wiki 0 r STARTED 20019 218.9mb 229571296 127.0.0.1 Scanner    
wiki 0 p STARTED 20019 208.4mb 218617666 127.0.0.1 Lucas Brand

The primary is on Lucas Brand (the third column of the output; use
--verbose for the column names). I can cancel its allocation there with:

curl -s -XPOST localhost:9200/_cluster/reroute -d '{
   "commands" : [
      {
         "cancel" : {
            "allow_primary" : true,
            "index" : "wiki",
            "shard" : 0,
            "node" : "Lucas Brand"
         }
      }
   ]
}'

This returns the new cluster state routing output (pass dry_run if you
only want to see the result without actually changing the cluster).
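
For reference, the dry-run variant would look something like this (same
body, just the query-string flag added; adjust host/port as needed):

curl -s -XPOST 'localhost:9200/_cluster/reroute?dry_run=true' -d '{
   "commands" : [
      {
         "cancel" : {
            "allow_primary" : true,
            "index" : "wiki",
            "shard" : 0,
            "node" : "Lucas Brand"
         }
      }
   ]
}'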

Now shard 0 looks like:

% es shards | grep ^wiki\ 0
wiki 0 p STARTED 20019 218.9mb 229571296 127.0.0.1 Scanner    
wiki 0 r STARTED 20019 208.4mb 218617666 127.0.0.1 Lucas Brand

ES canceled the primary on Lucas Brand, looked around for a replica
to use as the primary, and picked Scanner. Note that if you have
more than one replica, ES will pick one for you.

Also, is there a way to disable all this so that I can survive restarts
without this reshuffling?

Don't fear the reshuffling. You can do some tweaking with the
cluster allocation config options[3], but I would suggest in this
case not to worry about it. When you restart a node, ES should
quickly get to a yellow health level where you can search and index
while reallocation happens in the background.
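
If you do want to hold allocation still around a planned restart, here
is a rough sketch of the usual approach (this assumes your version
supports the cluster update settings API and the disable_allocation
setting; newer releases use cluster.routing.allocation.enable instead):

# Before restarting the node, tell the cluster not to reallocate shards:
curl -XPUT localhost:9200/_cluster/settings -d '{
   "transient" : {
      "cluster.routing.allocation.disable_allocation" : true
   }
}'

# ...restart the node and wait for it to rejoin the cluster...

# Then re-enable allocation so the replicas can recover:
curl -XPUT localhost:9200/_cluster/settings -d '{
   "transient" : {
      "cluster.routing.allocation.disable_allocation" : false
   }
}'

Recovery still has to copy whatever diverged while the node was away,
but it avoids shards getting shuffled around in the meantime.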

It's usually better to let ES allocate and resync data for you. You
can waste a lot of time fixing something that's not an actual
problem.

-Drew

Footnotes:
[1] Cluster reroute API, Elasticsearch reference (elastic.co)
[2] es2unix, command-line Elasticsearch: https://github.com/elastic/es2unix
[3] Cluster-level shard allocation settings, Elasticsearch reference (elastic.co)

