[RESOLVED] What SHOULD happen when a data node leaves a cluster? Help please

Hi all,

In my Dev environment of two data nodes, primaries = 1, replicas = 1, when I shut down one of the data nodes, all remaining shards are promoted to primaries (as to be expected I believe), the cluster state turns yellow (again as expected), indexing and searches continue.

What is interesting here is that the replicas remain unallocated. They don't get re-created to be on the one remaining active data node. Is this expected behavior?

What is more interesting is that in my Prod environment of 6 data nodes, primaries = 6, replicas = 1, when I stop one data node the cluster goes immediately red, not yellow, indexing stops (bad), and shards are VERY slowly initialized. After 5 minutes, the output of the cluster health API had not changed from this:

   "cluster_name": "elasticsearch-prod",
   "status": "red",
   "timed_out": false,
   "number_of_nodes": 12,
   "number_of_data_nodes": 5,
   "active_primary_shards": 2033,
   "active_shards": 3371,
   "relocating_shards": 0,
   "initializing_shards": 10,
   "unassigned_shards": 611,
   "number_of_pending_tasks": 0,
   "number_of_in_flight_fetch": 8

My most urgent question is why the active replicas on the remaining 5 data nodes did not instantly become primaries, keeping the cluster yellow instead of red, and allowing indexing to continue uninterrupted?

Secondly, why would the re-initializing of the shards be so slow? Even after I restarted the stopped data node, initializing was extremely slow for what should be (I think) a local operation.

I very much appreciate your thoughts on this. As it stands now, the cluster is very fragile when just one node going offline can cause all indexing to stop!

After about 10 minutes of initializing, the cluster finally got to a yellow state again, but there were still 515 unassigned shards:

   "cluster_name": "elasticsearch-prod",
   "status": "yellow",
   "timed_out": false,
   "number_of_nodes": 13,
   "number_of_data_nodes": 6,
   "active_primary_shards": 2047,
   "active_shards": 3465,
   "relocating_shards": 0,
   "initializing_shards": 12,
   "unassigned_shards": 515,
   "number_of_pending_tasks": 2,
   "number_of_in_flight_fetch": 8

100 shards in 10 minutes seems quite slow to me, and the fact that the cluster went red at all is very concerning. I must be doing something wrong. :smile:

Thank you again for your time and thoughts.

Ok, shameless bump of my own thread, but I'm very worried about my cluster stability given the behavior I'm seeing (see thread). Just hoping someone might be able to comment.

Many thanks.

Yup. It wouldn't help with stability to initialize a second copy of the same shard on a node because that node could go down and you'd lose all the copies. It wouldn't help with performance because it's not the files that are the bottleneck, it's the I/O bandwidth and CPU time. It'd hurt performance because there'd end up being two copies in the disk cache and two indexes maintained at once. So Elasticsearch doesn't do it.
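
As a rough illustration of that rule, here is a toy allocator in Python. It is only a sketch, not Elasticsearch's actual allocation code: the function name, the node names and the "least loaded node" choice are all made up for the example. The one thing it models is that two copies of the same shard never land on the same node.

```python
from collections import defaultdict

def allocate(shard_copies, nodes):
    """Toy allocator: never place two copies of the same shard on one node.

    shard_copies: list of (index, shard_number, role) tuples, role "p" or "r".
    nodes: names of the data nodes that are currently alive (hypothetical).
    Returns (assignments, unassigned).
    """
    placed = defaultdict(set)        # (index, shard) -> nodes holding a copy
    load = {n: 0 for n in nodes}     # copies per node, used to spread them out
    assignments, unassigned = {}, []
    for copy in shard_copies:
        index, shard, _role = copy
        candidates = [n for n in nodes if n not in placed[(index, shard)]]
        if not candidates:
            # Every live node already holds a copy of this shard.
            unassigned.append(copy)
            continue
        target = min(candidates, key=lambda n: load[n])
        placed[(index, shard)].add(target)
        load[target] += 1
        assignments[copy] = target
    return assignments, unassigned

copies = [("logs", 0, "p"), ("logs", 0, "r")]
print(allocate(copies, ["node-a", "node-b"]))  # both copies get a node
print(allocate(copies, ["node-a"]))            # the replica stays unassigned
```

With a single surviving node the replica has nowhere to go, which is the yellow state described for the two-node Dev cluster.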

I assume you mean that each of your indexes has six shards and one replica, hopefully giving you two-way redundancy.

They certainly should. Are you triple sure you have replicas set to 1? Check the /_cat/shards API. Check that all the extra copies exist and are allocated in a sane way.
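
One way to do that check without eyeballing thousands of lines is to feed the /_cat/shards text into a small script. The sketch below assumes the output was saved as plain text, that the first four columns are index, shard, prirep and state, and that the node name is the last column. The function name and the "rogue-index" row are invented for the example.

```python
from collections import defaultdict

def find_unprotected_shards(cat_shards_text):
    """Return (index, shard) pairs with fewer than two started copies
    on distinct nodes, given the text output of /_cat/shards."""
    nodes_by_shard = defaultdict(set)
    for line in cat_shards_text.splitlines():
        parts = line.split()
        if len(parts) < 4 or not parts[1].isdigit():
            continue  # blank line, header row, or something unexpected
        index, shard, _prirep, state = parts[:4]
        nodes = nodes_by_shard[(index, int(shard))]  # registers the shard
        if state == "STARTED" and len(parts) > 4:
            nodes.add(parts[-1])  # assumption: node name is the last column
    return sorted(k for k, nodes in nodes_by_shard.items() if len(nodes) < 2)

sample = """\
derbysoft-apache-20150715 2 p STARTED 79416 5.2mb elasticsearch-bdprodes09
derbysoft-apache-20150715 2 r STARTED 79416 5.1mb elasticsearch-bdprodes07
rogue-index 0 p STARTED 10 1kb elasticsearch-bdprodes05
"""
print(find_unprotected_shards(sample))
```

Any shard it reports has only one live copy, so losing the node that holds it would leave that shard without a primary.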

When you shut down the node was the cluster yellow? That could do it.
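
To make that concrete, here is a small sketch of how the health colour falls out of which copies survive a node loss. It relies only on the usual meaning of the colours (red: some primary is missing; yellow: all primaries are there but some replica is not; green: everything is allocated). The layout dictionaries and node names are hypothetical.

```python
def status_after_node_loss(layout, expected_copies, lost_node):
    """layout: {shard_id: [nodes holding a started copy of that shard]}
    expected_copies: primary plus replicas, e.g. 2 for replicas = 1."""
    worst = "green"
    for _shard, nodes in layout.items():
        survivors = [n for n in nodes if n != lost_node]
        if not survivors:
            return "red"      # no copy left that could act as primary
        if len(survivors) < expected_copies:
            worst = "yellow"  # a primary exists, but a replica is missing
    return worst

healthy = {0: ["n1", "n2"], 1: ["n2", "n3"]}
already_degraded = {0: ["n1", "n2"], 1: ["n2"]}  # shard 1 has lost its replica

print(status_after_node_loss(healthy, 2, "n2"))
print(status_after_node_loss(already_degraded, 2, "n2"))
```

In the first case every shard still has a copy to promote, so the worst outcome is yellow. In the second, shard 1 had only one copy to begin with, so losing n2 takes the cluster red.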

I run with 2 replicas on most of my indexes with the goal of having three-way redundancy, and the only time I had this happen was when I had something nasty take out multiple nodes. Or that time when I accidentally configured the number of replicas to 0.

It's because shard recovery is basically an rsync followed by a transaction log replay. The master copy of the shard and the replica copy drift apart over time. Sometimes a lot. And when the recovery process comes in it ends up having to copy a lot. The synced_flush feature in 1.6 is an effort to circumvent the copy, but it only works on indexes that haven't been written to while the node is down. That's what happens to lots of indexes in the logging use case, so it's a good start.
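
A very simplified model of that decision, as a Python sketch. This is not the real recovery code: the dictionary fields and step names are invented, and the only idea it captures is that the file copy can be skipped when both copies carry the same sync marker and the same document count, and is needed otherwise.

```python
def recovery_plan(primary, replica):
    """primary / replica: dicts with 'sync_id' (str or None) and 'num_docs'.
    Returns the list of steps this toy model would run."""
    same_marker = (
        primary["sync_id"] is not None
        and primary["sync_id"] == replica["sync_id"]
    )
    if same_marker and primary["num_docs"] == replica["num_docs"]:
        # Copies are known to be identical on disk: skip the rsync-like phase.
        return ["replay transaction log"]
    return ["copy differing segment files", "replay transaction log"]

idle_index = {"sync_id": "abc", "num_docs": 1525}
print(recovery_plan(idle_index, dict(idle_index)))

# Assumption for the sketch: writes on the primary since the synced flush
# mean its marker no longer matches the one on the stale replica.
written_to = {"sync_id": None, "num_docs": 1600}
print(recovery_plan(written_to, idle_index))
```

The second call is the expensive case: with no matching marker, the model falls back to copying files before replaying the log.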

Thank you Nik. Very much appreciate your detailed reply.

Makes total sense. Thank you for the explanation.

Yes, sorry for the poor terminology.

Triple sure. :slight_smile: Here's a snip of one of the indexes, but the remaining are all allocated like it:

derbysoft-apache-20150715      2 p STARTED     79416    5.2mb elasticsearch-bdprodes09
derbysoft-apache-20150715      2 r STARTED     79416    5.1mb elasticsearch-bdprodes07
derbysoft-apache-20150715      0 r STARTED     79363    5.1mb elasticsearch-bdprodes10
derbysoft-apache-20150715      0 p STARTED     79363    5.1mb elasticsearch-bdprodes06
derbysoft-apache-20150715      3 r STARTED     79658    5.2mb elasticsearch-bdprodes10
derbysoft-apache-20150715      3 p STARTED     79658    5.2mb elasticsearch-bdprodes07
derbysoft-apache-20150715      1 p STARTED     79268    5.1mb elasticsearch-bdprodes06
derbysoft-apache-20150715      1 r STARTED     79268    5.2mb elasticsearch-bdprodes09
derbysoft-apache-20150715      5 r STARTED     79633    5.2mb elasticsearch-bdprodes08
derbysoft-apache-20150715      5 p STARTED     79633    5.1mb elasticsearch-bdprodes05
derbysoft-apache-20150715      4 p STARTED     79163    5.1mb elasticsearch-bdprodes08
derbysoft-apache-20150715      4 r STARTED     79163    5.2mb elasticsearch-bdprodes05

elasticsearch-bdprodes05-10 are the 6 data nodes, so things seem to be quite nicely spread out amongst the 6.

Nope. It was definitely green before I stopped the node.

That's exactly what I was thinking would happen, and was so surprised when it did not! I was wondering if perhaps there was some config setting in my elasticsearch.yml file that might be causing things to behave this way, but I sure couldn't see anything that stood out. I can post it if it would be of use though.

Understood. Again, thanks for the explanation! What's strange about this is that I had stopped indexing, and what was taking the vast majority of the time was initializing the previous days' indexes (I keep 30 days open, then close everything else). It's just weird.

I really appreciate your help. Please do let me know if there is any more information I could supply that might help figure this out.


What version are you on?

Hi Mark,

I'm on 1.6.0.


What happens in your logs when you shut a node down?
It might be worth increasing the allocation logging level and trying this again and see what is reported.

Very little if I recall. The node is reported as removed by the master, then very slowly I see messages like this:

[2015-07-15 19:28:01,271][INFO ][indices.recovery         ] [elasticsearch-bdprodes09] Recovery with sync ID 1525 numDocs: 1525 vs. true

Do you happen to know what logger value corresponds to the allocation logging? I can certainly increase it and do the test again. Should I increase the log level on all node types, or just the data nodes?


Welp, I figured it out. The whole "are you triple sure you have replicas set to 1" was bothering me, so I looked more closely at the _cat/indices API, and sure enough, there were some rogue indexes that had been created outside of the normal process that had 0 replicas. Grumble.
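
For anyone hitting the same thing, this is roughly the check, sketched in Python. It assumes the _cat/indices output was fetched with a header row and includes the index and rep columns (for example with ?v&h=index,rep). The helper names and the "rogue-index" row are made up, and the script only builds the update-settings request rather than sending it.

```python
import json

def zero_replica_indexes(cat_indices_text):
    """Return names of indexes whose 'rep' column is 0."""
    rows = [l.split() for l in cat_indices_text.splitlines() if l.strip()]
    header, body = rows[0], rows[1:]
    name_col, rep_col = header.index("index"), header.index("rep")
    needed = max(name_col, rep_col)
    return [r[name_col] for r in body if len(r) > needed and r[rep_col] == "0"]

def replica_fix_request(index_name, replicas=1):
    """Build (method, path, body) for the index update-settings API."""
    path = "/%s/_settings" % index_name
    body = json.dumps({"index": {"number_of_replicas": replicas}})
    return "PUT", path, body

sample = """\
index                     rep
derbysoft-apache-20150715 1
rogue-index               0
"""
for name in zero_replica_indexes(sample):
    print(replica_fix_request(name))
```

Each printed tuple is a request that would raise the replica count for one of the offending indexes.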

I changed those to 1, stopped a data node, and it behaved exactly as it should. Replicas became Primaries, the cluster never went Red, just to Yellow, and indexing/searching continued as if nothing had happened.

Beautiful. :smile:
Thank you Mark and Nik for your insights, and steering me to find the problem!

I'm glad it all worked out!