Elasticsearch not reassigning shards after node rejoined

Hi

My cluster (running Elasticsearch 6.4.3) consists of three nodes running on separate VMs. One VM crashed. After it was brought back to life, the node rejoined the cluster, but shard reassignment did not start. CPU usage on the rejoining node is marginal. What's more, the _cat and _nodes APIs seem to be unavailable - something must have gotten stuck (I found a similar problem here: CAT api doesn't respond).
I don't see any relevant information in the logs (only that the node rejoined the cluster).

Cluster health is as follows:

{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 3609,
  "active_shards" : 7225,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 3586,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 66.81771941181911
}

My hot threads: https://pastebin.com/SstwLiFU

Manually re-enabling shard allocation did not help:

curl -X PUT '10.10.10.40:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{"transient":{"cluster.routing.allocation.enable":"all"}}'

How do I proceed further? How do I diagnose the problem?

Thanks!

Which of the nodes is the one you just restarted? It looks like elastic-data-2 is doing absolutely nothing so maybe that one?

If you shut the restarted node down do the stuck APIs come back to life? And if you restart it they stop again?

Can you restart the node with logger.org.elasticsearch.transport: TRACE in its config file so we can perhaps see whether it's receiving anything useful on the network? I can't immediately think of a way a node could be stuck like this.
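For reference, the TRACE logger can be set either statically (in the config file, requiring a restart) or dynamically via the cluster settings API - a sketch, reusing the node address from the earlier curl:

```shell
# Static option: add this line to elasticsearch.yml on the restarted node,
# then restart it:
#   logger.org.elasticsearch.transport: TRACE

# Dynamic option (no restart needed, assuming the settings API still responds):
curl -X PUT '10.10.10.40:9200/_cluster/settings' -H 'Content-Type: application/json' \
  -d '{"transient":{"logger.org.elasticsearch.transport":"trace"}}'
```

Transport-level TRACE logging is very verbose, so remember to set it back to its default (by setting the logger to null) once the logs have been captured.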

Could you share the logs from (a) the master and (b) the restarted node? It might not look like there's much of interest there but it'd be good to be sure.

@DavidTurner thanks for replying.

Which of the nodes is the one you just restarted?

The restarted node is elastic-data-1.

If you shut the restarted node down do the stuck APIs come back to life?

Unfortunately they do not.

Can you restart the node with logger.org.elasticsearch.transport: TRACE in its config file so we can perhaps see whether it's receiving anything useful on the network? I can't immediately think of a way a node could be stuck like this.

Sure, here you go: https://www58.zippyshare.com/v/3tMpUKpp/file.html

Could you share the logs from (a) the master

Sure, here you go: https://pastebin.com/4r7CYKDb

I see. Looking harder at the hot threads, it looks like the master node is overwhelmed by stats requests. It only has 5 threads to perform these requests but they can be quite expensive. Does the cluster recover if you stop these requests?

How do I do that? I can't recall issuing such request manually.

It looks like you are self-monitoring this cluster, so disabling monitoring would help. Also you've been running things like GET _cat/nodes which does a stats request (amongst other things) so try stopping those.
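If the cluster is self-monitoring via X-Pack, collection can be switched off cluster-wide with a single dynamic setting (available since 6.3) - a sketch, again assuming the node address from earlier:

```shell
# Disable X-Pack self-monitoring collection across the cluster:
curl -X PUT '10.10.10.40:9200/_cluster/settings' -H 'Content-Type: application/json' \
  -d '{"persistent":{"xpack.monitoring.collection.enabled":false}}'
```

Setting it back to true re-enables collection once the cluster has recovered.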

There are no GET curls currently running on any of the nodes. I killed the Prometheus exporters running on the master node - no visible change in CPU usage. Hot threads after the kill: https://pastebin.com/QAa9X2tx

edit: What I managed to notice is that after I restart the problematic node (elastic-data-1), the number of unassigned_shards goes down by a few (<10), then gets stuck again. So it would seem that I'm only a thousand restarts away from restoring the cluster to full health :smiley:

I wonder if there's a backlog of stats requests that the master is working through even though nothing is waiting on them any more. It's a bit hard to tell since we'd have to call the stats API to find out how big the backlog is, or else start to look at heap dumps.

Are you in a position where you can do a full cluster restart? If so, that's probably the simplest way to clear this backlog.

I was hoping to avoid that. Having the cluster not accept data for a minute or so while it restarts is acceptable, but since the cluster is in a sort of a... strange state, I fear data loss. What would be the safe option here?

The other option I can think of right now is to wait for the master to work its way through the backlog of stats requests, and hope that nothing else goes wrong in that time. The trouble is that these requests run on a fairly small threadpool and I can't think of a way to work out how big the backlog is without using this same threadpool so we can't really judge progress.

Or rather, I can, but it's not pretty. You could take a heap dump and look for a org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor with a name ending in /management and then count the items in its workQueue field. Then take another heap dump and look for the same thing and see how far we've progressed.
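A possible way to take those heap dumps with the standard JDK tooling (a sketch - the process-lookup pattern is an assumption about how Elasticsearch was started, so verify the PID first):

```shell
# Find the Elasticsearch JVM and dump its heap (run as the user owning the process):
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
jmap -dump:live,format=b,file=/tmp/es-heap-1.hprof "$ES_PID"

# Wait a few minutes, take a second dump, then compare the workQueue sizes of the
# EsThreadPoolExecutor whose name ends in /management in a tool such as Eclipse MAT:
jmap -dump:live,format=b,file=/tmp/es-heap-2.hprof "$ES_PID"
```

Note that `-dump:live` triggers a full GC first, which will briefly pause the node.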

Well, later in the evening the master-node Elasticsearch crashed on its own (supposedly after being stuck for a while in the unknown state we discussed), which caused the master role to be assigned to a different node.

Long story short:

  • all three nodes are now in the cluster
  • the _nodes API is still veeery slow to respond (in fact the query has been running for ten minutes or so with still no response)
  • cluster is in yellow state with shards being reassigned
  • shard reassignment is very slow, as the current master reassigns shards using only one thread/core at 100% usage (although the VM is a multi-core one)

The question: can I make shard reassignment run on multiple threads (and if so, how?), so that it does not take a couple of days (the current projection)?
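For reference, recovery concurrency and throughput are governed by a few dynamic settings; a sketch of how they could be raised (the values shown are illustrative, not recommendations - raising them puts more load on an already struggling cluster):

```shell
# Allow more concurrent incoming/outgoing recoveries per node (default 2)
# and raise the per-node recovery bandwidth cap (default 40mb):
curl -X PUT '10.10.10.40:9200/_cluster/settings' -H 'Content-Type: application/json' \
  -d '{
    "transient": {
      "cluster.routing.allocation.node_concurrent_recoveries": 4,
      "indices.recovery.max_bytes_per_sec": "80mb"
    }
  }'
```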

I am concerned that there is something quite wrong with the infrastructure under this cluster, or else that it's still receiving excessively many heavy monitoring requests. Can you get another thread dump? Are all the management threads on elastic-data-0 still busy serving stats requests? Are there any OS-level indications that IO is running very slowly on this node?

I am focussing on this rather than the speed of recoveries of the shards because I think this is the more fundamental problem. GET _nodes should be very quick, and if it is not then it's to be expected that lots of other things will also get jammed up too.

I believe you might have been correct in suggesting that the problem lies on the elastic-data-0 node. Sometime mid-lunch today, the Elasticsearch process on this node crashed. When I noticed that, I decided to restart the whole node (VM). The restart took long minutes (much longer than it should have), presumably with some deadlocks / process conflicts involved - those might have been the root cause of the whole issue. After the node was up, it rejoined the cluster and shard reallocation started; it was not as slow as previously and definitely used more than one thread.

So far the cluster is green, although there are some metric-collection errors that result in Kibana monitoring graphs being partially incomplete. Still, so far the cluster looks much better now, with room to make additional improvements.

I will be watching it closely and report back after the weekend.

Thank you for your help so far!


Is routing allocation enabled?

curl -XGET "localhost:9200/_cluster/settings?pretty"

Oh, ignore that - I just saw you already did that.

All right, so here's the update:

The cluster was behaving itself during the weekend (all green), which is good. There is, however, something that bugs me regarding stats collection. I keep getting these errors in my logs (not only on elastic-data-0), and Kibana's monitoring graphs are often missing some periods. What could be going on here? I've upgraded the cluster from 6.4.3 to 6.7.1, but observed no improvement:

[2019-04-29T12:34:02,292][ERROR][o.e.x.m.c.i.IndexStatsCollector] [elastic-data-0] collector [index-stats] timed out when collecting data
[2019-04-29T12:34:12,306][ERROR][o.e.x.m.c.n.NodeStatsCollector] [elastic-data-0] collector [node_stats] timed out when collecting data

Two additional questions:

  • In earlier posts we discussed a situation where management threads get somewhat unresponsive and, as a result, "diagnostic" APIs cannot be queried. Wouldn't it be a good idea to separate the two? I mean - no matter how bad the cluster gets, you could always call the API to get some diagnostic info... Currently such problems leave you half-blind.
  • These management threads - is it possible (and wise) to simply increase their count in order to make things a little less congested?
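For what it's worth, the current state of the management pool (size, active threads, queue depth, rejections) can be inspected through the _cat API whenever it does respond - a sketch:

```shell
# One row per node; a persistently full queue points at the congestion discussed above:
curl '10.10.10.40:9200/_cat/thread_pool/management?v&h=node_name,name,active,queue,rejected,size'
```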

This suggests there's still something wrong with your cluster or its underlying infrastructure. They tell us that it's taking more than 10 seconds to collect stats from one or more nodes, and that is unexpected. These stats requests are indistinguishable from other "diagnostic" requests which is why they are on the same threadpool.

Do things like GET _nodes/stats take over 10 seconds when you run them manually?

Can you see what is taking so long by looking at the hot threads output?
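A quick way to measure this from the shell, bypassing Kibana entirely (a sketch, using the node address from earlier in the thread):

```shell
# Time the stats call directly against a node:
time curl -s -o /dev/null '10.10.10.40:9200/_nodes/stats'

# While it runs, capture hot threads from another terminal:
curl '10.10.10.40:9200/_nodes/hot_threads?threads=10' > hot_threads.txt
```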

I understand. Although I still don't like that approach :wink:

What about this?

GET _nodes/stats takes about 20 seconds to complete when queried through Kibana, and about a second less when curled directly to the master node over SSH.
On the other hand, GET /_nodes/hot_threads took about 2 seconds to complete. Anyway, here are the results: ::: {elastic-data-2}{p5ZZL7h0SsO8_2HmyXEw4w}{-_h5vuRwRU6JfftVGBPkRg}{10.10.10.38 - Pastebin.com

You seem to have quite a large number of shards in your cluster. I wonder if this could be a contributing factor.


According to this blog entry, in my scenario (where most indices are small, which causes a lot of small shards to be created) the overhead would be rather big. The question is whether that effect correlates with the problems I'm experiencing.
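To gauge the degree of oversharding, the _cat APIs can give a quick picture - a sketch:

```shell
# Total shard count across the cluster (each line is one shard copy):
curl -s '10.10.10.40:9200/_cat/shards' | wc -l

# Indices sorted by on-disk size, smallest first, to spot tiny-index sprawl:
curl -s '10.10.10.40:9200/_cat/indices?v&h=index,pri,rep,store.size&s=store.size'
```

If many indices turn out to hold only a few MB each, consolidating them (fewer primaries per index, or merging small time-based indices) would reduce the per-shard overhead the blog entry describes.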