Cluster crash after upgrading Kibana to 6.0

Hello all,

Our cluster was upgraded from 5.6.3 to 6.0.0 and has now crashed and will not recover. I followed the directions and performed a rolling upgrade. Upgrading Kibana to 6.0.0 seems to have triggered some sort of failure: the migration assistant failed to migrate the .kibana index, and since I couldn't figure out why, I deleted it; some time later the cluster began to crash. Below are some cluster stats and the sequence of upgrades:

8 nodes: 6 data nodes, 2 indexer nodes, 3 master-eligible nodes. 64GB RAM per node, with min/max heap both set to 31GB. X-Pack installed with the free license. 2 instances of Kibana, each running on an indexer node.
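To double-check that the topology and heap settings survived the upgrade, the _cat/nodes API can list roles and heap limits. A minimal sketch, assuming the HTTP endpoint is reachable on localhost:9200:

```shell
# List node name, roles (m=master-eligible, d=data, i=ingest),
# configured max heap, and total RAM per node.
curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,heap.max,ram.max'
```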

2017-11-25: Rolling upgrade of all nodes.
2017-11-27: Upgrade of Kibana nodes.

20:29 UTC: "[node1] failed to execute on node [_-9823d-SD...] (..) exception (..) [node2] node not connected"

2017-11-27: At 23:55 UTC node6 crashed and was removed from the cluster. The process was still running and would not terminate without "kill -9". On restart, node6 reported a timeout while discovering the master, despite all cluster services running.

At this point I suspected a problem with the Debian upgrade procedure, so I shut down the entire cluster and reinstalled Elasticsearch on all data nodes. The cluster has now been stuck at 15% shard recovery for over 12 hours.
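To tell whether recovery is actually progressing or genuinely stalled, the cluster health, recovery, and allocation-explain APIs are worth checking. A sketch, assuming localhost:9200:

```shell
# Overall cluster state and count of unassigned shards
curl -s 'localhost:9200/_cluster/health?pretty'

# Only the recoveries currently running, with progress percentages
curl -s 'localhost:9200/_cat/recovery?v&active_only=true'

# Ask the cluster why an unassigned shard is not being allocated
curl -s 'localhost:9200/_cluster/allocation/explain?pretty'
```

If the active-recovery list is empty while health stays red, the problem is allocation (or a missing master), not slow copying.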

Thanks in advance for the help.

I ran the migration assistant prior to the upgrade and it only warned about index-compatibility issues, such as use of the _all field. The 6.0 documentation says old indices are backwards compatible, so I don't believe this has anything to do with the failure.

Unfortunately, when I reinstalled Elasticsearch, the log directory was deleted by the Debian package manager. I only have old logs from the two indexer/Kibana nodes and new logs from the data nodes.

Cluster discovery fails for some reason, even though all nodes are reachable and Elasticsearch is running on each of them.

failed to send join request to master [node3]... reason ... [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]


[2017-11-28T11:18:34,870][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [node3] failed to execute [indices:monitor/stats] on node [....lCtR12bI7...vWyeg]
org.elasticsearch.transport.NodeNotConnectedException: [node6][] Node not connected


[2017-11-28T11:17:40,580][INFO ][o.e.d.z.ZenDiscovery ] [index_node1] failed to send join request to master [{node3}{HIwh.....7.eZ...Vvlort-w}{gJLrV....aoq72q5vbGfA}{}{}{rack=rackA6-1}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
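When join requests time out like this, it's worth confirming that every node agrees on who the master is, and that the discovery settings survived the package reinstall. A sketch (the hostnames come from the logs above; the config path is the Debian default and an assumption):

```shell
# Which node each endpoint believes is the elected master
curl -s 'node3:9200/_cat/master?v'
curl -s 'index_node1:9200/_cat/master?v'

# Discovery settings a package reinstall may have reset to defaults
grep -E 'discovery\.zen\.(ping\.unicast\.hosts|minimum_master_nodes)' \
  /etc/elasticsearch/elasticsearch.yml
```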

Is a downgrade to 5.6.3 possible? I'm OK with losing recent data; I'm more concerned about operational status.

Downgrading did not work. Our cluster is dead.

Do you have full stack traces somewhere, or complete logs? Ideally ones that include a node starting up, so we can see what happens next. What about the master node logs?

Does the cluster form at all, or does it become unstable once you use Kibana with it? Or does it become unstable once recovery has started?

Have you checked the pending tasks? Hot threads? Node stats?
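For reference, those three can be pulled like this (endpoint assumed to be localhost:9200):

```shell
# Cluster-state update tasks queued on the master
curl -s 'localhost:9200/_cluster/pending_tasks?pretty'

# What the busiest threads on each node are doing right now
curl -s 'localhost:9200/_nodes/hot_threads'

# Per-node JVM and thread-pool stats (queues, rejections, heap use)
curl -s 'localhost:9200/_nodes/stats/jvm,thread_pool?pretty'
```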

What exactly happened to node6? Did it respond to anything, like an HTTP request, before you killed it? Any log messages? Any dmesg output?
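A sketch of what's worth capturing the next time a node wedges like node6 did, before killing the process (run on the affected node; the pgrep pattern is an assumption about how the JVM was launched):

```shell
# Does the node still answer HTTP at all? (5-second timeout)
curl -s -m 5 'localhost:9200/' || echo "no HTTP response"

# Kernel messages: look for the OOM killer or disk errors
dmesg -T | grep -Ei 'out of memory|killed process|i/o error' | tail

# Thread dump of the hung JVM, for later analysis
jstack "$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)" > /tmp/es-threads.txt
```

An OOM-killed or I/O-blocked process would explain both the hang and the need for "kill -9".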

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.