Elastic Rolling upgrade 6.6.0 to 6.8

Today, I tried to upgrade my 5-node cluster from 6.6.0 to 6.8.
I planned to apply the rolling upgrade procedure I found here:
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/rolling-upgrades.html

I then disabled shard allocation and performed a synced flush.
But after stopping the first node, my cluster's status turned RED, not yellow.
To avoid any problems, I restarted the stopped node.
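
For reference, the preparation steps were roughly the calls below (a sketch from memory; the guide may use a different value for the allocation setting, e.g. "primaries" instead of "none"):

# disable shard allocation before shutting a node down
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "none"
  }
}

# perform a synced flush so shard recovery is faster after the restart
POST _flush/synced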

After checking the logs on all my nodes, I found nothing special except a strange error message on one of them.

Right after reporting that the primary-replica resync for the index shard had completed, the node logged a "global checkpoint sync failed" message. See below.

Any idea what happened?

Regards,

Jean-Marc

=====
[2019-11-07T09:13:36,774][INFO ][o.e.i.s.IndexShard ] [el8023.bc] [workflow-job-2019.10.08][3] primary-replica resync completed with 0 operations
[...]
[2019-11-07T09:13:36,783][INFO ][o.e.i.s.GlobalCheckpointSyncAction] [el8023.bc] [workflow-job-2019.10.08][3] global checkpoint sync failed
org.elasticsearch.transport.RemoteTransportException: [el8024.bc][10.120.120.37:9300][indices:admin/seq_no/global_checkpoint_sync]
Caused by: org.elasticsearch.transport.SendRequestTransportException: [el8024.bc][10.120.120.37:9300][indices:admin/seq_no/global_checkpoint_sync[p]]
at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:639) ~[elasticsearch-6.6.0.jar:6.6.0]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$1.sendRequest(SecurityServerTransportInterceptor.java:136) ~[?:?]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:542) ~[elasticsearch-6.6.0.jar:6.6.0]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:530) ~[elasticsearch-6.6.0.jar:6.6.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.performAction(TransportReplicationAction.java:873) ~[elasticsearch-6.6.0.jar:6.6.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.performLocalAction(TransportReplicationAction.java:824) ~[elasticsearch-6.6.0.jar:6.6.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:811) ~[elasticsearch-6.6.0.jar:6.6.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.6.0.jar:6.6.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction.doExecute(TransportReplicationAction.java:172) ~[elasticsearch-6.6.0.jar:6.6.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction.doExecute(TransportReplicationAction.java:100) ~[elasticsearch-6.6.0.jar:6.6.0]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:167) ~[elasticsearch-6.6.0.jar:6.6.0]
at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.apply(SecurityActionFilter.java:124) ~[?:?]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165) ~[elasticsearch-6.6.0.jar:6.6.0]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:139) ~[elasticsearch-6.6.0.jar:6.6.0]

I stopped the same node again to see whether GET _cluster/allocation/explain would give me more information.

The output mentions an allocation problem because the nodes are throttled: apparently the cluster reached the maximum number of concurrent incoming recoveries.
Could that be the reason why the cluster turned RED?

Regards,

JM

=====
"index" : "icfield-2019.10.07",
"shard" : 2,
"primary" : false,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "NODE_LEFT",
"at" : "2019-11-07T13:22:36.485Z",
"details" : "node_left[xxLNmYBhS5uIFVOI2Ushaw]",
"last_allocation_status" : "no_attempt"
},
"can_allocate" : "throttled",
"allocate_explanation" : "allocation temporarily throttled",
"node_allocation_decisions" : [
{
"node_id" : "5h1Go2rBQCiJ0boDotbYEA",
"node_name" : "el7610.bc",
"transport_address" : "10.120.111.28:9300",
"node_attributes" : {
"ml.machine_memory" : "16656986112",
"ml.max_open_jobs" : "20",
"xpack.installed" : "true",
"ml.enabled" : "true"
},
"node_decision" : "throttled",
"deciders" : [
{
"decider" : "throttling",
"decision" : "THROTTLE",
"explanation" : "reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
}
]
}
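
Side note: the limit mentioned in that explanation is just the default of 2 concurrent incoming recoveries per node. I did not change it, but if needed it could be raised temporarily with something like the call below (the value 4 is only an example) and reset once the cluster has recovered:

# example only: raise the per-node recovery throttle from the default of 2
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 4
  }
}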

If your cluster health goes red when you shut down a single node then the most likely explanation is that you have some indices with number_of_replicas: 0. Every index must have replicas in order to tolerate a node being shut down.
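
One quick way to spot such indices is something like the request below, which lists each index with its primary and replica shard counts:

# rep = number of replica shards per primary; 0 means no redundancy
GET _cat/indices?v&h=index,health,pri,rep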

Thank you so much, David, for pinpointing this.

It was the kibana_sample_data_logs index, which was sitting in the cluster without a replica.
I think it is a sample data set I had left in the cluster from the start.
I removed it, stopped the same node again, and my cluster health switched to yellow.
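
Since it was only sample data, deleting it was the simplest fix; the alternative would have been to keep it and give it a replica. Roughly:

# delete the sample index ...
DELETE kibana_sample_data_logs

# ... or keep it and add a replica instead
PUT kibana_sample_data_logs/_settings
{
  "index" : { "number_of_replicas" : 1 }
}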

Thank you once again.

Regards,

JM

