Cluster won't start after upgrade from 7.13.1 to 7.13.2

Hi,

We upgraded our Elasticsearch cluster yesterday from 7.13.1 to 7.13.2. After upgrading and rebooting all the nodes, the cluster would not start. All the data nodes produce the same error, which looks like this:

[2021-06-18T10:19:26,833][INFO ][o.e.c.c.JoinHelper       ] [escluster-data1] failed to join {escluster-master1}{X4UY1tjkTYiz3DshbmzF8w}{O5IHofsTRF6FEYtFqG6G7g}{172.16.10.121}{172.16.10.121:9300}{ilmr}{ml.machine_memory=16761880576, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=false} with JoinRequest{sourceNode={escluster-data1}{Vb002pvZTYKt13A1zySQKQ}{UcRDc6Y8R521AWboW80AoA}{172.16.10.101}{172.16.10.101:9300}{cdfhilmrstw}{ml.machine_memory=16761880576, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=1073741824}, minimumTerm=14, optionalJoin=Optional.empty}
org.elasticsearch.transport.RemoteTransportException: [escluster-master1][172.16.10.121:9300][internal:cluster/coordination/join]
Caused by: java.lang.IllegalArgumentException: invalid index.write.wait_for_active_shards[2]: cannot be greater than number of shard copies [1]
        at org.elasticsearch.cluster.metadata.IndexMetadata$Builder.build(IndexMetadata.java:1358) ~[elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.cluster.metadata.Metadata$Builder.put(Metadata.java:1125) ~[elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.cluster.metadata.Metadata$Builder.updateNumberOfReplicas(Metadata.java:1337) ~[elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.cluster.routing.allocation.AllocationService.adaptAutoExpandReplicas(AllocationService.java:277) ~[elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.cluster.coordination.JoinTaskExecutor.execute(JoinTaskExecutor.java:193) ~[elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.cluster.coordination.JoinHelper$1.execute(JoinHelper.java:124) ~[elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:691) ~[elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:313) ~[elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:208) ~[elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:62) ~[elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:140) ~[elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:139) ~[elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:177) ~[elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673) ~[elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:241) ~[elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:204) ~[elasticsearch-7.13.2.jar:7.13.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
        at java.lang.Thread.run(Thread.java:831) [?:?]

I have modified the hostnames and IPs in the above error, but the rest is exactly as written in the log files. The main error seems to be

invalid index.write.wait_for_active_shards[2]: cannot be greater than number of shard copies [1]

And I don't know how to resolve that without access to the cluster itself.

We have thousands of hours' worth of dashboards, saved objects, and so on trapped in that cluster, so I am really under pressure right now. Any help will be appreciated.

It looks like there's a problematic interaction between index.write.wait_for_active_shards and index.auto_expand_replicas. Do you have dedicated master nodes? Your node names suggest that you do. If so, you should be able to remove index.write.wait_for_active_shards from all indices first:

PUT _settings
{"index.write.wait_for_active_shards":null}

I opened an issue on GitHub:

Hi @DavidTurner. Yes, we have dedicated master nodes, but I can't access Dev Tools to try the proposed fix because Kibana won't start either.

[2021-06-18T15:07:33,910][INFO ][o.e.x.s.a.AuthenticationService] [escluster-master1] Authentication of [kibana_system] was terminated by realm [reserved] - failed to authenticate user [kibana_system]
[2021-06-18T15:07:36,407][ERROR][o.e.x.s.a.e.ReservedRealm] [escluster-master1] failed to retrieve password hash for reserved user [kibana_system]
org.elasticsearch.action.UnavailableShardsException: at least one primary shard for the index [.security-7] is unavailable
        at org.elasticsearch.xpack.security.support.SecurityIndexManager.getUnavailableReason(SecurityIndexManager.java:148) ~[x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.esnative.NativeUsersStore.getReservedUserInfo(NativeUsersStore.java:492) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.esnative.ReservedRealm.getUserInfo(ReservedRealm.java:220) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.esnative.ReservedRealm.doAuthenticate(ReservedRealm.java:96) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.support.CachingUsernamePasswordRealm.authenticateWithCache(CachingUsernamePasswordRealm.java:188) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.support.CachingUsernamePasswordRealm.authenticate(CachingUsernamePasswordRealm.java:105) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$consumeToken$17(AuthenticationService.java:478) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.core.common.IteratingActionListener.run(IteratingActionListener.java:103) [x-pack-core-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.consumeToken(AuthenticationService.java:533) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$extractToken$13(AuthenticationService.java:445) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.extractToken(AuthenticationService.java:455) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$checkForApiKey$5(AuthenticationService.java:396) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:134) [elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.ApiKeyService.authenticateWithApiKeyIfPresent(ApiKeyService.java:402) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.checkForApiKey(AuthenticationService.java:377) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$checkForBearerToken$3(AuthenticationService.java:361) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:134) [elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.TokenService.tryAuthenticateToken(TokenService.java:393) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.checkForBearerToken(AuthenticationService.java:357) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$authenticateAsync$0(AuthenticationService.java:338) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$lookForExistingAuthentication$8(AuthenticationService.java:414) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lookForExistingAuthentication(AuthenticationService.java:425) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.authenticateAsync(AuthenticationService.java:333) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.access$000(AuthenticationService.java:274) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.AuthenticationService.authenticate(AuthenticationService.java:152) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.authc.AuthenticationService.authenticate(AuthenticationService.java:137) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.xpack.security.rest.SecurityRestFilter.handleRequest(SecurityRestFilter.java:75) [x-pack-security-7.13.2.jar:7.13.2]
        at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:269) [elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.rest.RestController.tryAllHandlers(RestController.java:351) [elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:192) [elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.http.AbstractHttpServerTransport.dispatchRequest(AbstractHttpServerTransport.java:451) [elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.http.AbstractHttpServerTransport.handleIncomingRequest(AbstractHttpServerTransport.java:516) [elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.http.AbstractHttpServerTransport.incomingRequest(AbstractHttpServerTransport.java:378) [elasticsearch-7.13.2.jar:7.13.2]
        at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.channelRead0(Netty4HttpRequestHandler.java:31) [transport-netty4-client-7.13.2.jar:7.13.2]
        at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.channelRead0(Netty4HttpRequestHandler.java:17) [transport-netty4-client-7.13.2.jar:7.13.2]
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at org.elasticsearch.http.netty4.Netty4HttpPipeliningHandler.channelRead(Netty4HttpPipeliningHandler.java:47) [transport-netty4-client-7.13.2.jar:7.13.2]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [netty-codec-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [netty-codec-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [netty-codec-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324) [netty-codec-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296) [netty-codec-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286) [netty-handler-4.1.49.Final.jar:4.1.49.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
        at ...

I am guessing that's because the shards storing the password hashes are not available?

Ah yes, you will need to use the file realm to give yourself access without needing any data nodes in the cluster, and then use curl rather than Dev Tools.
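
As a rough sketch (the username is just an example): on one of the master nodes, add a temporary superuser with the elasticsearch-users tool, which writes to the file realm and is picked up without a restart:

bin/elasticsearch-users useradd recovery_admin -p <password> -r superuser

Then send the settings change with curl, authenticating as that user (adjust the host to one of your nodes):

curl -X PUT -u recovery_admin 'http://localhost:9200/_settings' -H 'Content-Type: application/json' -d '{"index.write.wait_for_active_shards":null}'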

Thanks for your support @DavidTurner. I used the file realm and managed to restore access to the cluster using your proposed fix.

Will I need to revert any change once the original issue is fixed?

I imagine you will want to remove the file realm again, and if you want to use index.write.wait_for_active_shards then you'll need to reinstate that too. But until the bug I opened is fixed, you must manually make sure that index.write.wait_for_active_shards is consistent with index.auto_expand_replicas. For instance, if you want to set index.write.wait_for_active_shards: 2 then you can't have an auto-expand config of 0-??; the lower bound must be at least 1.
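
For example, once the cluster is stable again, a consistent combination would look something like this (the index name is just an example):

PUT my-index/_settings
{
  "index.auto_expand_replicas": "1-all",
  "index.write.wait_for_active_shards": 2
}

With a lower bound of 1 there are always at least two shard copies (the primary plus at least one replica), so waiting for two active shards can always be satisfied.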

Yes, I removed the file realm config after successfully running

curl -X PUT -u <file_realm_user> 'http://<master-node>:9200/_settings' --header 'Content-Type: application/json' --data-raw '{"index.write.wait_for_active_shards":null}'

and then restarted Elasticsearch. This allowed the other nodes to join the cluster and Kibana to start, but new events weren't coming through. So I ran the following:

PUT _settings
{"index.write.wait_for_active_shards":1}

and some events started coming in, but I just realized this keeps popping up on the Discover page:
[screenshot]

and I get warnings like these in the Elasticsearch logs:

[2021-06-18T19:29:25,942][WARN ][o.e.c.InternalClusterInfoService] [escluster-master1] failed to retrieve stats for node [yWWfACjqRV-sI13JHlPADg]: [escluster-data2][172.16.10.102:9300][cluster:monitor/nodes/stats[n]]
[2021-06-18T19:29:25,945][WARN ][o.e.c.InternalClusterInfoService] [escluster-master1] failed to retrieve shard stats from node [yWWfACjqRV-sI13JHlPADg]: [escluster-data2][172.16.10.102:9300][indices:monitor/stats[n]]

Those log messages are only fragments: they say that something went wrong obtaining stats from escluster-data2, but nothing about why. Would you share the full logs?

Has the cluster recovered to green health?
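
You can check with something like this (host and user are placeholders):

curl -u <user> 'http://localhost:9200/_cluster/health?pretty'

The response shows the overall status along with the numbers of initializing and unassigned shards, which is handy for watching the recovery progress.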
