Cluster is frequently lost with no master and data nodes removed after upgrade from 7.2.1 to 7.10.2

We are testing an upgrade of Elasticsearch from 7.2.1 to 7.10.2, and we have developed a custom Rollup and Alias management plugin that uses NodeClient. In our 3-node setup, all nodes are master, data, and ingest nodes. We had no issues with the plugin or the cluster on 7.2.1.
But after upgrading to Elasticsearch 7.10.2, when the Rollup plugin's jobs run, we sometimes see searches time out (even with a 60-second timeout), and sometimes index creation fails. Eventually, after running for about 2 hours, the cluster goes red with no master, or one of the nodes drops out of the cluster.
We make a couple of calls using NodeClient and IndicesAdminClient; these calls time out at random.
Sometimes other calls, such as bulk inserts, fail because the index does not exist.
Two of the calls that time out:

```java
GetIndexResponse getIndexResponse = esClient.admin().indices().getIndex(request).actionGet(new TimeValue(60000));

searchSourceBuilder.query(QueryBuilders.rangeQuery(RollupConstants.ROLLUP_TIMESTAMP_FIELD).from(start).to(end)
        .format(RollupConstants.ROLLUP_EPOCH_FORMAT));
SearchResponse searchResponse = esClient.prepareSearch(idx).setSource(searchSourceBuilder)
        .get(new TimeValue(60000));
```
Errors as seen in the logs:
```
[2021-05-02T14:16:39,299][DEBUG][o.e.c.c.LeaderChecker ] [elasticsearch-workernode-2] leader [{elasticsearch-workernode-1}{U-cUiAEcReiR2t_rQdlr7g}{E5hjAPRYSHC4_k-gsCnONg}{192.168.245.81}{192.168.245.81:9300}{dimr}] has failed 3 consecutive checks (limit [cluster.fault_detection.leader_check.retry_count] is 3); last failure was:
org.elasticsearch.transport.RemoteTransportException: [elasticsearch-workernode-1][192.168.245.81:9300][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [{elasticsearch-workernode-2}{RFYupN13R7O858OJTigTZQ}{-Faf5iHxR32loMgKmQRDTg}{192.168.245.16}{192.168.245.16:9300}{dimr}] has been removed from the cluster
    at org.elasticsearch.cluster.coordination.LeaderChecker.handleLeaderCheck(LeaderChecker.java:192) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.coordination.LeaderChecker.lambda$new$0(LeaderChecker.java:113) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:72) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:207) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:107) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:89) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:700) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:142) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:117) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:82) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) [transport-netty4-client-7.10.2.jar:7.10.2]

[2021-05-02T14:16:39,414][DEBUG][o.e.c.c.PreVoteCollector ] [elasticsearch-workernode-2] TransportResponseHandler{PreVoteCollector{state=Tuple [v1=null, v2=PreVoteResponse{currentTerm=1, lastAcceptedTerm=1, lastAcceptedVersion=254}]}, node={elasticsearch-workernode-1}{U-cUiAEcReiR2t_rQdlr7g}{E5hjAPRYSHC4_k-gsCnONg}{192.168.245.81}{192.168.245.81:9300}{dimr}} failed
org.elasticsearch.transport.RemoteTransportException: [elasticsearch-workernode-1][192.168.245.81:9300][internal:cluster/request_pre_vote]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting PreVoteRequest{sourceNode={elasticsearch-workernode-2}{RFYupN13R7O858OJTigTZQ}{-Faf5iHxR32loMgKmQRDTg}{192.168.245.16}{192.168.245.16:9300}{dimr}, currentTerm=1} as there is already a leader
    at org.elasticsearch.cluster.coordination.PreVoteCollector.handlePreVoteRequest(PreVoteCollector.java:135) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.coordination.PreVoteCollector.lambda$new$0(PreVoteCollector.java:74) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:72) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:305) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:743) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.10.2.jar:7.10.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
```

```
[2021-05-02T16:03:37,972][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [elasticsearch-workernode-2] timed out while retrying [indices:admin/get] after failure (timeout [30s])
[2021-05-02T16:03:37,973][WARN ][r.suppressed             ] [elasticsearch-workernode-2] path: /mnr-metadata, params: {ignore_unavailable=false, expand_wildcards=open,closed, allow_no_indices=false, ignore_throttled=false, index=mnr-metadata}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:230) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:335) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:601) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) [elasticsearch-7.10.2.jar:7.10.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
[2021-05-02T16:03:39,485][WARN ][r.suppressed             ] [elasticsearch-workernode-2] path: /_bulk, params: {}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:190) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:590) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:452) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$2.onTimeout(TransportBulkAction.java:624) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:335) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:601) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) [elasticsearch-7.10.2.jar:7.10.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
```

I have checked the cluster settings and could not see any glaring changes.

Can you please help me understand why a couple of API calls made on the IndicesAdminClient of NodeClient from the plugin make the cluster unstable and cause it to go down?

Thanks in advance.

Welcome to our community! :smiley:

Please format your code/logs/config using the </> button, or markdown-style backticks. It makes things easier to read, which helps us help you :slight_smile:

Thank you, Mark. Here is the formatted version of the issue.

Here are the couple of API calls that time out even after 60 seconds:

```java
GetIndexResponse getIndexResponse = esClient.admin().indices().getIndex(request).actionGet(new TimeValue(60000));
searchSourceBuilder.query(QueryBuilders.rangeQuery(RollupConstants.ROLLUP_TIMESTAMP_FIELD).from(start).to(end).format(RollupConstants.ROLLUP_EPOCH_FORMAT));
SearchResponse searchResponse = esClient.prepareSearch(idx).setSource(searchSourceBuilder).get(new TimeValue(60000));
```

The errors seen in the logs are the same as in my first post above.

Can you share your config?
What is the output from the `_cluster/stats?pretty&human` API?

None of the few log messages you have shared so far seem especially relevant, and you seem to have turned on DEBUG logging which is only for expert use and seems to be causing confusion. Would you share some more logs, covering the period in which the problem starts and spanning at least 10 minutes, with the default log configuration instead?

Also please check your posts in the preview window before sending them, they're still pretty unreadable, as they are full of spurious ** markers and strange bold bits and other stuff that makes them harder to understand. Badly-formatted posts tend not to get useful answers, they're just too much effort to be bothered to read.


We found that the issue was caused by the lock service of our plugin. After fixing the lock service, Rollup appears to be working fine. We will continue to monitor, and if any issues arise we will post back with the required info.
Thanks for your help.
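For anyone who lands here with similar symptoms: the thread doesn't show the actual fix, but one general pattern for a plugin-side lock service is to make lock acquisition time-bounded, so a job can never leave a thread blocked indefinitely. Here is a minimal, hypothetical sketch using plain JDK concurrency; the class and method names are illustrative and are not the plugin's real code:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only: a job lock that waits a bounded time for the
// lock instead of blocking indefinitely. If acquisition times out, the
// run is skipped rather than leaving a thread stuck.
public class BoundedJobLock {
    private final ReentrantLock lock = new ReentrantLock();

    /** Runs task only if the lock is acquired within timeoutMillis; returns whether it ran. */
    public boolean runIfAcquired(Runnable task, long timeoutMillis) {
        boolean acquired;
        try {
            acquired = lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
        if (!acquired) {
            return false; // skip this run; don't pile up blocked threads
        }
        try {
            task.run();
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

The design point is simply that a job holding a lock while a long blocking call (such as a 60-second synchronous search) runs can stall every other thread waiting on that lock; bounding the wait keeps the damage contained. Whether this matches the plugin's actual lock service is not shown in the thread.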