Cluster is frequently lost with no master and data nodes removed after upgrade from 7.2.1 to 7.10.2

We are testing an upgrade of Elasticsearch from 7.2.1 to 7.10.2, and we have developed a custom Rollup and Alias management plugin that uses NodeClient. In our 3-node setup, all nodes are master, data, and ingest nodes. We had no issues with the plugin or the cluster on 7.2.1.
But after upgrading to Elasticsearch 7.10.2, when the Rollup plugin's jobs run, we sometimes see searches time out (even with a 60-second timeout), and sometimes index creation fails. Eventually, after running for about 2 hours, the cluster goes red with no master, or one of the nodes drops out of the cluster.
We make a couple of calls using NodeClient and IndicesAdminClient; these calls time out at random.
Sometimes other calls, such as bulk inserts, fail because the index does not exist.
Two of the calls that time out:

```java
GetIndexResponse getIndexResponse = esClient.admin().indices().getIndex(request).actionGet(new TimeValue(60000));

searchSourceBuilder.query(QueryBuilders.rangeQuery(RollupConstants.ROLLUP_TIMESTAMP_FIELD).from(start).to(end)
        .format(RollupConstants.ROLLUP_EPOCH_FORMAT));
SearchResponse searchResponse = esClient.prepareSearch(idx).setSource(searchSourceBuilder)
        .get(new TimeValue(60000));
```
Errors as seen in the logs:
```
[2021-05-02T14:16:39,299][DEBUG][o.e.c.c.LeaderChecker ] [elasticsearch-workernode-2] leader [{elasticsearch-workernode-1}{U-cUiAEcReiR2t_rQdlr7g}{E5hjAPRYSHC4_k-gsCnONg}{192.168.245.81}{192.168.245.81:9300}{dimr}] has failed 3 consecutive checks (limit [cluster.fault_detection.leader_check.retry_count] is 3); last failure was:
org.elasticsearch.transport.RemoteTransportException: [elasticsearch-workernode-1][192.168.245.81:9300][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [{elasticsearch-workernode-2}{RFYupN13R7O858OJTigTZQ}{-Faf5iHxR32loMgKmQRDTg}{192.168.245.16}{192.168.245.16:9300}{dimr}] has been removed from the cluster
    at org.elasticsearch.cluster.coordination.LeaderChecker.handleLeaderCheck(LeaderChecker.java:192) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.coordination.LeaderChecker.lambda$new$0(LeaderChecker.java:113) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:72) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:207) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:107) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:89) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:700) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:142) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:117) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:82) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) [transport-netty4-client-7.10.2.jar:7.10.2]

[2021-05-02T14:16:39,414][DEBUG][o.e.c.c.PreVoteCollector ] [elasticsearch-workernode-2] TransportResponseHandler{PreVoteCollector{state=Tuple [v1=null, v2=PreVoteResponse{currentTerm=1, lastAcceptedTerm=1, lastAcceptedVersion=254}]}, node={elasticsearch-workernode-1}{U-cUiAEcReiR2t_rQdlr7g}{E5hjAPRYSHC4_k-gsCnONg}{192.168.245.81}{192.168.245.81:9300}{dimr}} failed
org.elasticsearch.transport.RemoteTransportException: [elasticsearch-workernode-1][192.168.245.81:9300][internal:cluster/request_pre_vote]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting PreVoteRequest{sourceNode={elasticsearch-workernode-2}{RFYupN13R7O858OJTigTZQ}{-Faf5iHxR32loMgKmQRDTg}{192.168.245.16}{192.168.245.16:9300}{dimr}, currentTerm=1} as there is already a leader
    at org.elasticsearch.cluster.coordination.PreVoteCollector.handlePreVoteRequest(PreVoteCollector.java:135) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.coordination.PreVoteCollector.lambda$new$0(PreVoteCollector.java:74) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:72) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:305) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:743) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.10.2.jar:7.10.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
```

```
[2021-05-02T16:03:37,972][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [elasticsearch-workernode-2] timed out while retrying [indices:admin/get] after failure (timeout [30s])
[2021-05-02T16:03:37,973][WARN ][r.suppressed             ] [elasticsearch-workernode-2] path: /mnr-metadata, params: {ignore_unavailable=false, expand_wildcards=open,closed, allow_no_indices=false, ignore_throttled=false, index=mnr-metadata}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:230) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:335) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:601) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) [elasticsearch-7.10.2.jar:7.10.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
[2021-05-02T16:03:39,485][WARN ][r.suppressed             ] [elasticsearch-workernode-2] path: /_bulk, params: {}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:190) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:590) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:452) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$2.onTimeout(TransportBulkAction.java:624) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:335) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:601) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) [elasticsearch-7.10.2.jar:7.10.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
```

I have checked the cluster settings and could not see any glaring changes.

Can you please help me understand why a couple of API calls made on the IndicesAdminClient of NodeClient from the plugin make the cluster unstable and cause it to go down?

Thanks in advance.

Welcome to our community! :smiley:

Please format your code/logs/config using the </> button, or markdown-style backticks. It makes things easier to read, which helps us help you :slight_smile:

Thank you, Mark. Here is the formatted version of the issue.

Here are the couple of API calls that time out even after 60 seconds:

```java
GetIndexResponse getIndexResponse = esClient.admin().indices().getIndex(request).actionGet(new TimeValue(60000));
searchSourceBuilder.query(QueryBuilders.rangeQuery(RollupConstants.ROLLUP_TIMESTAMP_FIELD).from(start).to(end).format(RollupConstants.ROLLUP_EPOCH_FORMAT));
SearchResponse searchResponse = esClient.prepareSearch(idx).setSource(searchSourceBuilder).get(new TimeValue(60000));
```

The errors seen in the logs are the same as in my first post above.

Can you share your config?
What is the output from the `_cluster/stats?pretty&human` API?

None of the few log messages you have shared so far seem especially relevant, and you seem to have turned on DEBUG logging which is only for expert use and seems to be causing confusion. Would you share some more logs, covering the period in which the problem starts and spanning at least 10 minutes, with the default log configuration instead?

Also please check your posts in the preview window before sending them, they're still pretty unreadable, as they are full of spurious ** markers and strange bold bits and other stuff that makes them harder to understand. Badly-formatted posts tend not to get useful answers, they're just too much effort to be bothered to read.


We found that the issue was caused by the lock service of our plugin. After fixing the lock service, Rollup appears to be working fine. We will continue to monitor, and if any issues arise we will post back with the required info.
Thanks for your help.
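For anyone who lands here with similar symptoms: the thread doesn't show the actual fix, but one general pattern for a plugin-side lock service is to make lock acquisition time-bounded, so a job can never leave a thread blocked indefinitely. Here is a minimal, hypothetical sketch using plain JDK concurrency; the class and method names are illustrative and are not the plugin's real code:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only: a job lock that waits a bounded time for the
// lock instead of blocking indefinitely. If acquisition times out, the
// run is skipped rather than leaving a thread stuck.
public class BoundedJobLock {
    private final ReentrantLock lock = new ReentrantLock();

    /** Runs task only if the lock is acquired within timeoutMillis; returns whether it ran. */
    public boolean runIfAcquired(Runnable task, long timeoutMillis) {
        boolean acquired;
        try {
            acquired = lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
        if (!acquired) {
            return false; // skip this run; don't pile up blocked threads
        }
        try {
            task.run();
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

The design point is simply that a job holding a lock while a long blocking call (such as a 60-second synchronous search) runs can stall every other thread waiting on that lock; bounding the wait keeps the damage contained. Whether this matches the plugin's actual lock service is not shown in the thread.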