We are testing an upgrade of Elasticsearch from 7.2.1 to 7.10.2. We have developed a custom Rollup and Alias management plugin that uses the NodeClient. In our 3-node setup, all nodes are master, data, and ingest nodes. We had no issues with the plugin or the cluster on 7.2.1.

After upgrading to Elasticsearch 7.10.2, when the Rollup plugin's jobs run we sometimes see searches time out (even with a 60-second timeout) and sometimes index creation fails. Eventually, after running for about 2 hours, the cluster goes red with no master, or one of the nodes drops out of the cluster.

Using the NodeClient and the IndicesAdminClient we make a couple of calls, and these calls time out at random. Sometimes other calls, such as bulk inserts, fail because the index does not exist (sketched below, after the timeout examples).
Timeout:

```java
GetIndexResponse getIndexResponse = esClient.admin().indices().getIndex(request)
        .actionGet(new TimeValue(60000));
```

Timeout:

```java
searchSourceBuilder.query(QueryBuilders.rangeQuery(RollupConstants.ROLLUP_TIMESTAMP_FIELD)
        .from(start).to(end)
        .format(RollupConstants.ROLLUP_EPOCH_FORMAT));
SearchResponse searchResponse = esClient.prepareSearch(idx).setSource(searchSourceBuilder)
        .get(new TimeValue(60000));
```
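For context, the other calls the jobs make follow the same pattern. Below is a rough sketch of the index-creation and bulk-insert calls, not the actual plugin code: `rollupIndexName`, `rollupDocJson`, and `logger` are placeholder names.

```java
import org.elasticsearch.action.admin.indices.create.CreateIndexResponse;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.common.xcontent.XContentType;

// Create the rollup target index via the IndicesAdminClient
// (this is the call that intermittently fails after the upgrade)
CreateIndexResponse createResponse = esClient.admin().indices()
        .prepareCreate(rollupIndexName)
        .get(new TimeValue(60000));

// Bulk insert of rollup documents via the NodeClient
// (occasionally fails with "index does not exist")
BulkResponse bulkResponse = esClient.prepareBulk()
        .add(new IndexRequest(rollupIndexName).source(rollupDocJson, XContentType.JSON))
        .get(new TimeValue(60000));
if (bulkResponse.hasFailures()) {
    logger.warn("bulk insert failed: {}", bulkResponse.buildFailureMessage());
}
```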
Errors as seen:

```
[2021-05-02T14:16:39,299][DEBUG][o.e.c.c.LeaderChecker    ] [elasticsearch-workernode-2] leader [{elasticsearch-workernode-1}{U-cUiAEcReiR2t_rQdlr7g}{E5hjAPRYSHC4_k-gsCnONg}{192.168.245.81}{192.168.245.81:9300}{dimr}] has failed 3 consecutive checks (limit [cluster.fault_detection.leader_check.retry_count] is 3); last failure was:
org.elasticsearch.transport.RemoteTransportException: [elasticsearch-workernode-1][192.168.245.81:9300][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [{elasticsearch-workernode-2}{RFYupN13R7O858OJTigTZQ}{-Faf5iHxR32loMgKmQRDTg}{192.168.245.16}{192.168.245.16:9300}{dimr}] has been removed from the cluster
    at org.elasticsearch.cluster.coordination.LeaderChecker.handleLeaderCheck(LeaderChecker.java:192) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.coordination.LeaderChecker.lambda$new$0(LeaderChecker.java:113) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:72) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:207) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:107) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:89) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:700) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:142) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:117) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:82) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) [transport-netty4-client-7.10.2.jar:7.10.2]
[2021-05-02T14:16:39,414][DEBUG][o.e.c.c.PreVoteCollector ] [elasticsearch-workernode-2] TransportResponseHandler{PreVoteCollector{state=Tuple [v1=null, v2=PreVoteResponse{currentTerm=1, lastAcceptedTerm=1, lastAcceptedVersion=254}]}, node={elasticsearch-workernode-1}{U-cUiAEcReiR2t_rQdlr7g}{E5hjAPRYSHC4_k-gsCnONg}{192.168.245.81}{192.168.245.81:9300}{dimr}} failed
org.elasticsearch.transport.RemoteTransportException: [elasticsearch-workernode-1][192.168.245.81:9300][internal:cluster/request_pre_vote]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting PreVoteRequest{sourceNode={elasticsearch-workernode-2}{RFYupN13R7O858OJTigTZQ}{-Faf5iHxR32loMgKmQRDTg}{192.168.245.16}{192.168.245.16:9300}{dimr}, currentTerm=1} as there is already a leader
    at org.elasticsearch.cluster.coordination.PreVoteCollector.handlePreVoteRequest(PreVoteCollector.java:135) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.coordination.PreVoteCollector.lambda$new$0(PreVoteCollector.java:74) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:72) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:305) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:743) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.10.2.jar:7.10.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
[2021-05-02T16:03:37,972][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [elasticsearch-workernode-2] timed out while retrying [indices:admin/get] after failure (timeout [30s])
[2021-05-02T16:03:37,973][WARN ][r.suppressed             ] [elasticsearch-workernode-2] path: /mnr-metadata, params: {ignore_unavailable=false, expand_wildcards=open,closed, allow_no_indices=false, ignore_throttled=false, index=mnr-metadata}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:230) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:335) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:601) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) [elasticsearch-7.10.2.jar:7.10.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
[2021-05-02T16:03:39,485][WARN ][r.suppressed             ] [elasticsearch-workernode-2] path: /_bulk, params: {}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:190) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:590) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:452) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$2.onTimeout(TransportBulkAction.java:624) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:335) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:601) [elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) [elasticsearch-7.10.2.jar:7.10.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
```
I have checked the cluster configuration and settings and could not see any glaring changes.
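For reference, the same kind of check can also be run from the plugin itself through the NodeClient. A minimal sketch (not the actual plugin code; `logger` is a placeholder):

```java
import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.common.unit.TimeValue;

// Cluster health as seen by this node
ClusterHealthResponse health = esClient.admin().cluster().prepareHealth()
        .get(new TimeValue(60000));
logger.info("cluster status={}, nodes={}", health.getStatus(), health.getNumberOfNodes());

// Persistent/transient cluster settings (e.g. any fault_detection overrides)
ClusterState state = esClient.admin().cluster().prepareState()
        .get(new TimeValue(60000)).getState();
logger.info("persistent settings: {}", state.metadata().persistentSettings());
logger.info("transient settings: {}", state.metadata().transientSettings());
```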
Can you please let me know why a couple of API calls made from the plugin via the NodeClient's IndicesAdminClient make the cluster unstable and eventually cause it to go down?

Thanks in advance.