Cluster yellow with "master not discovered" and other errors after 7.4 upgrade

Elasticsearch version (bin/elasticsearch --version): 7.4.1

Plugins installed: none

JVM version (java -version): 11.0.5

OS version (uname -a if on a Unix-like system): AWS EC2 Linux

Description of the problem including expected versus actual behavior: Our cluster keeps going into a yellow state with "master not discovered" errors. This started happening after we moved to 7.4. We have tried everything, including increasing capacity; we have around 180 data nodes and 3 dedicated master nodes.
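
For reference, a minimal set of checks to see cluster status and per-node heap pressure while this is happening (assuming curl access to any node on the default port 9200; adjust host/port as needed):

curl -s 'localhost:9200/_cluster/health?pretty'
curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,master,heap.percent,heap.max,ram.percent'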

Provide logs (if relevant):

We are seeing these logs when the cluster goes into yellow:

[2019-12-17T18:05:23,358][WARN ][o.e.a.s.TransportClearScrollAction] [query-0-17x.xx.xx.x]Clear SC failed on node[{data-0-172.30.201.95}{IWIzDcbLSIu0JSBBrdM9lw}{WGxWMOG2S86b2ltx2ANgeQ}{}{host=17x.xx.xx rack_id=us-east-1a, ml.machine_memory=64385785856, ml.max_open_jobs=20, xpack.installed=true}]
org.elasticsearch.transport.RemoteTransportException: [data-0-172.xx.xx][[indices:data/read/search[free_context/scroll]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [31961865140/29.7gb], which is larger than the limit of [30386474188/28.2gb], real usage: [31961864696/29.7gb], new bytes reserved: [444/444b], usages [request=0/0b, fielddata=21490892/20.4mb, in_flight_requests=444/444b, accounting=4658750179/4.3gb]
at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:343) ~[elasticsearch-7.4.1.jar:7.4.1]
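
The "[parent] Data too large" errors are the parent circuit breaker tripping. On 7.x it is based on real heap usage by default, with a limit of 95% of the heap (the 28.2gb here), so these data nodes are effectively running out of heap. A minimal way to inspect per-node breaker state, again assuming curl access on port 9200:

curl -s 'localhost:9200/_nodes/stats/breaker?pretty'
# indices.breaker.total.use_real_memory: false in elasticsearch.yml disables real-memory accounting,
# but it is a static setting (requires a node restart) and only masks the underlying heap pressure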

[2019-12-17T18:07:54,493][WARN ][o.e.c.NodeConnectionsService] [master-0-xx.xx.xx]failed to connect to {data-0-1xx.xx.xx}{lfhAg0FlRae3DVGeavemyQ}{TRxRnfcZTIaxGvXS6NLxFg}}{host=17x.xx.xx rack_id=us-east-1b, ml.machine_memory=73758015488, ml.max_open_jobs=20, xpack.installed=true} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [data-0-172..xxxx] connect_exception
at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:976) ~[elasticsearch-7.4.1.jar:7.4.1]
at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$3(ActionListener.java:161) ~[elasticsearch-7.4.1.jar:7.4.1]
at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-7.4.1.jar:7.4.1]

[2019-12-17T18:05:24,237][WARN ][o.e.c.r.a.AllocationService] [master-0-172.xx.xx.xx]failing shard [failed shard, shard [complete-tagged-2019-12-v2][131], node[IWIzDcbLSIu0JSBBrdM9lw], [R], s[STARTED], a[id=XhQMwYoCTPuqZ5MlXSHRdg], message [failed to perform indices:data/write/bulk[s] on replica [complete-tagged-2019-12-v2][131], node[IWIzDcbLSIu0JSBBrdM9lw], [R], s[STARTED], a[id=XhQMwYoCTPuqZ5MlXSHRdg]], failure [RemoteTransportException[[data-0-172.30.201.95][172.30.201.95:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [30519071200/28.4gb], which is larger than the limit of [30386474188/28.2gb], real usage: [30519024120/28.4gb], new bytes reserved: [47080/45.9kb], usages [request=0/0b, fielddata=21490892/20.4mb, in_flight_requests=47520/46.4kb, accounting=4658686279/4.3gb]]; ], markAsStale [true]]
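
Once the breaker rejects the replica write, the master fails that shard copy and the cluster drops to yellow until the replica is re-allocated. Assuming the heap pressure has eased, these are the standard calls to see why a shard is still unassigned and to retry allocations that have exhausted their retries:

curl -s 'localhost:9200/_cluster/allocation/explain?pretty'
curl -s -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true'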

[2019-12-17T17:43:44,236][WARN ][o.e.c.c.ClusterFormationFailureHelper] [query-0-1xx.xx.xx]master not discovered yet: have discovered [{query-0-172.30.201.187}{y_8CAEETRHmGnl-Vptim-A}{dgnRXCGkRbiTNsWVQyGpaA}{17xx.xx.xx}{1xx.xx.xx:xxx}{il}{rack_id=us-east-1a, ml.machine_memory=133658669056, xpack.installed=true, host=17xx.xx.xx, ml.max_open_jobs=20}, {master-0-1xx.xx.xx}{UJpnhA0oQLWyxeT2B8NiDA}{oBTXtCl0RNuKGIHAFhPMcA}
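
For the "master not discovered yet" warnings, a quick sanity check is to confirm which node the rest of the cluster currently sees as the elected master and whether that master is backed up on cluster-state tasks; on 7.x the coordinating and data nodes also need discovery.seed_hosts pointing at the three dedicated masters (hostnames below are placeholders):

curl -s 'localhost:9200/_cat/master?v'
curl -s 'localhost:9200/_cat/pending_tasks?v'
# elasticsearch.yml on every node, placeholder hostnames:
# discovery.seed_hosts: ["master-0", "master-1", "master-2"]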
