Elasticsearch version (bin/elasticsearch --version): 7.4
Plugins installed: none
JVM version (java -version): 11.0.5
OS version (uname -a if on a Unix-like system): AWS EC2 Linux
Description of the problem including expected versus actual behavior: Our cluster keeps going into a yellow state, with nodes reporting "master not discovered". This started happening after we moved to 7.4. We have tried everything, including increasing capacity. We have around 180 data nodes and 3 master nodes.
Provide logs (if relevant):
We are seeing these logs when the cluster goes into the yellow state:
[2019-12-17T18:05:23,358][WARN ][o.e.a.s.TransportClearScrollAction] [query-0-17x.xx.xx.x]Clear SC failed on node[{data-0-172.30.201.95}{IWIzDcbLSIu0JSBBrdM9lw}{WGxWMOG2S86b2ltx2ANgeQ}{}{host=17x.xx.xx rack_id=us-east-1a, ml.machine_memory=64385785856, ml.max_open_jobs=20, xpack.installed=true}]
org.elasticsearch.transport.RemoteTransportException: [data-0-172.xx.xx][[indices:data/read/search[free_context/scroll]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [31961865140/29.7gb], which is larger than the limit of [30386474188/28.2gb], real usage: [31961864696/29.7gb], new bytes reserved: [444/444b], usages [request=0/0b, fielddata=21490892/20.4mb, in_flight_requests=444/444b, accounting=4658750179/4.3gb]
at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:343) ~[elasticsearch-7.4.1.jar:7.4.1]
[2019-12-17T18:07:54,493][WARN ][o.e.c.NodeConnectionsService] [master-0-xx.xx.xx]failed to connect to {data-0-1xx.xx.xx}{lfhAg0FlRae3DVGeavemyQ}{TRxRnfcZTIaxGvXS6NLxFg}}{host=17x.xx.xx rack_id=us-east-1b, ml.machine_memory=73758015488, ml.max_open_jobs=20, xpack.installed=true} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [data-0-172..xxxx] connect_exception
at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:976) ~[elasticsearch-7.4.1.jar:7.4.1]
at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$3(ActionListener.java:161) ~[elasticsearch-7.4.1.jar:7.4.1]
at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-7.4.1.jar:7.4.1]
[2019-12-17T18:05:24,237][WARN ][o.e.c.r.a.AllocationService] [master-0-172.xx.xx.xx]failing shard [failed shard, shard [complete-tagged-2019-12-v2][131], node[IWIzDcbLSIu0JSBBrdM9lw], [R], s[STARTED], a[id=XhQMwYoCTPuqZ5MlXSHRdg], message [failed to perform indices:data/write/bulk[s] on replica [complete-tagged-2019-12-v2][131], node[IWIzDcbLSIu0JSBBrdM9lw], [R], s[STARTED], a[id=XhQMwYoCTPuqZ5MlXSHRdg]], failure [RemoteTransportException[[data-0-172.30.201.95][172.30.201.95:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [30519071200/28.4gb], which is larger than the limit of [30386474188/28.2gb], real usage: [30519024120/28.4gb], new bytes reserved: [47080/45.9kb], usages [request=0/0b, fielddata=21490892/20.4mb, in_flight_requests=47520/46.4kb, accounting=4658686279/4.3gb]]; ], markAsStale [true]]
| Dec 17 12:43:44.236 | i-0a8b8e974fc8db3b6 | elasticsearch | | [2019-12-17T17:43:44,236][WARN ][o.e.c.c.ClusterFormationFailureHelper] [query-0-1xx.xx.xx]master not discovered yet: have discovered [{query-0-172.30.201.187}{y_8CAEETRHmGnl-Vptim-A}{dgnRXCGkRbiTNsWVQyGpaA}{17xx.xx.xx}{1xx.xx.xx:xxx}{il}{rack_id=us-east-1a, ml.machine_memory=133658669056, xpack.installed=true, host=17xx.xx.xx, ml.max_open_jobs=20}, {master-0-1xx.xx.xx}{UJpnhA0oQLWyxeT2B8NiDA}{oBTXtCl0RNuKGIHAFhPMcA}