We recently migrated from ES 6.5.2 to 7.6.2 and have been seeing cluster instability since the upgrade. We see the following behavior from the Elasticsearch cluster:
Nodes leave the cluster and rejoin on their own after some time. Elasticsearch is up and running on the node/server, but it loses its connection to the cluster, resulting in a NodeNotConnectedException:
org.elasticsearch.transport.NodeNotConnectedException: xxxxxxxxxxx Node not connected at
CPU utilization reaches 100% and we see very high traffic (45mb/s) being exchanged between nodes.
Elasticsearch takes a long time to respond or starts throwing the exception below:
Data too large, data for [<transport_request>] would be [12601975324/11.7gb], which is larger than the limit of [12240656793/11.3gb], real usage: [12601974736/11.7gb], new bytes reserved: [588/588b], usages [request=0/0b, fielddata=2593993/2.4mb, in_flight_requests=588/588b, accounting=17384984/16.5mb]
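To make the numbers in that message easier to read, here is a minimal Python sketch that pulls out the byte counts. It assumes the heap is exactly 12 GiB (our configured value) and that the regular expressions match the message format shown above; `parse_breaker_message` is just a helper name made up for this post. If I read it correctly, the limit is 95% of the heap, and the usages tracked by the individual breakers add up to only about 19 MB, so nearly all of the 11.7gb is real heap usage rather than request or fielddata memory.

```python
import re

# The exception text from above, copied verbatim.
MESSAGE = (
    "Data too large, data for [<transport_request>] would be "
    "[12601975324/11.7gb], which is larger than the limit of "
    "[12240656793/11.3gb], real usage: [12601974736/11.7gb], "
    "new bytes reserved: [588/588b], usages [request=0/0b, "
    "fielddata=2593993/2.4mb, in_flight_requests=588/588b, "
    "accounting=17384984/16.5mb]"
)


def parse_breaker_message(message):
    """Return (limit, real_usage, per-breaker usages) in bytes."""
    limit = int(re.search(r"limit of \[(\d+)/", message).group(1))
    real_usage = int(re.search(r"real usage: \[(\d+)/", message).group(1))
    # Per-breaker entries look like "name=bytes/human-readable".
    usages = {name: int(value)
              for name, value in re.findall(r"(\w+)=(\d+)/", message)}
    return limit, real_usage, usages


limit, real_usage, usages = parse_breaker_message(MESSAGE)

heap_bytes = 12 * 1024 ** 3          # assumption: heap is exactly 12 GiB
tracked = sum(usages.values())       # memory the child breakers account for
over_limit = real_usage - limit      # how far real usage exceeds the limit

print(f"limit as fraction of heap: {limit / heap_bytes:.2f}")
print(f"tracked by child breakers: {tracked / 1024 ** 2:.1f} MiB")
print(f"real usage over the limit: {over_limit / 1024 ** 2:.1f} MiB")
```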
The cluster is running with:
- 4 servers/nodes (AWS EC2), each with a 4-core CPU, 16 GB of memory, and a 12 GB heap.
- 25 indices, with 5 shards per index and 1 replica.
- Data is 5 million docs with nested fields. Size is around 400 GB (including replicas).
- Elasticsearch is running on G1 GC.
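For reference, the per-node figures implied by that setup can be worked out with a short Python sketch. It assumes shards and data are spread evenly across the 4 nodes, which may not hold exactly in practice. One thing it makes visible is that the heap takes 75% of each server's memory, which is above the commonly cited guidance of giving the heap no more than about half of RAM so the rest is left for the filesystem cache.

```python
# Figures taken from the cluster description above.
nodes = 4
indices = 25
primaries_per_index = 5
replicas = 1
total_size_gb = 400      # on-disk size, including replicas
ram_gb = 16
heap_gb = 12

# Each primary shard has `replicas` copies, so count primaries and replicas.
total_shards = indices * primaries_per_index * (1 + replicas)

# Assumption: shards and data are balanced evenly across nodes.
shards_per_node = total_shards / nodes
avg_shard_gb = total_size_gb / total_shards
data_per_node_gb = total_size_gb / nodes

heap_fraction = heap_gb / ram_gb        # share of RAM given to the JVM heap
off_heap_gb = ram_gb - heap_gb          # left for the OS and filesystem cache

print(f"total shards:        {total_shards}")
print(f"shards per node:     {shards_per_node}")
print(f"average shard size:  {avg_shard_gb} GB")
print(f"data per node:       {data_per_node_gb} GB")
print(f"heap share of RAM:   {heap_fraction:.0%}")
print(f"memory outside heap: {off_heap_gb} GB")
```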
We are in the process of scaling up the servers by doubling CPU and memory and limiting the heap to 12 or 16 GB, but we are not sure this will resolve the issue, since the same design and server configuration worked fine with Elasticsearch 6.5.2.
Please let us know if there is any issue with our setup.