Hi - since yesterday we have seen a couple of occasions where some of our Elasticsearch nodes crash with a java.lang.Exception: java.lang.OutOfMemoryError: Java heap space error.
One time it happened on the client node and an ingest node, and the other time on a couple of warm data nodes and the client node. We opened a case with Elastic earlier; they had suggested a few things in the past that we have already implemented, but we still see the issue happening again.
So I'm reaching out here for any additional insights to pinpoint the exact root cause of the issue. The only logs we see when this happens are the ones below:
at org.elasticsearch.ExceptionsHelper.lambda$maybeDieOnAnotherThread$4(ExceptionsHelper.java:300)
at java.base/java.util.Optional.ifPresent(Optional.java:176)
at org.elasticsearch.ExceptionsHelper.maybeDieOnAnotherThread(ExceptionsHelper.java:290)
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:70)
....
[2021-03-05T11:49:00,152][WARN ][o.e.t.TcpTransport ] [es-ingest03] exception caught on transport layer [Netty4TcpChannel{localAddress=/xxxx, remoteAddress=I***}], closing connection
java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:75) [transport-netty4-client-7.5.0.jar:7.5.0]
Here is what we have already implemented, along with our current setup:
- The total number of shards is less than 400 per node, with a 32GB heap on the data nodes (less than the recommended number of 600)
- Enabled slow logs to identify the problem query, but didn't see any activity around the time of the outage
- Cleared the fielddata cache a couple of weeks ago, roughly as in the sketch after this list (not sure if this should be a recurring activity as part of operations)
- Reduced the index size from 300GB to 200GB (so, when the indices move to the warm nodes, each will be 1 primary shard + 1 replica shard)
- The client node has 16GB assigned on a server with 48GB of memory (Kibana runs on the same ES client node)
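For reference, this is roughly how we check shard allocation and fielddata usage, clear the fielddata cache, and set the slow-log thresholds, using the Python client. The host and index names are placeholders, not our real ones:

```python
# Minimal sketch with the elasticsearch-py client (7.x); host and index
# names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-client01:9200"])  # placeholder client node

# Shards and disk usage per node (to confirm the <400 shards/node figure).
print(es.cat.allocation(v=True))

# How much heap each node's fielddata cache is currently holding.
print(es.cat.fielddata(v=True))

# The one-off fielddata cache clear mentioned above.
es.indices.clear_cache(fielddata=True)

# Search slow-log thresholds set on a placeholder index to try to catch
# the problem query.
es.indices.put_settings(
    index="my-index",
    body={
        "index.search.slowlog.threshold.query.warn": "10s",
        "index.search.slowlog.threshold.fetch.warn": "1s",
    },
)
```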
I would really like to understand if there is any other way of identifying the problem child so we can address the underlying issue. Please let me know if you have any ideas.
One recent change we made was the addition of a third data center, and we do perform cross-cluster searches on our key indices from the primary Kibana instance. Not sure if that is something we should be looking at optimizing; an example of what those searches look like is below.
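For context, the cross-cluster searches are roughly of this shape (the remote cluster alias and index name below are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-client01:9200"])  # placeholder client node

# List the remote clusters configured for cross-cluster search.
print(es.cluster.remote_info())

# A cross-cluster search: "dc3" is a placeholder remote-cluster alias and
# "key-index" a placeholder index; the query fans out to the remote cluster
# as well as the local one.
resp = es.search(
    index="dc3:key-index,key-index",
    body={"query": {"match_all": {}}, "size": 10},
)
print(resp["hits"]["total"])
```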
We are running Elasticsearch v7.5.0 on RHEL.
Thanks!
VK