Hello,
I have a problem with my Elasticsearch cluster (ES 7.14.0).
I have a virtualized cluster with a setup like this (rough node-role sketch below the list):
- elk-node-1 (master, small SSD storage, hot data, ingest, ...)
- elk-node-2 (warm data, cold data)
- number_of_replicas: 0
- the data are infrastructure logs
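For context, the node roles are set roughly like this in elasticsearch.yml (I'm writing this from memory, so the exact role lists are an assumption on my part):

# elk-node-1
node.roles: [ master, ingest, data_hot, data_content ]

# elk-node-2
node.roles: [ data_warm, data_cold ]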
It started out working smoothly, so I didn't pay much attention to it, but at some point problems appeared. I don't know exactly when, because I had no monitoring in the past, just internal metrics going 7 days back.
The problem is that my cluster is quite often in a red state (sometimes the period between incidents is shorter than a day). It looks like one of the nodes gets disconnected from the cluster (like in the picture below), but both nodes run on the same virtualization platform in the same rack.
On the master I can see logs like:
ILM policy started -> completed
... and a lot of threshold warnings like:
[2021-08-25T11:01:46,746][WARN ][o.e.t.OutboundHandler ] [elk-node-1] sending transport message [Request{indices:data/read/field_caps[index][s]}{10754477}{false}{false}{false}] of size [169391] on [Netty4TcpChannel{localAddress=/10.31.0.18:51802, remoteAddress=10.31.0.19/10.31.0.19:9300, profile=default}] took [51256ms] which is above the warn threshold of [5000ms] with success [true]
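When I catch these slow-transport warnings live, I run something like this to see what the nodes are busy with (I'm not sure these are even the right things to be looking at):

# check what is keeping the nodes busy and whether work is piling up
GET _nodes/hot_threads?threads=5
GET _cluster/pending_tasks
GET _cat/thread_pool?v&h=node_name,name,active,queue,rejected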
and then the cluster goes red with this reason:
[2021-08-25T11:02:25,255][INFO ][o.e.c.r.a.AllocationService] [elk-node-1] Cluster health status changed from [GREEN] to [RED] (reason: [{elk-node-2}{XYZ}{XYZ}{10.31.0.19}{10.31.0.19:9300}{csw} reason: followers check retry count exceeded]).
but then the node immediately starts rejoining (sometimes the rejoin fails and it starts all over again).
On the second node (elk-node-2) I can see some logs only after the failure, not before; I don't know why.
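Because the reported reason is "followers check retry count exceeded", one idea I had (not applied yet, and it probably just hides the underlying slowness rather than fixing it) is to relax the fault-detection settings in elasticsearch.yml, e.g.:

# defaults are a 10s timeout and 3 retries; these higher values are just a guess
cluster.fault_detection.follower_check.timeout: 30s
cluster.fault_detection.follower_check.retry_count: 6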
Once the node rejoins, everything is OK again. I'm not sure whether this is caused by heap usage, because some ILM tasks complete without any issue like this, while others push the cluster to RED.
When this happens, I can see much heavier GC activity than before, and afterwards heap usage goes down.
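For the heap side I've only been checking something like this periodically, since I don't have proper monitoring history to compare against:

# quick look at heap and memory pressure per node
GET _cat/nodes?v&h=name,heap.percent,heap.current,heap.max,ram.percent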
Thanks for any suggestions