Failing cluster

Hello,
I have a problem with my Elasticsearch cluster (ES 7.14.0).

I have a virtualized cluster set up like this:

  • elk-node-1 (master, small ssd storage, hot data, ingester, ...)
  • elk-node-2 (warm data, cold data)
  • NumberOfReplicas: 0
  • Data are logs from infra

It started out working smoothly, so I didn't pay much attention to it, but then problems suddenly started appearing. I don't know when, because I had no monitoring in the past, just internal metrics going back 7 days.

The problem is that my cluster goes into red state quite often (sometimes less than a day apart). It looks like one of the nodes gets disconnected from the cluster (as in the picture below), but both nodes are connected to each other on the same virtualization platform in one rack.

On the master I can see logs like ...
ILM policy started -> completed
... and a lot of threshold warnings like

[2021-08-25T11:01:46,746][WARN ][o.e.t.OutboundHandler    ] [elk-node-1] sending transport message [Request{indices:data/read/field_caps[index][s]}{10754477}{false}{false}{false}] of size [169391] on [Netty4TcpChannel{localAddress=/10.31.0.18:51802, remoteAddress=10.31.0.19/10.31.0.19:9300, profile=default}] took [51256ms] which is above the warn threshold of [5000ms] with success [true]

and then it fails to red with the reason:

[2021-08-25T11:02:25,255][INFO ][o.e.c.r.a.AllocationService] [elk-node-1] Cluster health status changed from [GREEN] to [RED] (reason: [{elk-node-2}{XYZ}{XYZ}{10.31.0.19}{10.31.0.19:9300}{csw} reason: followers check retry count exceeded]).

but then the node immediately starts rejoining (sometimes this fails and it starts rejoining again).


On the other node (elk-node-2) I can see some logs after the failure (not before, I don't know why).

After that, everything is OK. I'm not sure if it's caused by heap usage, because some ILM tasks succeed without any issue like this, while others send the cluster to RED.

When this happens, I can see much heavier GC than before, and heap usage goes down.

Thanks for any suggestions

Why do you have over 2600 shards for less than 100GB of data? Shards are not free so I would recommend you look to reduce the shard count significantly as oversharding can cause performance as well as stability problems.
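To put a number on that, here is a rough back-of-the-envelope sketch using the figures from this thread (about 2600 shards, under 100GB of data). The function name is made up for illustration. For comparison, Elastic's sizing guidance generally talks about shards in the tens of GB.

```python
def average_shard_size_mb(total_data_gb: float, shard_count: int) -> float:
    """Return the mean shard size in MB (1 GB = 1024 MB)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    return total_data_gb * 1024 / shard_count


# Figures from this thread: roughly 2600 shards holding under 100 GB,
# so this is an upper bound on the average shard size.
size = average_shard_size_mb(100, 2600)
print(f"At most about {size:.1f} MB per shard on average")
```

Every one of those small shards still carries its own overhead in heap and cluster state, which is why the count matters more than the total data volume here.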


Because I have something like 20 different apps, each with a daily index and a long retention period.
I can try to merge them, but I'm not sure how (I will try to do it).


I would recommend merging different indices into fewer larger ones and switching from daily indices to weekly or monthly (depending on how long your retention period is).
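One possible way to approach the merge is sketched below. It assumes your daily indices are named like `<app>-YYYY.MM.DD`, which is only a guess at your naming scheme, and the helper names are hypothetical. It groups daily index names by month and builds one request body per monthly target in the shape the `_reindex` API expects (`source.index` accepts a list of indices). It only builds the bodies and does not talk to the cluster.

```python
import re
from collections import defaultdict

# Assumed naming convention (hypothetical): <app>-YYYY.MM.DD
DAILY = re.compile(r"^(?P<app>.+)-(?P<year>\d{4})\.(?P<month>\d{2})\.(?P<day>\d{2})$")


def monthly_name(daily_index: str) -> str:
    """Map e.g. 'nginx-2021.08.25' to 'nginx-2021.08'."""
    m = DAILY.match(daily_index)
    if m is None:
        raise ValueError(f"not a daily index name: {daily_index!r}")
    return f"{m['app']}-{m['year']}.{m['month']}"


def reindex_bodies(daily_indices):
    """Build one _reindex request body per monthly destination index."""
    groups = defaultdict(list)
    for name in daily_indices:
        groups[monthly_name(name)].append(name)
    return [
        {"source": {"index": sorted(sources)}, "dest": {"index": dest}}
        for dest, sources in sorted(groups.items())
    ]


if __name__ == "__main__":
    example = ["nginx-2021.08.24", "nginx-2021.08.25", "my-app-2021.07.31"]
    for body in reindex_bodies(example):
        print(body)
```

Each body would be sent as `POST _reindex`. Delete the old daily indices only after you have verified the document counts in the new monthly index. For new data, it is simpler to change the index naming or the ILM rollover so that fewer, larger indices are created in the first place.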