Failing cluster

dusatvoj · August 25, 2021, 10:34am

Hello,
I have a problem with my Elasticsearch cluster (ES 7.14.0).

I have a virtualized cluster with the following setup:

elk-node-1 (master, small ssd storage, hot data, ingester, ...)
elk-node-2 (warm data, cold data)
NumberOfReplicas: 0
Data: logs from infrastructure

It started out working smoothly, so I didn't pay much attention to it, but then problems suddenly appeared. I don't know when, because I had no monitoring in the past, just internal metrics going back 7 days.

The problem is that my cluster is quite often in a red state (sometimes the interval between failures is shorter than a day). It looks like one of the nodes gets disconnected from the cluster (as in the picture below), but both nodes are connected within one virtualization platform in one rack.

I can see logs on the master like ...
ILM policy started -> completed
... and a lot of threshold warnings like

[2021-08-25T11:01:46,746][WARN ][o.e.t.OutboundHandler    ] [elk-node-1] sending transport message [Request{indices:data/read/field_caps[index][s]}{10754477}{false}{false}{false}] of size [169391] on [Netty4TcpChannel{localAddress=/10.31.0.18:51802, remoteAddress=10.31.0.19/10.31.0.19:9300, profile=default}] took [51256ms] which is above the warn threshold of [5000ms] with success [true]

and then the cluster goes red with this reason:

[2021-08-25T11:02:25,255][INFO ][o.e.c.r.a.AllocationService] [elk-node-1] Cluster health status changed from [GREEN] to [RED] (reason: [{elk-node-2}{XYZ}{XYZ}{10.31.0.19}{10.31.0.19:9300}{csw} reason: followers check retry count exceeded]).

but then the node immediately starts rejoining (sometimes the rejoin fails and it starts over).

On the slave I can see some logs only after the failure (not before, I don't know why).

After that everything is OK. I'm not sure whether it's caused by heap usage, because some ILM tasks complete without any issue while others push the cluster to RED.

When this happens, I can see much heavier GC than before, and heap usage goes down.

Thanks for any suggestions

Christian_Dahlqvist · August 25, 2021, 5:05pm

Why do you have over 2600 shards for less than 100GB of data? Shards are not free so I would recommend you look to reduce the shard count significantly as oversharding can cause performance as well as stability problems.
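To put numbers on this: less than 100GB spread over more than 2600 shards works out to under about 40MB per shard on average. A minimal Python sketch for getting the same figures from your own cluster is below. It assumes you have already fetched the output of `GET _cat/shards?format=json&bytes=b` and parsed it into a list of dicts; the function name and the sample rows are hypothetical and not taken from this cluster.

```python
from collections import Counter


def summarize_shards(rows):
    """Summarize rows shaped like `GET _cat/shards?format=json&bytes=b` output."""
    per_index = Counter()
    total_bytes = 0
    for row in rows:
        per_index[row["index"]] += 1
        # Unassigned shards report no store size, so treat a missing value as 0.
        total_bytes += int(row.get("store") or 0)
    total_shards = sum(per_index.values())
    avg_bytes = total_bytes / total_shards if total_shards else 0.0
    return {
        "total_shards": total_shards,
        "total_bytes": total_bytes,
        "avg_shard_bytes": avg_bytes,
        "shards_per_index": dict(per_index),
    }


# Hypothetical sample rows, for illustration only.
sample = [
    {"index": "app-a-2021.08.24", "shard": "0", "prirep": "p", "store": "1048576"},
    {"index": "app-a-2021.08.25", "shard": "0", "prirep": "p", "store": "3145728"},
    {"index": "app-b-2021.08.25", "shard": "0", "prirep": "p", "store": None},
]
summary = summarize_shards(sample)
print(summary["total_shards"], summary["avg_shard_bytes"])
```

A very small average shard size combined with a very high shard count is the pattern to look for.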

dusatvoj · August 26, 2021, 12:41pm

Because I have something like 20 different apps, each with a daily index and a long retention period.
I can try to merge them, but I'm not sure how (I will try).

Christian_Dahlqvist · August 26, 2021, 1:41pm

I would recommend merging different indices into fewer larger ones and switching from daily indices to weekly or monthly (depending on how long your retention period is).
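One possible way to plan such a merge is to group the existing daily indices by month and build one `_reindex` request body per monthly target. The Python sketch below only builds the request bodies and sends nothing to the cluster. The `<prefix>-YYYY.MM.DD` naming scheme, the index names, and the function name are assumptions for illustration, so adjust the pattern to your actual index names.

```python
import re
from collections import defaultdict

# Assumed daily naming scheme: <prefix>-YYYY.MM.DD (hypothetical; adjust to your indices).
DAILY = re.compile(r"^(?P<prefix>.+)-(?P<year>\d{4})\.(?P<month>\d{2})\.\d{2}$")


def plan_monthly_reindex(index_names):
    """Group daily indices by month and build one _reindex body per monthly target."""
    groups = defaultdict(list)
    for name in sorted(index_names):
        m = DAILY.match(name)
        if m is None:
            continue  # skip indices that don't match the assumed pattern
        target = f"{m['prefix']}-{m['year']}.{m['month']}"
        groups[target].append(name)
    return [
        {"source": {"index": sources}, "dest": {"index": target}}
        for target, sources in sorted(groups.items())
    ]


# Hypothetical index names, for illustration only.
daily = ["nginx-2021.08.24", "nginx-2021.08.25", "nginx-2021.09.01", "syslog-2021.08.25"]
plans = plan_monthly_reindex(daily)
for body in plans:
    print(body)
```

Each body could then be sent with `POST _reindex`, and the daily source indices deleted once the monthly index has been verified. For new data, the index naming or rollover settings in the ingest pipeline would need to change as well, otherwise daily indices keep being created.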

