Can't access our cluster through Grafana

mertkaant · December 12, 2021, 3:43pm

Hello, dear community,

We recently upgraded our ELK stack from 7.7.0 to 7.15.2.
Here are some details about our setup.
This is the health of one of our clusters:

{
  "cluster_name" : "xxxxxxxxxxx",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 9,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 2484,
  "active_shards" : 4968,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
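For context, here is a rough sketch of what those numbers imply per data node. The field values are copied from the health output above. The function name `shard_stats` is just illustrative, and the per-node average assumes shards are spread evenly across the data nodes, which I have not verified.

```python
# Values copied from the cluster health output above.
health = {
    "number_of_data_nodes": 6,
    "active_primary_shards": 2484,
    "active_shards": 4968,
}


def shard_stats(health):
    """Derive simple ratios from a cluster health response (illustrative helper)."""
    primaries = health["active_primary_shards"]
    total = health["active_shards"]
    data_nodes = health["number_of_data_nodes"]
    return {
        # Total copies per primary shard (the primary itself plus its replicas).
        "copies_per_primary": total / primaries,
        # Average shard count per data node, assuming an even distribution.
        "avg_shards_per_data_node": total / data_nodes,
    }


stats = shard_stats(health)
print(stats)
```

With these values, each primary has one replica and each data node holds several hundred shards on average.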

We have 5 different regions, and each one has 3 master nodes and 6 data nodes. All nodes run the same version (7.15.2). We also have an RTP server that we use as a remote cluster.

Sometimes our Hot-2 data node starts misbehaving (Elasticsearch is still active and running) and logs warnings like these:

[2021-12-12T10:17:33,931][WARN ][o.e.t.OutboundHandler    ] [cul-elk-hot-2] sending transport message [Request{internal:admin/tasks/ban}{150643550}{false}{false}{false}] of size [347] on [Netty4TcpChannel{localAddress=/cul-elk-hot-2:40978, remoteAddress=cul-elk-warm-2/cul-elk-warm-2:9300, profile=default}] took [50940ms] which is above the warn threshold of [5000ms] with success [true]
[2021-12-12T10:17:33,931][WARN ][o.e.t.OutboundHandler    ] [cul-elk-hot-2] sending transport message [Request{indices:data/read/search[can_match]}{150643706}{false}{false}{false}] of size [75762] on [Netty4TcpChannel{localAddress=/cul-elk-hot-2:40978, remoteAddress=cul-elk-warm-2/cul-elk-warm-2:9300, profile=default}] took [46738ms] which is above the warn threshold of [5000ms] with success [true]
[2021-12-12T10:17:33,931][WARN ][o.e.t.OutboundHandler    ] [cul-elk-hot-2] sending transport message [Request{indices:data/read/search[can_match]}{150643963}{false}{false}{false}] of size [75764] on [Netty4TcpChannel{localAddress=/cul-elk-hot-2:40978, remoteAddress=cul-elk-cold-2/cul-elk-cold-2:9300, profile=default}] took [46738ms] which is above the warn threshold of [5000ms] with success [true]
[2021-12-12T10:17:33,931][WARN ][o.e.t.OutboundHandler    ] [cul-elk-hot-2] sending transport message [Request{indices:data/read/search[can_match]}{150643969}{false}{false}{false}] of size [75764] on [Netty4TcpChannel{localAddress=/cul-elk-hot-2:40978, remoteAddress=cul-elk-hot-1/cul-elk-hot-1:9300, profile=default}] took [46738ms] which is above the warn threshold of [5000ms] with success [true]
[2021-12-12T10:17:33,931][WARN ][o.e.t.OutboundHandler    ] [cul-elk-hot-2] sending transport message [Request{indices:data/read/search[can_match]}{150643975}{false}{false}{false}] of size [75764] on [Netty4TcpChannel{localAddress=/cul-elk-hot-2:40978, remoteAddress=cul-elk-cold-1/cul-elk-cold-1:9300, profile=default}] took [46738ms] which is above the warn threshold of [5000ms] with success [true]
[2021-12-12T10:17:33,931][WARN ][o.e.t.OutboundHandler    ] [cul-elk-hot-2] sending transport message [Request{indices:data/read/search[can_match]}{150643987}{false}{false}{false}] of size [75764] on [Netty4TcpChannel{localAddress=/cul-elk-hot-2:40978, remoteAddress=cul-elk-warm-1/cul-elk-warm-1:9300, profile=default}] took [46738ms] which is above the warn threshold of [5000ms] with success [true]
[2021-12-12T10:17:33,932][WARN ][o.e.t.OutboundHandler    ] [cul-elk-hot-2] sending transport message [Request{indices:data/read/search[can_match]}{150643999}{false}{false}{false}] of size [75764] on [Netty4TcpChannel{localAddress=/cul-elk-hot-2:40978, remoteAddress=cul-elk-warm-2/cul-elk-warm-2:9300, profile=default}] took [46738ms] which is above the warn threshold of [5000ms] with success [true]
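Since there are too many of these lines to read by hand, a minimal sketch like the following could group them by remote node and transport action. The regular expression is written only against the line format shown in the excerpt above, and the names `SLOW_SEND` and `summarize_slow_sends` are hypothetical, so treat it as a starting point rather than a general log parser.

```python
import re
from collections import defaultdict

# Matches the OutboundHandler slow-send warnings in the format shown above.
SLOW_SEND = re.compile(
    r"sending transport message \[Request\{(?P<action>.+?)\}\{\d+\}"
    r".*?remoteAddress=(?P<remote>[^/,]+)/"
    r".*?took \[(?P<took_ms>\d+)ms\]"
)


def summarize_slow_sends(lines):
    """Return {(remote_node, action): {"count": n, "max_ms": m}} for matching lines."""
    summary = defaultdict(lambda: {"count": 0, "max_ms": 0})
    for line in lines:
        match = SLOW_SEND.search(line)
        if match is None:
            continue  # skip anything that is not a slow-send warning
        entry = summary[(match["remote"], match["action"])]
        entry["count"] += 1
        entry["max_ms"] = max(entry["max_ms"], int(match["took_ms"]))
    return dict(summary)
```

Passing an open log file to `summarize_slow_sends` (it accepts any iterable of lines) would show whether the slow sends concentrate on one remote node or one action, or are spread across all of them as in the excerpt.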

By the way, we use the Hot-2 node as the Elasticsearch data source in Grafana.

When this happens, we can't access the cluster from the Grafana data source section. I checked the cluster health, the nodes, and the shards, and everything looks okay: the cluster is green, there are no unassigned shards, and every node is active. But we still can't reach the cluster from Grafana, apparently because the Hot-2 data node keeps logging a lot of the warnings above.

We don't see these warnings in other regions. This region has the most traffic, and it is the only one where they appear. I shared only a few of the warnings here, but there are many more, maybe more than 500 lines.

I have researched this a lot but couldn't find anything. Please help us out. This is production, and any help we can get is really appreciated.

system · January 9, 2022, 3:43pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.